Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • ATTENTION: bowtie2 and multiple hits

    We just tested Bowtie2 for reads mapping multiple times in a genome:

    1 . We took one read with multiple (perfect) mapping loci in the genome (hg19):

    FASTQ entry:

    Code:
    @test_sequence
    CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    UCSC - BLAT:

    Code:
    browser details YourSeq          101     1   101   101 100.0%     2   -  114359280 114359380    101
    browser details YourSeq          101     1   101   101 100.0%    15   -  102519435 102519535    101
    browser details YourSeq          101     1   101   101 100.0%     1   +      11635     11735    101
    2 . We created a fastq file containing 1,000 times this entry.

    3 . We started bowtie2 with the following command:

    Code:
    bowtie2 -x hg19 -U test.fq -S test.sam
    4 . bowtie2 claims in its manual:

    By default, bowtie2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it continues looking for alignments that are nearly as good or better. The best alignment found is reported (randomly selected from among best if tied).
    5 . BUT:

    => For all reads bowtie2 reports the same location (chr1:11635-11735) in the SAM output

    => Using default parameters, bowtie2 does NOT select the reported mapping randomly

    ------------------------------------------------

    Why is that a problem?

    Two examples:

    Assume you analyze a small RNA-seq experiment to quantify microRNAs. Now you have an example like hsa-let-7a. The 5' arm of this microRNA (hsa-let-7a-5p) occurs three times in the human genome, while the 3' arm (hsa-let-7a-3p) is unqiue. When you now use bowtie2 for mapping (without knowing the effect shown above), you might think, that your hsa-let-7a-5p reads are randomly distributed over these three loci, normalizing the "expression" in a way. But that is not the case. Instead one of the three let-7 microRNAs gets all reads and you think it is highly expressed. But is it?

    The same problem occurs when doing transcriptome sequencing. It might happen, that all the reads of a gene having pseudogene copy in the genome, map to the latter, resulting in no expression at all. Is it not expressed?

    TIP: You can and should force bowtie2 to report all multiple loci (-k or -a parameter). Do not use the bowtie2 default parameter for NGS read mapping!


    Also follow the disccusion here
    Thanks to Christian Otto (transcriptome bioinformatics group - University Leipzig)!
    ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

  • #2
    TopHat 2 runs bowtie2. Does it run bowtie2 w/ -k or -a parameter?

    Comment


    • #3
      Following your tip I ran a PE RNA-seq alignment with these options.
      bowtie2 --no-contain -a --no-mixed

      Then fed SAM output file to htseq-count (w/ -s no option as it's a non-stranded library) and got ~ 6x more counts than there are PE reads to begin with. Apparently, htseq-count tallied every alignment as unique, not ambiguous, which makes me think that the SAM output from bowtie2 does not flag multiple alignments properly or htseq-count does not process them ?

      Comment


      • #4
        Originally posted by aloraine View Post
        TopHat 2 runs bowtie2. Does it run bowtie2 w/ -k or -a parameter?
        There are also several other tools that map the reads with bowtie1/2. It would be nice to check they use the -k paramter and if not, any bias/error is added to their results.

        To name some of these tools:
        ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

        Comment


        • #5
          As of yesterday night a new version of Bowtie 2 has been released (Bowtie 2 version 2.0.2) which adds an option --non-deterministic:

          Code:
          *    Added --non-deterministic option, which better matches how some users
               expect the pseudo-random generator inside Bowtie 2 to work.  Normally, if
               you give the same read (same name, sequence and qualities) and --seed, you
               get the same answer.  --non-deterministic breaks that guarantee.  This can
               be more appropriate for datasets where the input contains many identical
               reads (same name, same sequence, same qualities).
          Does this option address the multiple mapping issue reported by original poster?

          Comment


          • #6
            That is a very good point.

            I actually like this statement:
            ...which better matches how some users expect the pseudo-random generator...
            How do the bowtie developer expect a random generator to work?


            But, in my opinion, the problem still exists.
            • When I run bowtie2 with default parameters, I still get one non-random hit for all reads.
            • All tools that use bowtie2 (e.g. tophat2) for mapping, do not use --non-deterministic, or -a/-k parameter.

            So, downstream analysis will still be affected.

            Why not set it as default parameter? Time complexity?
            ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

            Comment


            • #7
              We now checked the behavior of the new version of bowtie2 (v.2.0.2) with our testset from above (3 equivalent mapping loci):

              1 .
              (1) no --non-deterministic option + (2) same name for all reads:
              100% of all reads map to one locus.

              2 .
              (1) no --non-deterministic option + (2) different name for all reads:
              99% of all reads map to one locus.
              1% of the reads map to on of the other loci.

              3 .
              (1) --non-deterministic option + (2) same name for all reads:
              65% of the reads map to loci1
              19% of the reads map to loci2
              16% of the reads map to loci3

              4 .
              (1) --non-deterministic option + (2) different name for all reads:
              same result as in 3.
              ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

              Comment


              • #8
                Originally posted by ecSeq Bioinformatics View Post
                We now checked the behavior of the new version of bowtie2 (v.2.0.2) with our testset from above (3 equivalent mapping loci):

                1 .
                (1) no --non-deterministic option + (2) same name for all reads:
                100% of all reads map to one locus.

                2 .
                (1) no --non-deterministic option + (2) different name for all reads:
                99% of all reads map to one locus.
                1% of the reads map to on of the other loci.

                3 .
                (1) --non-deterministic option + (2) same name for all reads:
                65% of the reads map to loci1
                19% of the reads map to loci2
                16% of the reads map to loci3

                4 .
                (1) --non-deterministic option + (2) different name for all reads:
                same result as in 3.
                From bowtie2's manual:

                The pseudo-random number generator is re-initialized for every read, and the seed used to initialize it is a function of the read name, nucleotide string, quality string, and the value specified with --seed. If you run the same version of Bowtie 2 on two reads with identical names, nucleotide strings, and quality strings, and if --seed is set the same for both runs, Bowtie 2 will produce the same output; i.e., it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases. Most users expect Bowtie to produce the same output when run twice on the same input.
                So the result for 1. is expected. It's also desirable because otherwise alignment runs can't be parallelized. But I agree that --non-deterministic doesn't fix anything because as 2. shows, different names still get biased hits. Deterministic should still be random, which is what the manual also claims. Even --non-deterministic doesn't give random hits, as 3. and 4. show..

                I actually got the same results as 3. and 4. without using --non-deterministic by completely randomizing the read names, so even random read names still give bias. It seems the more similar the read names, the worse the bias.

                Comment


                • #9
                  ... it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases. Most users expect Bowtie to produce the same output when run twice on the same input.
                  I think that might be intuitive, but by far not random.

                  I think this discussion is really fruitful, since it is really important to understand what results are produced by widely used mapping tools. Sometimes developer assume that some behavior is "intuitive and desirable", but the users actually expect a completely different one, ending up with a strong bias.


                  To sum it up:
                  I suggest, when using bowtie2, users should set the -a parameter to assure, that they will not get a bias. Or know and accept the bias... of course!
                  ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

                  Comment


                  • #10
                    I agree that knowing what the tools are really doing is important. Ideally some communication from the developers would help. There doesn't seem to be a mailing list for bowtie though. Are they aware of the problem?

                    What do you do downstream if you use the -a parameter? Randomly choose one of the reported hits manually? If each identical loci got 1 reported hit for a matching read, then that itself is also a bias (compared to a read that only has one loci).

                    bwa actually reports all hits by default. But the nice thing is the 'primary' hit reported for a SAM entry is actually random (so about 33/33/33 for the experiment done in OP) while still been deterministic even if the read names are identical. Then you can just limit reported hits to 1 to get the behaviour that bowtie2 is supposed to have by default. Unfortunetely, bwa has its own issues...

                    Comment


                    • #11
                      Are they aware of the problem?
                      I reported the bug at http://sourceforge.net/tracker/?func...7&atid=1101606



                      What do you do downstream if you use the -a parameter?
                      In my opinion, you can do 3 different things for downstream analysis:
                      1. Discard all reads with multiple hits
                        For a DE analysis it should be ok, since you delete them for all runs. But you will underestimate, or even completely miss the expression of all genes with multiple copies.
                      2. Take all hits
                        For a DE analysis it should be ok, since you do the same for all runs and you won't miss genes with multiple copies. But you will overestimate the expression of all genes with multiple copies.
                      3. Take all hits, but divide the "expression impact" of the read by the number of its mapping loci
                        I think this is the prefered way, since you do not know where the read actually comes from. You have all mappings, but by dividing the impact, you will "punish" the hit and equally distribute its impact in expression (it's somehow the same result as the random generator, but in my opinion much cleaner). That is actually, how we handle this issue by default. This way, you minimize the error and don't loose any region, where the read might come from. Another benefit of this approach is the possibility that you can always access the mapping loci at a later time and filter out multiple hits, or change the way of how you want to normalize them. Of course, most of the downstream tools are not able to handle this normalization, but we use our own tools for that.



                      For mapping we use segemehl. With this tool, you get all multiple hits by default AND it sets the NH-TAG in the sam file, which gives you the number of multiple hits for the read. That allows you do directly "normalize the expression" of the read.

                      This software does also allow mismatches and indels in the seed, which in my view is very important and still a problem of almost all mapping tools.
                      ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).

                      Comment


                      • #12
                        [*]Take all hits, but divide the "expression impact" of the read by the number of its mapping loci
                        A better but more complex way is
                        1. Ask bowtie to generate all hits.
                        2. Take the hits that are uniquely best, estimate a "hit density" for single site
                        3. Take the multiple hits, divide their "expression impact" according to the "hit density" of its mapping loci (or a flanking window around the loci). For example, one tag which is sequenced 100 times, and it hits 3 loci equally good, and that 3 loci have a hit density of a, b and c according to the unique hits. Then the "expression impact" contributed by this tag would be 100*a/(a+b+c), 100*b/(a+b+c) and 100*c/(a+b+c) respectively.

                        Comment


                        • #13
                          Originally posted by ecSeq Bioinformatics View Post
                          That is a very good point.
                          I actually like this statement:

                          Code:
                          *    Added --non-deterministic option, which better matches how some users
                               expect the pseudo-random generator inside Bowtie 2 to work.
                          How do the bowtie developer expect a random generator to work?
                          As all computationally-based random generators do.

                          I think there might be some confusion about the term "random" here. Computational methods for the generation of "random" numbers do not pick up numbers at random but use algorithms to generate long series of apparently random numbers, which are in fact completely determined by an initial number called "seed". Hence the "pseudo-random". The seed is often hard-coded or built from the input. This is why a given input will always give the same output, which is actually more than expected (reproducibility is crucial for both developpers and users -just consider testing/debugging for instance).

                          What can be done to give a better illusion of randomness is to use a "random" -i.e. variable- seed: the generator then combines some extremely variable and unpredictable data (like the current time, the ID given by the computer to the process or some other dynamic system information) to compute a different seed at each execution, which will produce a different series of numbers each time. I bet that this is exactly what the non-deterministic option is doing.

                          Comment


                          • #14
                            Very usefull discussion !

                            Syfo you are rigth but it is not explaining this difference :

                            (1) no --non-deterministic option + (2) same name for all reads:
                            100% of all reads map to one locus.

                            (1) --non-deterministic option + (2) same name for all reads:
                            65% of the reads map to loci1
                            19% of the reads map to loci2
                            16% of the reads map to loci3
                            knchess don t you think it depend how much you try to scan the full genome ? Is the first locus quicker to access in the genome ?
                            Last edited by raphael123; 02-07-2014, 06:08 PM. Reason: New Idea

                            Comment


                            • #15
                              Has there been any update to this issue?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X