SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
bowtie2 --all not reporting multiple hits rflrob Bioinformatics 0 10-17-2012 04:00 PM
Bowtie2 and multiple repetitive hits henrik_licht RNA Sequencing 0 05-22-2012 03:25 AM
BWA Multiple Hits EthoStrain Bioinformatics 4 02-20-2012 07:02 AM
Multiple hits from BLAT? thsuk1 Bioinformatics 2 09-21-2010 12:13 AM
Maq multiple hits file Lesley Bioinformatics 1 12-09-2009 03:12 AM

Reply
 
Thread Tools
Old 10-19-2012, 05:46 AM   #1
ecSeq Bioinformatics
Senior Member
 
Location: Leipzig, Germany

Join Date: May 2012
Posts: 268
Default ATTENTION: bowtie2 and multiple hits

We just tested Bowtie2 for reads mapping multiple times in a genome:

1 . We took one read with multiple (perfect) mapping loci in the genome (hg19):

FASTQ entry:

Code:
@test_sequence
CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
UCSC - BLAT:

Code:
browser details YourSeq          101     1   101   101 100.0%     2   -  114359280 114359380    101
browser details YourSeq          101     1   101   101 100.0%    15   -  102519435 102519535    101
browser details YourSeq          101     1   101   101 100.0%     1   +      11635     11735    101
2 . We created a fastq file containing 1,000 times this entry.

3 . We started bowtie2 with the following command:

Code:
bowtie2 -x hg19 -U test.fq -S test.sam
4 . bowtie2 claims in its manual:

Quote:
By default, bowtie2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it continues looking for alignments that are nearly as good or better. The best alignment found is reported (randomly selected from among best if tied).
5 . BUT:

=> For all reads bowtie2 reports the same location (chr1:11635-11735) in the SAM output

=> Using default parameters, bowtie2 does NOT select the reported mapping randomly

------------------------------------------------

Why is that a problem?

Two examples:

Assume you analyze a small RNA-seq experiment to quantify microRNAs. Now you have an example like hsa-let-7a. The 5' arm of this microRNA (hsa-let-7a-5p) occurs three times in the human genome, while the 3' arm (hsa-let-7a-3p) is unqiue. When you now use bowtie2 for mapping (without knowing the effect shown above), you might think, that your hsa-let-7a-5p reads are randomly distributed over these three loci, normalizing the "expression" in a way. But that is not the case. Instead one of the three let-7 microRNAs gets all reads and you think it is highly expressed. But is it?

The same problem occurs when doing transcriptome sequencing. It might happen, that all the reads of a gene having pseudogene copy in the genome, map to the latter, resulting in no expression at all. Is it not expressed?

TIP: You can and should force bowtie2 to report all multiple loci (-k or -a parameter). Do not use the bowtie2 default parameter for NGS read mapping!


Also follow the disccusion here
Thanks to Christian Otto (transcriptome bioinformatics group - University Leipzig)!
__________________
ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).
ecSeq Bioinformatics is offline   Reply With Quote
Old 10-19-2012, 10:16 AM   #2
aloraine
Junior Member
 
Location: North Carolina, USA

Join Date: May 2010
Posts: 5
Default

TopHat 2 runs bowtie2. Does it run bowtie2 w/ -k or -a parameter?
aloraine is offline   Reply With Quote
Old 10-19-2012, 11:54 AM   #3
ugolino
Member
 
Location: maryland, usa

Join Date: Oct 2011
Posts: 14
Default

Following your tip I ran a PE RNA-seq alignment with these options.
bowtie2 --no-contain -a --no-mixed

Then fed SAM output file to htseq-count (w/ -s no option as it's a non-stranded library) and got ~ 6x more counts than there are PE reads to begin with. Apparently, htseq-count tallied every alignment as unique, not ambiguous, which makes me think that the SAM output from bowtie2 does not flag multiple alignments properly or htseq-count does not process them ?
ugolino is offline   Reply With Quote
Old 10-22-2012, 10:10 AM   #4
ecSeq Bioinformatics
Senior Member
 
Location: Leipzig, Germany

Join Date: May 2012
Posts: 268
Default

Quote:
Originally Posted by aloraine View Post
TopHat 2 runs bowtie2. Does it run bowtie2 w/ -k or -a parameter?
There are also several other tools that map the reads with bowtie1/2. It would be nice to check they use the -k paramter and if not, any bias/error is added to their results.

To name some of these tools:
__________________
ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).
ecSeq Bioinformatics is offline   Reply With Quote
Old 11-01-2012, 12:54 AM   #5
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

As of yesterday night a new version of Bowtie 2 has been released (Bowtie 2 version 2.0.2) which adds an option --non-deterministic:

Code:
*    Added --non-deterministic option, which better matches how some users
     expect the pseudo-random generator inside Bowtie 2 to work.  Normally, if
     you give the same read (same name, sequence and qualities) and --seed, you
     get the same answer.  --non-deterministic breaks that guarantee.  This can
     be more appropriate for datasets where the input contains many identical
     reads (same name, same sequence, same qualities).
Does this option address the multiple mapping issue reported by original poster?
fkrueger is offline   Reply With Quote
Old 11-01-2012, 05:42 AM   #6
ecSeq Bioinformatics
Senior Member
 
Location: Leipzig, Germany

Join Date: May 2012
Posts: 268
Default

That is a very good point.

I actually like this statement:
Quote:
...which better matches how some users expect the pseudo-random generator...
How do the bowtie developer expect a random generator to work?


But, in my opinion, the problem still exists.
  • When I run bowtie2 with default parameters, I still get one non-random hit for all reads.
  • All tools that use bowtie2 (e.g. tophat2) for mapping, do not use --non-deterministic, or -a/-k parameter.
So, downstream analysis will still be affected.

Why not set it as default parameter? Time complexity?
__________________
ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).
ecSeq Bioinformatics is offline   Reply With Quote
Old 11-02-2012, 06:34 AM   #7
ecSeq Bioinformatics
Senior Member
 
Location: Leipzig, Germany

Join Date: May 2012
Posts: 268
Default

We now checked the behavior of the new version of bowtie2 (v.2.0.2) with our testset from above (3 equivalent mapping loci):

1 .
(1) no --non-deterministic option + (2) same name for all reads:
100% of all reads map to one locus.

2 .
(1) no --non-deterministic option + (2) different name for all reads:
99% of all reads map to one locus.
1% of the reads map to on of the other loci.

3 .
(1) --non-deterministic option + (2) same name for all reads:
65% of the reads map to loci1
19% of the reads map to loci2
16% of the reads map to loci3

4 .
(1) --non-deterministic option + (2) different name for all reads:
same result as in 3.
__________________
ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).
ecSeq Bioinformatics is offline   Reply With Quote
Old 11-06-2012, 09:44 AM   #8
knchess
Junior Member
 
Location: Toronto

Join Date: Nov 2012
Posts: 4
Default

Quote:
Originally Posted by ecSeq Bioinformatics View Post
We now checked the behavior of the new version of bowtie2 (v.2.0.2) with our testset from above (3 equivalent mapping loci):

1 .
(1) no --non-deterministic option + (2) same name for all reads:
100% of all reads map to one locus.

2 .
(1) no --non-deterministic option + (2) different name for all reads:
99% of all reads map to one locus.
1% of the reads map to on of the other loci.

3 .
(1) --non-deterministic option + (2) same name for all reads:
65% of the reads map to loci1
19% of the reads map to loci2
16% of the reads map to loci3

4 .
(1) --non-deterministic option + (2) different name for all reads:
same result as in 3.
From bowtie2's manual:

Quote:
The pseudo-random number generator is re-initialized for every read, and the seed used to initialize it is a function of the read name, nucleotide string, quality string, and the value specified with --seed. If you run the same version of Bowtie 2 on two reads with identical names, nucleotide strings, and quality strings, and if --seed is set the same for both runs, Bowtie 2 will produce the same output; i.e., it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases. Most users expect Bowtie to produce the same output when run twice on the same input.
So the result for 1. is expected. It's also desirable because otherwise alignment runs can't be parallelized. But I agree that --non-deterministic doesn't fix anything because as 2. shows, different names still get biased hits. Deterministic should still be random, which is what the manual also claims. Even --non-deterministic doesn't give random hits, as 3. and 4. show..

I actually got the same results as 3. and 4. without using --non-deterministic by completely randomizing the read names, so even random read names still give bias. It seems the more similar the read names, the worse the bias.
knchess is offline   Reply With Quote
Old 11-07-2012, 05:52 AM   #9
ecSeq Bioinformatics
Senior Member
 
Location: Leipzig, Germany

Join Date: May 2012
Posts: 268
Default

Quote:
... it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases. Most users expect Bowtie to produce the same output when run twice on the same input.
I think that might be intuitive, but by far not random.

I think this discussion is really fruitful, since it is really important to understand what results are produced by widely used mapping tools. Sometimes developer assume that some behavior is "intuitive and desirable", but the users actually expect a completely different one, ending up with a strong bias.


To sum it up:
I suggest, when using bowtie2, users should set the -a parameter to assure, that they will not get a bias. Or know and accept the bias... of course!
__________________
ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).
ecSeq Bioinformatics is offline   Reply With Quote
Old 11-07-2012, 10:10 AM   #10
knchess
Junior Member
 
Location: Toronto

Join Date: Nov 2012
Posts: 4
Default

I agree that knowing what the tools are really doing is important. Ideally some communication from the developers would help. There doesn't seem to be a mailing list for bowtie though. Are they aware of the problem?

What do you do downstream if you use the -a parameter? Randomly choose one of the reported hits manually? If each identical loci got 1 reported hit for a matching read, then that itself is also a bias (compared to a read that only has one loci).

bwa actually reports all hits by default. But the nice thing is the 'primary' hit reported for a SAM entry is actually random (so about 33/33/33 for the experiment done in OP) while still been deterministic even if the read names are identical. Then you can just limit reported hits to 1 to get the behaviour that bowtie2 is supposed to have by default. Unfortunetely, bwa has its own issues...
knchess is offline   Reply With Quote
Old 11-07-2012, 11:07 AM   #11
ecSeq Bioinformatics
Senior Member
 
Location: Leipzig, Germany

Join Date: May 2012
Posts: 268
Default

Quote:
Are they aware of the problem?
I reported the bug at http://sourceforge.net/tracker/?func...7&atid=1101606



Quote:
What do you do downstream if you use the -a parameter?
In my opinion, you can do 3 different things for downstream analysis:
  1. Discard all reads with multiple hits
    For a DE analysis it should be ok, since you delete them for all runs. But you will underestimate, or even completely miss the expression of all genes with multiple copies.
  2. Take all hits
    For a DE analysis it should be ok, since you do the same for all runs and you won't miss genes with multiple copies. But you will overestimate the expression of all genes with multiple copies.
  3. Take all hits, but divide the "expression impact" of the read by the number of its mapping loci
    I think this is the prefered way, since you do not know where the read actually comes from. You have all mappings, but by dividing the impact, you will "punish" the hit and equally distribute its impact in expression (it's somehow the same result as the random generator, but in my opinion much cleaner). That is actually, how we handle this issue by default. This way, you minimize the error and don't loose any region, where the read might come from. Another benefit of this approach is the possibility that you can always access the mapping loci at a later time and filter out multiple hits, or change the way of how you want to normalize them. Of course, most of the downstream tools are not able to handle this normalization, but we use our own tools for that.


For mapping we use segemehl. With this tool, you get all multiple hits by default AND it sets the NH-TAG in the sam file, which gives you the number of multiple hits for the read. That allows you do directly "normalize the expression" of the read.

This software does also allow mismatches and indels in the seed, which in my view is very important and still a problem of almost all mapping tools.
__________________
ecSeq Bioinformatics is Europe’s leading provider of hands-on bioinformatics workshops and professional data analysis in the field of Next-Generation Sequencing (NGS).
ecSeq Bioinformatics is offline   Reply With Quote
Old 01-09-2013, 11:36 AM   #12
yangjr
Junior Member
 
Location: Ann Arbor

Join Date: Apr 2010
Posts: 6
Default

Quote:
[*]Take all hits, but divide the "expression impact" of the read by the number of its mapping loci
A better but more complex way is
1. Ask bowtie to generate all hits.
2. Take the hits that are uniquely best, estimate a "hit density" for single site
3. Take the multiple hits, divide their "expression impact" according to the "hit density" of its mapping loci (or a flanking window around the loci). For example, one tag which is sequenced 100 times, and it hits 3 loci equally good, and that 3 loci have a hit density of a, b and c according to the unique hits. Then the "expression impact" contributed by this tag would be 100*a/(a+b+c), 100*b/(a+b+c) and 100*c/(a+b+c) respectively.
yangjr is offline   Reply With Quote
Old 01-10-2013, 08:24 AM   #13
syfo
Just a member
 
Location: Southern EU

Join Date: Nov 2012
Posts: 103
Default

Quote:
Originally Posted by ecSeq Bioinformatics View Post
That is a very good point.
I actually like this statement:

Code:
*    Added --non-deterministic option, which better matches how some users
     expect the pseudo-random generator inside Bowtie 2 to work.
How do the bowtie developer expect a random generator to work?
As all computationally-based random generators do.

I think there might be some confusion about the term "random" here. Computational methods for the generation of "random" numbers do not pick up numbers at random but use algorithms to generate long series of apparently random numbers, which are in fact completely determined by an initial number called "seed". Hence the "pseudo-random". The seed is often hard-coded or built from the input. This is why a given input will always give the same output, which is actually more than expected (reproducibility is crucial for both developpers and users -just consider testing/debugging for instance).

What can be done to give a better illusion of randomness is to use a "random" -i.e. variable- seed: the generator then combines some extremely variable and unpredictable data (like the current time, the ID given by the computer to the process or some other dynamic system information) to compute a different seed at each execution, which will produce a different series of numbers each time. I bet that this is exactly what the non-deterministic option is doing.
syfo is offline   Reply With Quote
Old 02-07-2014, 05:07 PM   #14
raphael123
Member
 
Location: Mc Gill -- Montreal

Join Date: Dec 2013
Posts: 37
Default

Very usefull discussion !

Syfo you are rigth but it is not explaining this difference :

Quote:
(1) no --non-deterministic option + (2) same name for all reads:
100% of all reads map to one locus.

(1) --non-deterministic option + (2) same name for all reads:
65% of the reads map to loci1
19% of the reads map to loci2
16% of the reads map to loci3
knchess don t you think it depend how much you try to scan the full genome ? Is the first locus quicker to access in the genome ?

Last edited by raphael123; 02-07-2014 at 05:08 PM. Reason: New Idea
raphael123 is offline   Reply With Quote
Old 02-10-2016, 10:54 AM   #15
clintp
Member
 
Location: Georgia

Join Date: Apr 2013
Posts: 19
Default

Has there been any update to this issue?
clintp is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:45 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO