SEQanswers > Bioinformatics
#1 | adaptivegenome (Super Moderator; US) | 11-01-2011, 02:10 PM
Bowtie 2 versus BWA

Has anyone compared the speed of BWA and Bowtie 2? How about the accuracy, for both point mutations and indels?
#2 | nilshomer (Nils Homer; Boston, MA, USA) | 11-01-2011, 03:41 PM

Yes, user lh3 has done some analysis: http://lh3lh3.users.sourceforge.net/alnROC.shtml
#3 | adaptivegenome (Super Moderator; US) | 11-04-2011, 04:32 PM

Thanks Nils. Do you by chance have any knowledge of the speed difference? Are they roughly the same?
#4 | nilshomer (Nils Homer; Boston, MA, USA) | 11-05-2011, 06:53 AM

No, I don't have any idea.
#5 | lh3 (Senior Member; Boston) | 11-05-2011, 07:24 AM

Updated to bowtie2-beta3 and added timing. If you wonder why the sensitivity in the plot differs from that in the bowtie2 poster, that is because: 1) bwa-short is indeed not very sensitive on real single-end data without trimming (bwa-sw is much better); 2) the poster counts all alignments, while I count "unique" alignments only. Bowtie2 can map many reads, but it has difficulty distinguishing good and bad hits and thus gives many good hits low mapping quality. Beta3 is much better than beta2 on this point, but still not perfect.

Basically, bowtie2 chooses a nice balance point: it is the fastest without much loss of accuracy compared to the others. But for variant calling on Illumina data, novoalign/smalt/bwa/gsnap may still be the mappers of choice. Things may change in the future, of course; bowtie2 is still in beta, while bwa and bwa-sw are mature (i.e. not many improvements remain to be made).

Last edited by lh3; 11-05-2011 at 08:14 AM.
#6 | salzberg (Member; Baltimore) | 11-05-2011, 08:22 AM

msg deleted

Last edited by salzberg; 11-05-2011 at 08:29 AM. Reason: posted before completion
#7 | salzberg (Member; Baltimore) | 11-05-2011, 08:28 AM

In fact, we have done extensive comparisons of Bowtie2 versus both BWA and BWA-SW. Across multiple parameter settings for both tools, we found that Bowtie2 is (a) faster and (b) more sensitive than both programs. We tested it on 2,000,000 human reads, paired and unpaired, from an Illumina HiSeq instrument. I would note that the test by user lh3 (Heng Li, the author of BWA) used only simulated reads, and only 200,000 of them. Our tests were larger and more realistic.

We have detailed figures that Ben Langmead just presented at the Genome Informatics conference. I can't post the figures here (they cover dozens of experiments), but I will post a few points showing performance using the default settings of Bowtie2, BWA, and SOAP2:
Aligner   Options            Running time   % reads aligned   Mem (GB)
Bowtie2   --sensitive        11m:17s        96.94%            2.3
BWA       -k 2 -l 32 -o 1    30m:52s        91.80%            2.4
SOAP2     -l 256 -v 5 -g 0   5m:08s         84.43%            5.3


As you can see, Bowtie2 aligned 5% more of the reads than BWA, and was 3 times faster.

We also compared Bowtie2 to BWA-SW on Ion Torrent and 454 reads, which contain many indels. Bowtie2 was superior to BWA-SW on both speed and sensitivity for a wide range of parameter settings of both programs.

We also compared the accuracy of both BWA and Bowtie on human reads in a simulation using 3 million paired and unpaired 75 bp Illumina reads, simulated so we knew the "truth". Note that this is 30 times more data than lh3's simulated results on his website. Our findings were that Bowtie2 aligned approximately 3% more reads correctly from unpaired reads, and approximately 1% more reads correctly from paired reads. This test used default parameters of both programs.

Thus in our tests, Bowtie2 is faster, more sensitive, and more accurate than BWA across a wide range of parameter settings.

Last edited by salzberg; 11-05-2011 at 08:33 AM.
#8 | lh3 (Senior Member; Boston) | 11-05-2011, 08:46 AM

@salzberg

You are still avoiding the question of "unique" alignments. With a seeding strategy like bowtie2's, it is trivial to find a hit. But as I said, a key flaw in bowtie2 (as in bowtie1) is that it sometimes cannot distinguish unique hits from repetitive hits and thus gives low mapping quality to unique hits. It is more sensitive to a hit, but not more sensitive to a unique hit. Also, for 100bp single-end reads, the bowtie2 equivalent is really bwa-sw, not bwa-short; for paired-end reads, bwa-short will gain a lot of sensitivity and be much more accurate. Users like 1000g/sanger/broad also enable trimming on real data, though this seems unfair to bowtie2, and bowtie2 should still outperform in terms of overall sensitivity.

I believe I am usually fair in benchmarks, even those involving my own programs. In my benchmark, bwa/bwa-sw is clearly not the best, and I am not hiding that at all. I am not trying to make bowtie2 look worse.

Perhaps the different result on simulated data is only because the simulation is different. I would love to see a ROC curve, which in my view is the most informative plot of a mapper's overall accuracy (sensitivity vs. specificity). In your post, you discuss only sensitivity, not specificity.

Last edited by lh3; 11-05-2011 at 09:08 AM.
#9 | rskr (Senior Member; Santa Fe, NM) | 11-05-2011, 11:12 AM

Quote (originally posted by lh3):
@salzberg

You still avoid talking about "unique" alignments.
This is an interesting subtlety. In my experience comparing BWA, bowtie1, and GSNAP using BWA's wgsim and wgsim_eval, there was a significant penalty for multi-mapped reads, since each alternative mapping was considered a miss, and BWA has a devious algorithm that cuts multi-mapping off after 11 hits and simply reports the read as too ambiguous. However, when the evaluation code was rewritten to count a multi-mapped read as one miss (rather than multiple misses), BWA was still superior to bowtie1 and GSNAP. GSNAP in particular was bad about reporting multi-mapped reads.

I am not sure that there is any application that requires a very sensitive aligner at the cost of lots of false positives.
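The scoring change described above (counting a multi-mapped read as a single miss) can be sketched as follows. This is a hedged illustration, not wgsim_eval.pl's actual code: the `score_alignments` function, the input format, and the tolerance are all invented for the example, assuming wgsim-style read names that embed the true origin.

```python
# Hedged sketch of the scoring policy described above: a read reported at
# several locations counts as at most one miss. Assumes wgsim-style read
# names ("chrom_start_end_id") encoding the true origin.

def score_alignments(alignments, tolerance=20):
    """alignments: dict of read name -> list of reported (chrom, pos) hits."""
    hits = misses = 0
    for name, placements in alignments.items():
        fields = name.split("_")          # breaks if chrom names contain "_"
        chrom, start = fields[0], int(fields[1])
        correct = any(c == chrom and abs(p - start) <= tolerance
                      for c, p in placements)
        if correct:
            hits += 1
        else:
            misses += 1   # one miss per read, however many wrong placements
    return hits, misses

# Toy example: one correctly placed read, one multi-mapped read whose
# placements are all wrong (still just one miss).
example = {
    "chr1_100_250_0": [("chr1", 101)],
    "chr2_500_650_1": [("chr3", 900), ("chr4", 42)],
}
print(score_alignments(example))  # -> (1, 1)
```

Under the earlier policy, where each wrong alternative counted separately, the second read would have contributed two misses instead of one.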
#10 | salzberg (Member; Baltimore) | 11-05-2011, 01:09 PM

@lh3:
>>I believe I am usually fair in all benchmarks even involving my own programs. In my
>>benchmark, bwa/bwa-sw is clearly not the best and I am not hiding that at all. I am not
>>trying to make bowtie2 worse.

I understand that you believe you were being fair. But a single test using 100,000 error-free reads is rather unrealistic. Our tests on real data showed very different results from yours. Our tests on simulated data (not error-free, though) also showed very different results, so I'm not sure how you measured false positives. Given that there are billions of real reads now available, I think there's no reason not to do tests on real data as well.

The notion of "correct" mapping for multi-reads is a subtle one that many users don't care about: finding exactly the right mapping for a read that maps to 10, 100, or 1000 places doesn't really matter for most applications, even if finding such a mapping is possible. My guess is that, repetitive reads aside, all the aligners generally get the mappings right; the issue is then whether they can find a mapping when the reads have errors and polymorphisms, which is what users do care about.
#11 | rskr (Senior Member; Santa Fe, NM) | 11-05-2011, 03:18 PM

Quote (originally posted by salzberg):
@lh3:
My guess is that other than repetitive reads, all the aligners generally get the mappings right
I disagree. If you look at hash-based aligners, there are certain patterns of indels, mismatches, and errors where they won't find the right result even if it is unique. For example, if the word size is 15 and there are two mismatches 10 bases apart in a 50mer, the hash won't return the region at all. Likewise, for longer reads the number of mismatches is likely to be higher, and the suffix-array search will terminate before finding the ideal match.
#12 | lh3 (Senior Member; Boston) | 11-05-2011, 04:52 PM

I never do simulations with error-free reads. The reads in my simulation contain variants, which is equivalent to a 1% SNP+INDEL error rate. 100k reads are enough for investigating specificity around 0.01%: we still have 100 wrong mappings, so the variance is pretty small. Also, I have run simulations with tens of millions of reads; the relative performance of novoalign, bwa, and bwa-sw always stays the same. I also wanted to use real data, but it is hard to evaluate specificity on real data because there is no ground truth. One viable measurement is described in the bwa-sw paper, but it is quite complicated to apply in practice to multiple mappers.

Nearly all aligners use heuristics. Few of them can guarantee finding the best hit even when the top hit is clearly (i.e. under all sensible scoring schemes) better than the other hits. Here are several examples. In the following table, each line consists of the bowtie2 position, bowtie2 XM:XO:XG, the correct position, bwa samse XM:XO:XG, and bwa-sw AS:XS (these examples also prove that my simulation is not error-free):


bowtie2 pos    bowtie2 XM:XO:XG   correct pos    bwa XM:XO:XG   bwa-sw AS:XS
9:134616048    7:0:0              1:12746267     2:0:0          (bwa-sw wrong)
17:5319362     7:1:1              1:28924148     1:1:1          88:77
X:70135101     7:0:0              1:185975011    2:0:0          76:72
1:153251402    4:1:1              2:116348184    2:1:1          85:77
19:42604275    8:0:0              5:178218515    3:0:0          (bwa-sw wrong)
4:260872       6:1:1              7:129633785    0:1:1          92:76


None of these reads has multiple hits, yet bowtie2 misses the optimal position and chooses a position with more mismatches/gaps. I am not using these examples to argue that bwa is more accurate (I can of course find examples where bowtie2 does a better job than bwa); what I want to argue is that even for "unique" hits, different mappers give different answers. Finding the "unique" hits is a really hard task. We cannot assume all mappers are created with the same specificity. The ROC curve has shown this already.

As to the differences between your evaluation and mine, I think they mainly come from two aspects: 1) for sensitivity, I count only hits with mapping quality greater than 0-3 (depending on the mapper), while you count all hits, including mapQ=0 hits; 2) I evaluate specificity, while all your measurements are essentially sensitivity. Your conclusion is not inconsistent with mine; we just have different focuses. If I followed your philosophy, I am sure I would come to your conclusion with my 100k SE/PE reads/pairs, but I believe specificity, and sensitivity to hits that clearly have optimal positions, are more important for accuracy-critical applications like variant calling and the discovery of structural variations.
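The threshold-based counting described above can be sketched as a small evaluation loop. This is an illustrative sketch only, not code from either benchmark; the `roc_points` function, record format, and thresholds are invented.

```python
# Illustrative sketch: for each minimum mapping quality, report sensitivity
# (correct hits over ALL simulated reads) and the error rate among the hits
# kept, roughly the two axes of an alnROC-style plot.

def roc_points(records, total_reads, thresholds=(0, 1, 10, 20, 30)):
    """records: (mapq, mapped_correctly) for every mapped read."""
    points = []
    for t in thresholds:
        kept = [ok for q, ok in records if q >= t]
        if not kept:
            continue
        correct = sum(kept)
        points.append((t,
                       correct / total_reads,     # sensitivity
                       1 - correct / len(kept)))  # fraction of kept hits wrong
    return points

# 5 mapped reads out of 6 simulated; the mapQ=0 and mapQ=5 hits are wrong.
recs = [(37, True), (37, True), (0, False), (23, True), (5, False)]
for t, sens, err in roc_points(recs, total_reads=6):
    print(f"mapQ>={t}: sensitivity={sens:.3f}, error={err:.3f}")
```

Raising the threshold trades sensitivity for a lower error rate among reported hits, which is exactly the trade-off the mapQ-aware counting is meant to expose.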

EDIT: genericforms reminds me that there is still the question of how much accuracy is enough. I do not know the definitive answer. It is possible that the difference between two mappers is so subtle that we would not observe differences in SNP/INDEL calls from real data, though my very limited experience seems to suggest the contrary. I could be wrong on that point.

Last edited by lh3; 11-06-2011 at 07:59 AM.
#13 | adaptivegenome (Super Moderator; US) | 11-05-2011, 05:10 PM

We are going to try a comparison as well. When we compare mappers on the basis of proper read placement, we plot TP/FP, and we do this for different MapQs.

I agree with Heng Li that users will be interested in recall rates for point mutations as well as for indels of different sizes. So we will explicitly examine this too. I will let you guys know what we find.

Last edited by adaptivegenome; 11-05-2011 at 05:16 PM. Reason: typo
#14 | rskr (Senior Member; Santa Fe, NM) | 11-05-2011, 07:12 PM

Quote (originally posted by lh3):
EDIT: genericforms reminds me that there is still the question of how much accuracy is enough. I do not know the definitive answer. It is possible that the difference between two mappers is so subtle that we would not observe differences in SNP/INDEL calls from real data, though my very limited experience seems to suggest the contrary. I could be wrong on that point.
A bit of a naive question aside: in the context of a scientific project, mapping is merely a single step, and errors compound over multiple steps, from DNA/RNA collection to library prep to sequencing, base calling, mapping, SNP calling, and so on. It is important that every step be as accurate as possible so as not to impose limitations on subsequent experiments, computations, analyses, and interpretations. Unfortunately, it appears that mapping accuracy is far behind the state of the art in sequencing technology.
#15 | adaptivegenome (Super Moderator; US) | 11-06-2011, 05:03 PM

So Bowtie2 is definitely faster, and we are able to reproduce the sensitivity gain; however, once you account for false positives, BWA clearly wins out. We simulated reads from the fly genome (120MB) at 15X coverage with 100bp reads, with a 0.1% mutation rate of which 10% were indels; the indels ranged from 1 to 10 bases.

So salzberg, what would be helpful is if you could try to reproduce what we have done with your 2 million human reads. Tell me if you find a similar result.
#16 | salzberg (Member; Baltimore) | 11-07-2011, 09:56 AM

@lh3 (Heng Li): you wrote above, "I never do simulation with error free reads." Yet you wrote on your webpage that you "simulate error free reads from the diploid genome." That is why I pointed out that you used error-free reads: you said so yourself.

@genericforms: you assert without proof that BWA "clearly wins out" if you account for false positives. Our results contradict this. We simulated both sequencing error (using the ART simulator v1.1.5) and variation between individuals, using 3 million paired-end reads. Bowtie2 assigned more reads to their true point of origin than BWA.

We have submitted our results in a paper which is in the peer review process right now. I encourage both of you to do the same. Un-refereed claims on this forum are little more than anecdotes (which is true of my comments too, of course, so I won't be posting any more).

Meanwhile I encourage everyone to try Bowtie2, which in our experiments has demonstrated unparalleled speed, sensitivity, and accuracy.
#17 | adaptivegenome (Super Moderator; US) | 11-07-2011, 10:08 AM

Salzberg,

Please look at my post: I asked you to confirm our results in your simulation studies. I understand that you simulated 3 million reads; we simulated around 22 million. This alone could explain the difference.

Rather than being hostile, try to see if you can reproduce our results. I would be interested in resolving why our results differ. I posted the parameters we tried; give it a try and let me know what you find.

Also, I think SEQanswers is a great place to post these results, because here they can be verified and vetted by the entire community, not just a couple of reviewers. And instead of waiting months for your paper, we can all work together on this problem today. So I really disagree that SEQanswers is an inappropriate place to discuss this work.

Having said that, please do post your paper when it becomes available, as I am interested to see what you report.
#18 | lh3 (Senior Member; Boston) | 11-07-2011, 12:15 PM

Steven, the sentence following "error free" explains it: "Although reads are error free, many reads cannot be perfectly mapped to the reference genome due to the presence of variations." (This sentence has been on that webpage since its very first version.)

Perhaps you are still conflating overall sensitivity with sensitivity to unique hits and with specificity. It is probably my fault for not explaining clearly. As many others are also reading this thread, I will try to do better. I will compare only bwa-sw and bwa-short, to avoid sensitive issues.

I have known for a long time that on single-end 100bp real data, bwa-sw almost always correctly maps more reads than bwa-short. However, because bwa-sw does not have sufficient power to distinguish a good hit from a bad one, it has to assign low mapping quality to a lot of perfectly "unique" hits to avoid producing too many high-quality false alignments. The effect is that if we run a SNP caller, we sometimes call more correct SNPs from the bwa-short alignment than from bwa-sw, although bwa-sw maps many more reads. To this end, sensitivity is only meaningful to real applications when the mapper has the ability to disambiguate good and bad hits. Bwa-sw is much more sensitive than bwa-short overall, but not always more sensitive for real applications (EDIT: bwa-sw may have better specificity for 100bp SE data, though).

For variant calling, sensitivity is actually not the major concern. We have already dropped several percent of reads in repetitive regions and filtered tens of percent of reads with the Illumina pipeline; it does not hurt too much if we have a marginally higher false-negative rate. Sensitivity is even less of a concern with deep sequencing, because coverage compensates for alignments missed due to excessive sequencing errors. In contrast, specificity is much more important, especially given that mapping errors tend to be recurrent: if we wrongly map one read, we are likely to wrongly map other reads in the same region affected by the same true variants. Sequencing coverage alone may not do much to correct wrong variant calls caused by mapping errors. To me it is critical to evaluate specificity, which you have not talked about much in your posts. Note that to evaluate specificity, we have to count the fraction of mapped reads that are misplaced. The overall number of correctly mapped reads has little to do with specificity: if a mapper maps more correct reads but also many more wrong reads, it is still a mapper with low specificity. Take bwa-sw and bwa-short as an example again. If reads have low-quality tails, bwa-sw can even map more paired-end reads than bwa-short, but I know for sure that bwa-short will greatly outperform bwa-sw in terms of specificity, because bwa-sw does not use the pairing information to correct wrong alignments while bwa-short does.

Again, revisiting the whole thread, I think we are just focusing on different measurements. We are both correct on the measurements we are interested in. Genericforms actually confirms both of us.

IMHO, being peer-reviewed does not always mean being more correct. If I really wanted to write a paper on this evaluation, I am sure that with my track record I could get it published, but this would not make me more correct than you or others. My previous evaluations on maq/bwa/bwa-sw were all flawed in hindsight (I thought each was the best possible at the time of writing, but I was wrong), yet they were all accepted. My review on alignment algorithms uses a similar ROC plot, and it was peer-reviewed and published, too.

Actually, 1000g followed a similar procedure to evaluate read mappers about two years ago. I was not involved except for suggesting measurements (the simulation, evaluation, and program running were all done by others). In some ways this is better than peer review, in that the measurement was reviewed by many more people. Also, in my benchmark the whole procedure is open sourced and every command line is given; everyone can check for themselves whether I am biased, wrong, or lying. Many published papers are not reproducible at this level.

Given that I think you are correct on the measurements you are using, I will stop posting too. This discussion has been very helpful to me. Thank you.

Last edited by lh3; 11-07-2011 at 01:17 PM. Reason: Correct grammatical errors; mention illumina pipeline
#19 | maubp (Peter, Biopython etc; Dundee, Scotland, UK) | 11-07-2011, 12:19 PM

Quote (originally posted by salzberg):
We have submitted our results in a paper which is in the peer review process right now. I encourage both of you to do the same. Un-refereed claims on this forum are little more than anecdotes (which is true of my comments too, of course, so I won't be posting any more).
You don't see this kind of online discussion as part of the future of peer review, then?
#20 | salzberg (Member; Baltimore) | 11-07-2011, 01:36 PM

Hi Heng,
I appreciate your clarifications, which are helpful.

I do want to mention that you are using "specificity" where I am pretty sure you mean "precision". (This is a widespread problem in the field, but I'm trying to correct it where I can.) E.g., you wrote: "If a mapper maps more correct reads but also much more wrong reads, it is still a mapper with low specificity." The definition of specificity is:

    specificity = TN / (TN + FP)

A "true negative" in the short-read alignment world is not very well defined, but we could define it as not aligning a read that doesn't belong to the genome at all. In any case, that's not what you mean.

Precision is defined as TP/(TP+FP). So I think you mean "precision" in what you are describing.
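To make the distinction concrete, here is a minimal sketch of the two formulas above; the function names and counts are invented for illustration.

```python
# Minimal illustration of the two definitions discussed above.

def specificity(tn, fp):
    """TN / (TN + FP): requires a notion of 'true negatives'."""
    return tn / (tn + fp)

def precision(tp, fp):
    """TP / (TP + FP): fraction of reported alignments that are correct."""
    return tp / (tp + fp)

# A mapper that places 90 reads correctly and 10 incorrectly:
print(precision(tp=90, fp=10))    # -> 0.9
# Specificity needs true negatives, e.g. correctly rejected reads that
# belong to no genome; with 50 such rejections and the same 10 false hits:
print(specificity(tn=50, fp=10))  # about 0.833
```

Note that precision is computable from a mapper's output plus ground truth alone, whereas specificity requires defining negatives, which is why the term fits read mapping poorly.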

We know that Bowtie2 is not perfect (far from it!), but we think it is a substantial improvement over Bowtie1. Ben Langmead has already made some changes (just this past week) to improve Bowtie2's accuracy. We'll keep at it.