SEQanswers

11-11-2009, 12:03 AM   #1
nilshomer (Nils Homer) | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

BFAST published

The paper describing the algorithm behind BFAST has now been published.

Homer N, Merriman B, Nelson SF.
BFAST: An alignment tool for large scale genome resequencing.
PMID: TBD
PLoS ONE. 2009 4(11): e7767.
http://dx.doi.org/10.1371/journal.pone.0007767

Background
The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, result in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.

Methodology
We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
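[Editor's note: for readers unfamiliar with the final step described above, a gapped Smith-Waterman local alignment can be sketched in a few lines. This is a generic textbook version scoring only (not BFAST's actual implementation, which also uses base qualities); the scoring parameters are arbitrary.]

```python
def smith_waterman(ref, read, match=2, mismatch=-1, gap=-2):
    """Score-only Smith-Waterman local alignment with a linear gap penalty."""
    # H[i][j] = best score of a local alignment ending at read[:i] / ref[:j]
    H = [[0] * (len(ref) + 1) for _ in range(len(read) + 1)]
    best = 0
    for i in range(1, len(read) + 1):
        for j in range(1, len(ref) + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] + gap,     # gap in reference (insertion in read)
                          H[i][j - 1] + gap)     # gap in read (deletion in read)
            best = max(best, H[i][j])
    return best

# A read spanning a one-base deletion still aligns well:
print(smith_waterman("ACGTTACGT", "ACGTACGT"))  # 14 = 8 matches * 2 - one gap * 2
```

The gap terms in the recurrence are what let the aligner detect the small indels mentioned in the abstract; without them a deleted base would force mismatches or truncate the local alignment.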

Conclusions
We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
11-11-2009, 12:04 AM   #2
ECO (--Site Admin--) | Location: SF Bay Area, CA, USA | Join Date: Oct 2007 | Posts: 1,352

Congrats Nils.
11-11-2009, 08:05 AM   #3
dawe (Senior Member) | Location: 45°30'25.22"N / 9°15'53.00"E | Join Date: Apr 2009 | Posts: 258

Congratulations!
I've downloaded it and, if it runs as stated in the paper, I'm ready to use it, especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
SAM output is a great feature, I'm running almost aligner-independent now :-)
11-11-2009, 08:55 AM   #4
lh3 (Senior Member) | Location: Boston | Join Date: Feb 2008 | Posts: 693

I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.

In SNP calling given short reads, alignment global to reads has some advantage. Suppose the 3rd base in a read is a true mutation. Local alignment will clip the read and as a result you see more reads having the reference allele than having the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.
11-11-2009, 12:26 PM   #5
nilshomer (Nils Homer) | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

Quote:
Originally Posted by lh3 View Post
I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.
What panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. For false-mapping, with no errors I don't see a >0% false mapping rate in Figure 3.

Quote:
In SNP calling given short reads, alignment global to reads has some advantage. Suppose the 3rd base in a read is a true mutation. Local alignment will clip the read and as a result you see more reads having the reference allele than having the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.
We don't clip the reads. The "local alignment" we use is a global alignment to a window around the candidate location. In other words, we enforce that the entire read must align to some subset of bases in the reference.
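[Editor's note: the distinction here can be illustrated with a "fitting" alignment, which is global with respect to the read but local with respect to a reference window, so no read bases can be clipped away. This is a generic sketch of the idea, not BFAST's code; the function name and scoring parameters are illustrative only.]

```python
def fit_read(window, read, match=1, mismatch=-1, gap=-2):
    """Global-in-read, local-in-window ("fitting") alignment score."""
    # H[i][j] = best score with read[:i] consumed, ending at window position j.
    # Row 0 is all zeros: the alignment may *start* anywhere in the window.
    H = [[0] * (len(window) + 1) for _ in range(len(read) + 1)]
    for i in range(1, len(read) + 1):
        H[i][0] = H[i - 1][0] + gap  # read bases can never be skipped for free
        for j in range(1, len(window) + 1):
            s = match if read[i - 1] == window[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    # The alignment may *end* anywhere in the window, but every read base is used.
    return max(H[len(read)])

print(fit_read("TTACGTTT", "ACGT"))  # 4: the whole read matches inside the window
```

Because every read base must be consumed, a true mutation near the read's end is scored as a mismatch rather than clipped off, which avoids the reference bias lh3 describes above.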
11-11-2009, 12:30 PM   #6
nilshomer (Nils Homer) | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

Quote:
Originally Posted by dawe View Post
Congratulations!
I've downloaded it and if it runs as stated in the paper I'm ready to use it especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
SAM output is a great feature, I'm running almost aligner-independent now :-)
@dawe
The current version allows the user to get rid of the memory problem at the cost of some speed. You use the "-d" option to split the index during index creation. This allows you to run BFAST in 4GB environments etc.

@dawe and @lh3
The assumption that the error rate is below 6% is not valid for ABI SOLiD data. With BFAST's ability to be very sensitive, we regularly observe runs with >10% (color/raw) error rates towards the ends of reads, even though after alignment the base error rate is well below 1% (kudos to 2-base encoding).
11-11-2009, 12:33 PM   #7
lh3 (Senior Member) | Location: Boston | Join Date: Feb 2008 | Posts: 693

Quote:
Originally Posted by nilshomer View Post
What panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. For false-mapping, with no errors I don't see a >0% false mapping rate in Figure 3.
Unless sensitivity is too bad, I do not care too much about false negatives. I am looking at false positives (2nd column). At 0-mismatch/indel, both bwa and bowtie have ~6% error rate while others have 0%.

Quote:
We don't clip the reads. The "local alignment" we use is a global alignment to a window around the candidate location. In other words, we enforce that the entire read must align to some subset of bases in the reference.
This is reassuring. Thanks for the explanation.
11-20-2009, 02:09 AM   #8
NGSfan (Senior Member) | Location: Austria | Join Date: Apr 2009 | Posts: 181

Congratulations Nils Homer!

I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

My only question regarding BFAST's performance is whether you have recently compared it to SSAHA2? I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

Thanks for releasing your software. I will be trying it now.
11-20-2009, 03:43 AM   #9
Michael.James.Clark (Senior Member) | Location: Palo Alto | Join Date: Apr 2009 | Posts: 213

Congratulations, Nils!

Quote:
Originally Posted by lh3 View Post
I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.
It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero-two nucleotide errors.

What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.

Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it.
11-20-2009, 04:59 AM   #10
lh3 (Senior Member) | Location: Boston | Join Date: Feb 2008 | Posts: 693

Quote:
Originally Posted by Michael.James.Clark View Post
It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero-two nucleotide errors.
Both bowtie and bwa are able to find all the occurrences of perfect matches, and they always know whether a perfect alignment is an exact repeat or not. They should not produce false positives in theory. In practice, I have evaluated them and they do not, at least not at a rate as high as 6%. In addition, I am curious why in 3A the aligners do not do equally well for perfect hits. Again, maq and soap guarantee to find all perfect hits.

Quote:
What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.
I completely agree that Bfast is a great tool for hard reads. The performance of the other aligners looks reasonable, which implies bfast is very capable. I just think bowtie and bwa should be presented in the right way to avoid confusion.

Last edited by lh3; 11-20-2009 at 05:02 AM.
11-20-2009, 11:40 AM   #11
nilshomer (Nils Homer) | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

Quote:
Originally Posted by NGSfan View Post
Congratulations Nils Homer!

I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

My only question regarding BFAST's performance is whether you have recently compared it to SSAHA2? I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

Thanks for releasing your software. I will be trying it now.
I have not tested SSAHA2 lately, although I believe SSAHA2 is built for longer reads. Also note that since the release of the paper, BFAST's speed has improved by an order of magnitude (submitted as another paper).
11-20-2009, 11:42 AM   #12
nilshomer (Nils Homer) | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

Quote:
Originally Posted by Michael.James.Clark View Post
Congratulations, Nils!



It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero-two nucleotide errors.

What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.

Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it.
At zero errors and no variants, Heng Li is saying that there should be zero false positives, although there could be false-negatives.
11-20-2009, 02:28 PM   #13
HTS (Member) | Location: Toronto | Join Date: Nov 2009 | Posts: 24

Performing true local alignment with BFAST?

Hi Nils,

It is nice to read your paper, and I hope that my question is not completely out of the context (admittedly I haven't got time to dig into the details of your implementation yet). Basically I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper-limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first). An example is that my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene with their own header, and for a given read, I want to find out how many transcripts this read is mapped to. I can do this straightforwardly with BLAT (by parsing the PSL file since I have gene/transcript information in the header) but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information as opposed to BLAT). If yes, probably I can finally stop using the old favorite BLAT as well. Please feel free to give me any suggestions/comments (also include all people who read this). Thanks a lot!

-- Leo
11-20-2009, 02:55 PM   #14
nilshomer (Nils Homer) | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

Quote:
Originally Posted by HTS View Post
Hi Nils,

It is nice to read your paper, and I hope that my question is not completely out of the context (admittedly I haven't got time to dig into the details of your implementation yet). Basically I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper-limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first). An example is that my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene with their own header, and for a given read, I want to find out how many transcripts this read is mapped to. I can do this straightforwardly with BLAT (by parsing the PSL file since I have gene/transcript information in the header) but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information as opposed to BLAT). If yes, probably I can finally stop using the old favorite BLAT as well. Please feel free to give me any suggestions/comments (also include all people who read this). Thanks a lot!

-- Leo
Absolutely! A similar example: suppose you have prior evidence of RNA splicing between two novel exons for which you only know the approximate boundaries (say, within 100bp). You can create all possible joins of the two 100bp windows and then align your reads to the artificial reference to see which join is most likely.
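[Editor's note: the join-enumeration step described above could be sketched as follows. The function name `candidate_junctions`, the `join_i_j` naming scheme, and the `flank` parameter are all hypothetical illustrations, not part of BFAST.]

```python
def candidate_junctions(window_a, window_b, flank=20):
    """Build artificial junction references from two candidate exon windows."""
    # Join every possible donor cut point in window A to every possible
    # acceptor cut point in window B; keep up to `flank` bases on each side
    # so that a short read can span the artificial junction.
    joins = {}
    for i in range(1, len(window_a) + 1):     # junction after window_a[:i]
        for j in range(len(window_b)):        # junction resumes at window_b[j:]
            joins[f"join_{i}_{j}"] = (window_a[max(0, i - flank):i]
                                      + window_b[j:j + flank])
    return joins

refs = candidate_junctions("ACGT", "TTGG", flank=2)
print(len(refs))         # 16 candidate junction references (4 x 4 cut points)
print(refs["join_2_1"])  # ACTG: two bases from the A side + two from the B side
```

Each entry would then be written as a FASTA record and used as the artificial reference to align against; the junction whose reference attracts the best-scoring reads is the most likely splice.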
11-20-2009, 03:12 PM   #15
HTS (Member) | Location: Toronto | Join Date: Nov 2009 | Posts: 24

Quote:
Originally Posted by nilshomer View Post
Absolutely! A similar example is suppose you have prior evidence of RNA splicing between two novel exons in which you only know the approximate boundaries (say within 100bp). You can create all possible joins of the two 100bp windows and then align your reads to the artificial reference to see which join is the most likely.
Thanks a lot for your quick reply, Nils! Could you give me a bit more detail about how to do this? I remember in your paper you said that when the best alignment of a read maps to more than one location on the reference equally well (in my example that would be more than one cDNA sequence, which usually corresponds to the same location on the underlying genome), then you would say the read is not mapped, but I would like BFAST to output all the mappings regardless. I haven't downloaded your tool yet but will do so as soon as I finish the work at hand!
11-20-2009, 03:19 PM   #16
nilshomer (Nils Homer) | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

Quote:
Originally Posted by HTS View Post
Thanks a lot for your quick reply, Nils! Could you give me a bit more detail about how to do this? I remember in your paper you said that when the best alignment of a read maps to more than one location on the reference equally well (in my example that would be more than one cDNA sequence, which usually corresponds to the same location on the underlying genome), then you would say the read is not mapped, but I would like BFAST to output all the mappings regardless. I haven't downloaded your tool yet but will do so as soon as I finish the work at hand!
By default, BFAST outputs all mappings for a read regardless of their quality/score. It is only in the postprocessing step that we use various criteria (specified by the user) to choose the "best" alignment. One of those criteria is to output all alignments, which is what you want.
11-20-2009, 03:23 PM   #17
HTS (Member) | Location: Toronto | Join Date: Nov 2009 | Posts: 24

Quote:
Originally Posted by nilshomer View Post
By default, BFAST outputs all mappings for a read regardless of their quality/score. It is only in the postprocessing step that we use various criteria (specified by the user) to choose the "best" alignment. One of those criteria is to output all alignments, which is what you want.
Great, thanks a lot! I can't wait to try things out!

Tags
alignment tool, bfast, bioinformatics, mapping, sequence alignment
