SEQanswers (http://seqanswers.com/forums/index.php)
Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
BFAST published (http://seqanswers.com/forums/showthread.php?t=3088)

nilshomer 11-11-2009 12:03 AM

BFAST published
 
The paper describing the algorithm behind BFAST has now been published.

Homer N, Merriman B, Nelson SF.
BFAST: An alignment tool for large scale genome resequencing.
PMID: TBD
PLoS ONE. 2009;4(11):e7767.
http://dx.doi.org/10.1371/journal.pone.0007767

Background
The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation.

Methodology
We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired-end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole-genome indexes to rapidly map reads to candidate alignment locations, with arbitrarily many independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
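To make the indexing idea concrete, here is a minimal Python sketch of spaced-seed ("masked") indexing; the masks, names, and toy genome are illustrative assumptions, not BFAST's actual data structures or parameters:

    from collections import defaultdict

    # Each index hashes only the '1' positions of a window (a spaced seed),
    # so reads that mismatch at the '0' positions still hash to the same key.
    def build_index(genome, mask):
        index = defaultdict(list)
        for pos in range(len(genome) - len(mask) + 1):
            window = genome[pos:pos + len(mask)]
            key = "".join(b for b, m in zip(window, mask) if m == "1")
            index[key].append(pos)  # candidate alignment locations (CALs)
        return index

    # Taking the union of candidates over several independent indexes gives
    # robustness: an error that breaks one seed can still hit the others.
    def candidate_locations(read, indexes):
        cals = set()
        for mask, index in indexes:
            key = "".join(b for b, m in zip(read[:len(mask)], mask) if m == "1")
            cals.update(index.get(key, []))
        return sorted(cals)

    genome = "ACGTACGTTTACGGACGT"
    masks = ["111111", "110111"]  # toy masks, not BFAST's published masks
    indexes = [(m, build_index(genome, m)) for m in masks]
    print(candidate_locations("ACGGAC", indexes))  # -> [10]
    print(candidate_locations("ACAGAC", indexes))  # error at a masked-out base -> still [10]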

Conclusions
We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.

ECO 11-11-2009 12:04 AM

Congrats Nils. :D

dawe 11-11-2009 08:05 AM

Congratulations!
I've downloaded it, and if it runs as stated in the paper I'm ready to use it, especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
SAM output is a great feature; I'm almost aligner-independent now :-)

lh3 11-11-2009 08:55 AM

I have not read the details, but something is a little odd in Figure 3. How can bowtie and bwa produce wrong alignments when a read can be mapped perfectly? For perfectly aligned reads, both should always give correct results. Also note that a 6% error rate is really a lot; if an aligner were that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades at higher error/mutation rates.

For SNP calling with short reads, alignment that is global with respect to the read has some advantages. Suppose the 3rd base of a read is a true mutation. A local alignment will clip the read, and as a result you see more reads carrying the reference allele than the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.
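To see the arithmetic behind this clipping effect, here is a toy calculation; the match/mismatch scores are illustrative assumptions, not any particular aligner's defaults:

    # Why a local aligner prefers to clip a true variant near a read end.
    MATCH, MISMATCH = 1, -3   # illustrative scores, not a real aligner's
    read_len = 30
    snp_pos = 3               # true SNP at the 3rd base of the read

    # Glocal (global-to-the-read) alignment must keep the mismatch:
    full = (read_len - 1) * MATCH + MISMATCH       # 29 - 3 = 26

    # A local aligner can clip the first 3 bases and avoid it:
    clipped = (read_len - snp_pos) * MATCH         # 27

    print(full, clipped)  # 26 27: clipping wins, so the variant base is
    # dropped from variant-carrying reads while reference-allele reads
    # align full length -- the reference bias described above.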

nilshomer 11-11-2009 12:26 PM

Quote:

Originally Posted by lh3 (Post 10319)
I have not read the details, but something is a little odd in Figure 3. How can bowtie and bwa produce wrong alignments when a read can be mapped perfectly? For perfectly aligned reads, both should always give correct results. Also note that a 6% error rate is really a lot; if an aligner were that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades at higher error/mutation rates.

Which panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. As for false mapping, I don't see a >0% false-mapping rate at zero errors in Figure 3.

Quote:

For SNP calling with short reads, alignment that is global with respect to the read has some advantages. Suppose the 3rd base of a read is a true mutation. A local alignment will clip the read, and as a result you see more reads carrying the reference allele than the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.
We don't clip the reads. The "local alignment" we use is a global alignment of the read to a window around the candidate location. In other words, we enforce that the entire read must align to some stretch of bases in the reference.
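For readers wondering what "global alignment of the read to a window" looks like, here is a minimal dynamic-programming sketch of such a "fit" alignment; the scores and names are illustrative assumptions, not BFAST's actual parameters or code:

    def fit_score(read, window, match=1, mismatch=-3, gap=-4):
        """Score of aligning ALL of `read` to some stretch of `window`:
        start and end are free within the window, but the read is never
        clipped. Scores are illustrative."""
        n, m = len(read), len(window)
        NEG = float("-inf")
        dp = [[NEG] * (m + 1) for _ in range(n + 1)]
        dp[0] = [0] * (m + 1)            # free start anywhere in the window
        for i in range(1, n + 1):
            dp[i][0] = dp[i - 1][0] + gap
            for j in range(1, m + 1):
                diag = dp[i - 1][j - 1] + (match if read[i - 1] == window[j - 1] else mismatch)
                dp[i][j] = max(diag,
                               dp[i - 1][j] + gap,   # insertion in the read
                               dp[i][j - 1] + gap)   # deletion from the read
        return max(dp[n])                # free end, but the whole read is used

    print(fit_score("ACGT", "TTACGTTT"))  # 4: perfect fit inside the window
    print(fit_score("AGGT", "TTACGTTT"))  # 0: the mismatch is kept, not clipped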

nilshomer 11-11-2009 12:30 PM

Quote:

Originally Posted by dawe (Post 10317)
Congratulations!
I've downloaded it, and if it runs as stated in the paper I'm ready to use it, especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
SAM output is a great feature; I'm almost aligner-independent now :-)

@dawe
The current version lets the user trade a little speed for a much smaller memory footprint: use the "-d" option to split the index during index creation. This allows you to run BFAST in 4GB environments and the like.

@dawe and @lh3
The assumption that the error rate is below 6% is not valid for ABI SOLiD data. Because BFAST can be run at very high sensitivity, we regularly observe runs with >10% (color/raw) error rates towards the ends of reads, even though after alignment the base error rate is well below 1% (kudos to 2-base encoding).
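As background on why 2-base encoding helps, here is a small sketch of SOLiD-style di-base ("color") encoding: with A, C, G, T coded 0-3, the color between adjacent bases is the XOR of their codes (the sequences below are made-up examples):

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def to_colors(seq):
        # One color per adjacent base pair (di-base encoding).
        return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

    ref = "ACGGTCA"
    snp = "ACGATCA"            # one base changed (the 4th base, G->A)
    print(to_colors(ref))      # [1, 3, 0, 1, 2, 1]
    print(to_colors(snp))      # [1, 3, 2, 3, 2, 1]
    # A true SNP flips exactly two adjacent colors, while a single read
    # error flips exactly one -- which is how color-aware alignment can
    # tell sequencing errors apart from real variants.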

lh3 11-11-2009 12:33 PM

Quote:

Originally Posted by nilshomer (Post 10325)
Which panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. As for false mapping, I don't see a >0% false-mapping rate at zero errors in Figure 3.

Unless the sensitivity is too bad, I do not care too much about false negatives. I am looking at false positives (2nd column). At 0 mismatches/indels, both bwa and bowtie have an error rate of ~6% while the others have 0%.

Quote:

We don't clip the reads. The "local alignment" we use is a global alignment of the read to a window around the candidate location. In other words, we enforce that the entire read must align to some stretch of bases in the reference.
This is reassuring. Thanks for the explanation.

NGSfan 11-20-2009 02:09 AM

Congratulations Nils Homer!

I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

My only question regarding BFAST performance is whether you have recently compared it to SSAHA2. I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now, with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

Thanks for releasing your software. I will be trying it now.

Michael.James.Clark 11-20-2009 03:43 AM

Congratulations, Nils!

Quote:

Originally Posted by lh3 (Post 10319)
I have not read the details, but something is a little odd in Figure 3. How can bowtie and bwa produce wrong alignments when a read can be mapped perfectly? For perfectly aligned reads, both should always give correct results. Also note that a 6% error rate is really a lot; if an aligner were that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades at higher error/mutation rates.

It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (3A is basically gauging true positives and 3B false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all, in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure basically demonstrates to me that all of these aligners do quite well at zero to two nucleotide errors.

What I find more striking in that figure is how they perform with high-error reads and, even more strikingly, with reads containing deletions (even relatively large ones). This is important to me because a lot of studies, particularly whole-genome ones, are searching for novel variants and deletions at relatively low coverage.

Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it. :)

lh3 11-20-2009 04:59 AM

Quote:

Originally Posted by Michael.James.Clark (Post 10671)
It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (3A is basically gauging true positives and 3B false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all, in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure basically demonstrates to me that all of these aligners do quite well at zero to two nucleotide errors.

Both bowtie and bwa are able to find all occurrences of a perfect match, and they always know whether a perfect alignment is an exact repeat or not. They should not produce false positives in theory, and in practice, when I evaluated them, they did not, at least not at a rate as high as 6%. In addition, I am curious why, in 3A, the aligners do not do equally well on perfect hits; again, maq and soap are guaranteed to find all perfect hits.

Quote:

What I find more striking in that figure is how they perform with high-error reads and, even more strikingly, with reads containing deletions (even relatively large ones). This is important to me because a lot of studies, particularly whole-genome ones, are searching for novel variants and deletions at relatively low coverage.
I completely agree that Bfast is a great tool for hard reads. The performance of the other aligners looks reasonable, which implies bfast is very capable. I just think bowtie and bwa should be presented in the right way to avoid confusion.

nilshomer 11-20-2009 11:40 AM

Quote:

Originally Posted by NGSfan (Post 10666)
Congratulations Nils Homer!

I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

My only question regarding BFAST performance is whether you have recently compared it to SSAHA2. I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now, with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

Thanks for releasing your software. I will be trying it now.

I have not tested SSAHA2 lately, although I believe it is built for longer reads. Also note that since the release of the paper, BFAST's speed has improved by an order of magnitude (described in a separate submitted paper).

nilshomer 11-20-2009 11:42 AM

Quote:

Originally Posted by Michael.James.Clark (Post 10671)
Congratulations, Nils!



It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (3A is basically gauging true positives and 3B false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all, in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure basically demonstrates to me that all of these aligners do quite well at zero to two nucleotide errors.

What I find more striking in that figure is how they perform with high-error reads and, even more strikingly, with reads containing deletions (even relatively large ones). This is important to me because a lot of studies, particularly whole-genome ones, are searching for novel variants and deletions at relatively low coverage.

Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it. :)

Heng Li is saying that, at zero errors and with no variants, there should be zero false positives, although there could be false negatives.

HTS 11-20-2009 02:28 PM

Performing true local alignment with BFAST?
 
Hi Nils,

It was nice to read your paper, and I hope that my question is not completely out of context (admittedly, I haven't had time to dig into the details of your implementation yet). Basically, I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first).

An example: my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene, each with its own header, and for a given read I want to find out how many transcripts the read maps to. I can do this straightforwardly with BLAT (by parsing the PSL file, since I have gene/transcript information in the header), but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information, as opposed to BLAT). If so, I can probably finally stop using my old favorite BLAT as well :rolleyes:. Please feel free to give me any suggestions/comments (this goes for everyone reading this, too). Thanks a lot!

-- Leo

nilshomer 11-20-2009 02:55 PM

Quote:

Originally Posted by HTS (Post 10713)
Hi Nils,

It was nice to read your paper, and I hope that my question is not completely out of context (admittedly, I haven't had time to dig into the details of your implementation yet). Basically, I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first).

An example: my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene, each with its own header, and for a given read I want to find out how many transcripts the read maps to. I can do this straightforwardly with BLAT (by parsing the PSL file, since I have gene/transcript information in the header), but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information, as opposed to BLAT). If so, I can probably finally stop using my old favorite BLAT as well :rolleyes:. Please feel free to give me any suggestions/comments (this goes for everyone reading this, too). Thanks a lot!

-- Leo

Absolutely! A similar example: suppose you have prior evidence of RNA splicing between two novel exons for which you only know the approximate boundaries (say, to within 100bp). You can create all possible joins of the two 100bp windows and then align your reads to this artificial reference to see which join is most likely.
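A minimal sketch of that join enumeration; the window names, flank size, and FASTA output are illustrative assumptions, not a BFAST feature:

    # Enumerate every way to leave window1 at i and resume window2 at j,
    # keeping `flank` bases on each side so reads can span the junction.
    def all_joins(window1, window2, flank=30):
        for i in range(1, len(window1) + 1):
            for j in range(len(window2)):
                left = window1[max(0, i - flank):i]
                right = window2[j:j + flank]
                yield i, j, left + right

    # Write the artificial reference as FASTA; the record names encode the
    # join, so the best-scoring alignments identify the likely junction.
    w1, w2 = "ACGT" * 25, "TTGA" * 25   # stand-ins for the two 100bp windows
    with open("junctions.fa", "w") as fa:
        for i, j, seq in all_joins(w1, w2):
            fa.write(f">join_{i}_{j}\n{seq}\n")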

HTS 11-20-2009 03:12 PM

Quote:

Originally Posted by nilshomer (Post 10714)
Absolutely! A similar example: suppose you have prior evidence of RNA splicing between two novel exons for which you only know the approximate boundaries (say, to within 100bp). You can create all possible joins of the two 100bp windows and then align your reads to this artificial reference to see which join is most likely.

Thanks a lot for your quick reply, Nils! Could you give me a bit more detail about how to do this? I remember in your paper you said that when the best alignment of a read maps equally well to more than one location on the reference (in my example that would be more than one cDNA sequence, which usually corresponds to the same location on the underlying genome), you would call the read unmapped; I would like BFAST to output all the mappings regardless. I haven't downloaded your tool yet but will do so as soon as I finish the work at hand!

nilshomer 11-20-2009 03:19 PM

Quote:

Originally Posted by HTS (Post 10716)
Thanks a lot for your quick reply, Nils! Could you give me a bit more detail about how to do this? I remember in your paper you said that when the best alignment of a read maps equally well to more than one location on the reference (in my example that would be more than one cDNA sequence, which usually corresponds to the same location on the underlying genome), you would call the read unmapped; I would like BFAST to output all the mappings regardless. I haven't downloaded your tool yet but will do so as soon as I finish the work at hand!

By default, BFAST outputs all mappings for a read regardless of their quality/score. Only in the postprocessing step do we use various criteria (specified by the user) to choose the "best" alignment. One of those criteria is to output all alignments, which is what you want.
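For Leo's isoform-counting use case, once postprocessing is set to report all alignments, a small script over the SAM output suffices. This sketch uses only standard SAM columns (QNAME, FLAG, RNAME); the file name is a hypothetical placeholder:

    from collections import defaultdict

    transcripts_per_read = defaultdict(set)
    with open("aligned.sam") as sam:       # hypothetical BFAST SAM output
        for line in sam:
            if line.startswith("@"):       # skip SAM header lines
                continue
            qname, flag, rname = line.split("\t")[:3]
            if rname != "*" and not int(flag) & 0x4:   # mapped records only
                transcripts_per_read[qname].add(rname)

    for read, txs in sorted(transcripts_per_read.items()):
        print(read, len(txs))              # distinct transcripts hit per read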

HTS 11-20-2009 03:23 PM

Quote:

Originally Posted by nilshomer (Post 10718)
By default, BFAST outputs all mappings for a read regardless of their quality/score. Only in the postprocessing step do we use various criteria (specified by the user) to choose the "best" alignment. One of those criteria is to output all alignments, which is what you want.

Great, thanks a lot! I can't wait to try things out! :)

