  • BFAST published

    The paper describing the algorithm behind BFAST has now been published.

    Homer N, Merriman B, Nelson SF.
    BFAST: An alignment tool for large scale genome resequencing.
    PLoS ONE. 2009; 4(11): e7767.
    PMID: TBD


    Background
    The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.

    Methodology
    We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.

    Conclusions
    We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
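    The two-stage approach in the Methodology (multiple spaced-seed indexes to find candidate locations, then a gapped alignment of the full read against each candidate window) can be sketched in a few lines of Python. This is a toy illustration of the idea only, not BFAST's actual code; the reference, masks, and scoring values are made up for the example.

    ```python
    from collections import defaultdict

    def build_index(ref, mask):
        """Index every reference position by a spaced-seed key; `mask` is a
        string of 1s/0s marking which bases inside the seed window count."""
        idx = defaultdict(list)
        w = len(mask)
        for pos in range(len(ref) - w + 1):
            key = "".join(b for b, m in zip(ref[pos:pos + w], mask) if m == "1")
            idx[key].append(pos)
        return idx

    def candidates(read, indexes):
        """Union of candidate alignment start positions over all indexes; an
        error that breaks the seed in one index may fall on a masked-out
        position in another, which is what makes multiple indexes robust."""
        hits = set()
        for mask, idx in indexes:
            w = len(mask)
            for off in range(len(read) - w + 1):
                key = "".join(b for b, m in zip(read[off:off + w], mask) if m == "1")
                for pos in idx.get(key, []):
                    if pos - off >= 0:
                        hits.add(pos - off)
        return hits

    def glocal_score(read, window, match=1, mismatch=-1, gap=-2):
        """Gapped alignment that is global in the read but local in the
        window: the whole read must align, with free window flanks."""
        n, m = len(read), len(window)
        prev = [0] * (m + 1)                 # free leading window bases
        for i in range(1, n + 1):
            cur = [i * gap]                  # skipping read bases costs gaps
            for j in range(1, m + 1):
                s = match if read[i - 1] == window[j - 1] else mismatch
                cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            prev = cur
        return max(prev)                     # free trailing window bases

    ref = "ACGTTGCATTGG"
    indexes = [(m, build_index(ref, m)) for m in ("1111", "11011")]
    # "TTACA" is ref[3:8] with its middle base mutated: the contiguous seed
    # "1111" misses it, but the spaced seed "11011" masks that position out.
    print(candidates("TTACA", indexes))      # contains position 3
    print(glocal_score("TTACA", ref))        # 3: four matches, one mismatch
    ```

    The point of the spaced mask is visible in the toy run: the mutated read produces no exact 4-mer seed, yet the "11011" index still recovers the true location.
    
    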

  • #2
    Congrats Nils.



    • #3
      Congratulations!
      I've downloaded it and if it runs as stated in the paper I'm ready to use it especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
      SAM output is a great feature, I'm running almost aligner-independent now :-)



      • #4
        I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.

        In SNP calling given short reads, alignment global to reads has some advantage. Suppose the 3rd base in a read is a true mutation. Local alignment will clip the read and as a result you see more reads having the reference allele than having the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.
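    The clipping effect described above can be illustrated with a toy dynamic-programming example (scores and sequences are invented for the demonstration; this is not any real aligner's code). With typical local-alignment scoring, a mismatch near the read end costs more than the few flanking matches gain, so Smith-Waterman drops the variant-bearing prefix, while an alignment that is global in the read keeps it:

    ```python
    def smith_waterman_span(read, ref, match=1, mismatch=-3, gap=-4):
        """Local (Smith-Waterman) alignment; returns (score, read_start,
        read_end) -- read bases outside [read_start, read_end) are clipped."""
        n, m = len(read), len(ref)
        H = [[0] * (m + 1) for _ in range(n + 1)]
        S = [[0] * (m + 1) for _ in range(n + 1)]  # read index where the path starts
        for i in range(n + 1):
            S[i][0] = i
        best = (0, 0, 0)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if read[i - 1] == ref[j - 1] else mismatch
                diag = H[i - 1][j - 1] + s
                up = H[i - 1][j] + gap
                left = H[i][j - 1] + gap
                h = max(0, diag, up, left)
                H[i][j] = h
                if h == 0:
                    S[i][j] = i          # a fresh local path starts here
                elif h == diag:
                    S[i][j] = S[i - 1][j - 1]
                elif h == up:
                    S[i][j] = S[i - 1][j]
                else:
                    S[i][j] = S[i][j - 1]
                if h > best[0]:
                    best = (h, S[i][j], i)
        return best

    def fit_score(read, ref, match=1, mismatch=-3, gap=-4):
        """Alignment global in the read, free at the reference flanks: every
        read base is accounted for, so a near-end variant cannot be clipped."""
        n, m = len(read), len(ref)
        prev = [0] * (m + 1)
        for i in range(1, n + 1):
            cur = [i * gap]
            for j in range(1, m + 1):
                s = match if read[i - 1] == ref[j - 1] else mismatch
                cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            prev = cur
        return max(prev)

    ref = "TTTTACGTACGTACGTTTT"
    read = "ACATACGTACGT"                  # ref[4:16] with its 3rd base G -> A
    print(smith_waterman_span(read, ref))  # (9, 3, 12): the first three bases,
                                           # including the variant, are clipped
    print(fit_score(read, ref))            # 8: whole read aligned, mismatch kept
    ```

    In the local alignment the mutant allele never appears in the reported alignment at all, which is exactly the reference bias being described: reads carrying the variant contribute no evidence for it unless the clipped alignments are post-processed.
    
    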



        • #5
          Originally posted by lh3:
          I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.
          What panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. For false-mapping, with no errors I don't see a >0% false mapping rate in Figure 3.

          In SNP calling given short reads, alignment global to reads has some advantage. Suppose the 3rd base in a read is a true mutation. Local alignment will clip the read and as a result you see more reads having the reference allele than having the non-reference allele, which leads to strong reference bias. To alleviate this bias, we need to postprocess local alignments.
          We don't clip the reads. The "local alignment" we use is a global alignment to a window around the candidate location. In other words, we enforce that the entire read must align to some subset of bases in the reference.



          • #6
            Originally posted by dawe:
            Congratulations!
            I've downloaded it and if it runs as stated in the paper I'm ready to use it especially for resequencing and SNP projects. I'm a bit worried about speed (and memory usage) but, as said, if the error rate is so low...
            SAM output is a great feature, I'm running almost aligner-independent now :-)
            @dawe
            The current version allows the user to get rid of the memory problem at the cost of some speed. You use the "-d" option to split the index during index creation. This allows you to run BFAST in 4GB environments etc.
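            A command-line sketch of that workflow, assuming the flag names from the BFAST manual of that era (`-m` seed mask, `-w` hash width, `-i` index number, `-n` threads; `hg18.fa` and the mask are placeholders) — verify against `bfast index --help` for your version:

            ```shell
            # Pack the reference once
            bfast fasta2brg -f hg18.fa

            # Build index number 1 with split depth 2: the index is written in
            # 4^2 = 16 pieces, so no single piece must hold the whole genome in RAM
            bfast index -f hg18.fa \
                -m 111111111111111111 \
                -w 14 \
                -i 1 \
                -d 2 \
                -n 8
            ```
            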

            @dawe and @lh3
            The assumption that the error rate is below 6% is not valid for ABI SOLiD data. With BFAST's ability to be very sensitive, we regularly observe runs with >10% (color/raw) error rates towards the ends of reads, even though after alignment the base error rate is well below 1% (kudos to 2-base encoding).



            • #7
              Originally posted by nilshomer:
              What panel are you looking at (sensitivity or false mapping)? Obviously the sensitivity cannot be 100% due to the repetitiveness of the human genome. For false-mapping, with no errors I don't see a >0% false mapping rate in Figure 3.
              Unless sensitivity is too bad, I do not care too much about false negatives. I am looking at false positives (2nd column). At 0-mismatch/indel, both bwa and bowtie have ~6% error rate while others have 0%.

              We don't clip the reads. The "local alignment" we use is a global alignment to a window around the candidate location. In other words, we enforce that the entire read must align to some subset of bases in the reference.
              This is reassuring. Thanks for the explanation.



              • #8
                Congratulations Nils Homer!

                I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

                My only question regarding BFAST's performance is whether you have recently compared it to SSAHA2. I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

                Thanks for releasing your software. I will be trying it now.



                • #9
                  Congratulations, Nils!

                  Originally posted by lh3:
                  I did not read into details, but something is a little bit odd in Figure 3. How can bowtie and bwa produce wrong alignments even if the read can be perfectly mapped? For perfectly aligned reads, both of them should always give correct results. Also note that 6% error is really a lot. If an aligner was that bad, it would not survive these days. Nonetheless, I agree that the performance of bwa degrades with higher error/mutation rate.
                  It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero to two nucleotide errors.

                  What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.

                  Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it.



                  • #10
                    Originally posted by Michael.James.Clark:
                    It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero to two nucleotide errors.
                    Both bowtie and bwa are able to find all the occurrences of perfect matches, and they always know whether a perfect alignment is an exact repeat or not. They should not produce false positives in theory. In practice, I have evaluated them and they do not, at least not at a rate as high as 6%. In addition, I am curious why in 3A the aligners do not do equally well for perfect hits. Again, maq and soap guarantee to find all perfect hits.

                    What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.
                    I completely agree that BFAST is a great tool for hard reads. The performance of the other aligners looks reasonable, which implies BFAST is very capable. I just think bowtie and bwa should be presented in the right way to avoid confusion.
                    Last edited by lh3; 11-20-2009, 05:02 AM.



                    • #11
                      Originally posted by NGSfan:
                      Congratulations Nils Homer!

                      I really enjoyed your paper - I also must thank you for the in-depth supplement that really explained the history behind BFAST and other aligners. It shows you put a lot of thought into this.

                      My only question regarding BFAST's performance is whether you have recently compared it to SSAHA2. I know you did not intend to be comprehensive in this paper when comparing algorithms (it is impossible to be comprehensive now with 30+ aligners available), but I feel that SSAHA2 might be a real benchmark to compare against for speed and accuracy. Would you say that BFAST would run faster than SSAHA2 given similar accuracy settings?

                      Thanks for releasing your software. I will be trying it now.
                      I have not tested SSAHA2 lately, although I think SSAHA2 is built for longer reads if I am correct. Also note that since the release of the paper, BFAST's speed has improved by an order of magnitude (submitted as another paper).



                      • #12
                        Originally posted by Michael.James.Clark:
                        Congratulations, Nils!



                        It makes sense to me that they would have higher false positive rates at zero errors because they identify significantly more true positives as well (as 3A is basically gauging true positives and 3B is gauging false positives). My guess is that they are aligning questionable reads that the other aligners are not aligning at all in order to have a higher yield of true positives. I would pose the question of whether one can get BWA and Bowtie to generate fewer false positives, and how that would thereby affect their true positive yield at zero errors. (I have not used them personally, so it's an open question.) Then again, I think it's rather a moot point. The figure is basically demonstrating to me that all of these aligners do quite well at zero to two nucleotide errors.

                        What I find more striking in that particular figure is how they perform with high error reads, but even more strikingly, reads containing deletions (even deletions of a relatively large size). This is important to me because a lot of studies, particularly whole genome, are searching for novel variants and deletions with relatively low coverage.

                        Anyway, we have had great success using BFAST on whole genome data from the SOLiD platform, so I am really happy to see it finally published and encourage others to try it.
                        At zero errors and no variants, Heng Li is saying that there should be zero false positives, although there could be false negatives.



                        • #13
                          Performing true local alignment with BFAST?

                          Hi Nils,

                          It is nice to read your paper, and I hope that my question is not completely out of context (admittedly I haven't had time to dig into the details of your implementation yet). Basically I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first). An example: my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene, each with its own header, and for a given read I want to find out how many transcripts the read maps to. I can do this straightforwardly with BLAT (by parsing the PSL file, since I have gene/transcript information in the header), but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information, as opposed to BLAT). If yes, I can probably finally stop using the old favorite BLAT as well. Please feel free to give me any suggestions/comments (this goes for everyone reading this, too). Thanks a lot!

                          -- Leo



                          • #14
                            Originally posted by HTS:
                            Hi Nils,

                            It is nice to read your paper, and I hope that my question is not completely out of context (admittedly I haven't had time to dig into the details of your implementation yet). Basically I am thinking about applications other than resequencing, where I just want to find all local alignments passing a certain criterion (sure, you can put an upper limit on the total number of matches for practical reasons, but I can use any of the tools out there to filter out heavy repeats first). An example: my reference contains 10 cDNA sequences (different splice isoforms) belonging to the same gene, each with its own header, and for a given read I want to find out how many transcripts the read maps to. I can do this straightforwardly with BLAT (by parsing the PSL file, since I have gene/transcript information in the header), but the reference itself becomes ill-defined for all the global alignment tools out there. I am wondering if your tool can be easily adapted to handle this situation (while still making use of the base quality information, as opposed to BLAT). If yes, I can probably finally stop using the old favorite BLAT as well. Please feel free to give me any suggestions/comments (this goes for everyone reading this, too). Thanks a lot!

                            -- Leo
                            Absolutely! A similar example: suppose you have prior evidence of RNA splicing between two novel exons for which you only know the approximate boundaries (say, within 100 bp). You can create all possible joins of the two 100 bp windows and then align your reads to the artificial reference to see which join is most likely.
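                            The join enumeration can be sketched like this (a hypothetical illustration; the naming scheme and `min_flank` parameter are invented, not part of BFAST): keep a prefix of the left window and a suffix of the right window and fuse them, once per breakpoint pair.

                            ```python
                            def candidate_junctions(left_window, right_window, min_flank=1):
                                """Enumerate candidate splice junctions: each entry fuses a prefix
                                of the left window to a suffix of the right window."""
                                joins = {}
                                for i in range(min_flank, len(left_window) + 1):
                                    for j in range(len(right_window) - min_flank + 1):
                                        name = f"join_L{i}_R{j}"          # breakpoint coordinates
                                        joins[name] = left_window[:i] + right_window[j:]
                                return joins

                            # Two 6 bp toy windows give 6 x 6 = 36 candidate junction sequences;
                            # two 100 bp windows give 10,000 -- still a tiny artificial reference.
                            joins = candidate_junctions("ACGTAC", "GGTTAA")
                            print(len(joins))           # 36
                            print(joins["join_L2_R4"])  # "ACAA": 2 bp of left + last 2 bp of right
                            ```

                            Writing each join as its own FASTA entry named by its breakpoints makes the winning alignment self-describing: the reference header of the best hit tells you the inferred junction.
                            
                            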



                            • #15
                              Originally posted by nilshomer:
                              Absolutely! A similar example is suppose you have prior evidence of RNA splicing between two novel exons in which you only know the approximate boundaries (say within 100bp). You can create all possible joins of the two 100bp windows and then align your reads to the artificial reference to see which join is the most likely.
                              Thanks a lot for your quick reply, Nils! Could you give me a bit more detail about how to do this? I remember in your paper you said that when the best alignment of a read maps to more than one location on the reference equally well (in my example, more than one cDNA sequence, which usually corresponds to the same location on the underlying genome), you report the read as not mapped; but I would like BFAST to output all the mappings regardless. I haven't downloaded your tool yet but will do so as soon as I finish the work at hand!
