  • #61
    @lh3 and other eminent speakers in the thread

    It's nice to know that so many new aligners have come up in competition with BLAST, which dominated for about two decades and in many ways is still the most popular.

    In your opinion, for what purposes is bwa-sw better than bwa-mem? I am assuming bwa-mem is roughly equivalent to bowtie2 for long reads of 250 to 400 bp, but not very good for reads on the order of kilobases.

    For what purposes could bwt-sw be more applicable than bwa-sw?

    In what ways is bwa-sw better than MegaBlast, given that both align long sequences on the order of kilobases or larger?

    In what way could bwt-sw differ from BLAST? Although the latter uses heuristics, it still comes up with all the relevant local alignments if you tune the -e parameter accordingly, such as increasing it to 10. The website of bwt-sw says to use it at your own risk, implying that the software is not well tested!

    Narain


    • #62
      I think that BWA-MEM outperforms both BWA and BWA-SW. You can check out some of the comparisons we have done on gcat (www.bioplanet.com/gcat) and I am sure others here probably have tried out BWA-MEM as well.


      • #63
        @adaptivegenome

        Thank you for directing me to the GCAT website. I am sure it will become more useful with time.

        At the moment, it does not fully answer my concerns. I want a comparison of Bowtie2 vs BWA-MEM, not Bowtie1. I would also like a comparison of BWA-MEM vs BWA-SW on sequences several kilobases long, of BWA-SW vs Megablast on sequences of several kilobases, and of BWT-SW vs BLAST with an e-value parameter as high as 10.

        I am sure a third-party analysis such as GCAT will be useful. At the same time, the opinions of the makers of these tools would be very useful, as they are aware of the alternatives.

        Narain


        • #64
          GCAT is using Bowtie2 and not Bowtie1. Also you can choose what tools are compared or add your own.

          You don't even need to use GCAT. I mentioned it because I did not want to make a performance claim without being able to back it up with results. In my discussions with Heng Li, he also indicated that BWA-MEM was designed to replace the previous BWA and BWA-SW implementations, but you should certainly get his opinion directly as well.


          • #65
            Originally posted by narain
            @lh3 and other eminent speakers in the thread

            Its nice to know that we have so many new aligners that have come up in competition to the BLAST which dominated for about 2 decades, and in many way is still the most popular.
            Actually none of the new aligners takes blast as a competitor. BLAST and NGS mappers are in different domains.

            In your opinion, for what purposes is bwa-sw better than bwa-mem. I am assuming bwa-mem to be a more equivalent for bowtie2 for long reads such as 250 to 400 bp, but not very good for reads in the order of kilobases.
            You can find in the bwa manpage that both bwa-mem and bwa-sw are designed for 70bp up to a few megabases. As long as you are aware of the several differences between 500bp and 5Mbp alignment, it is not hard to write an aligner that works with a wide range of read lengths. As I have already done that in bwa-sw, I can do the same in bwa-mem.

            I would say Bowtie2 is primarily designed for sequence reads several hundred bp in length, and its performance drops with increasing read length.

            For what purposes could bwt-sw be better applicable than bwa-sw?
            Bwt-sw guarantees to find all local hits while bwa-sw does not. When you want to make sure a read alignment is correct, bwt-sw is the way to go.

            In what ways bwa-sw is better than MegaBlast given that both do alignment for long sequences in order of kilobases or larger.
            You can find the arguments in the bwa-sw paper. Blast and megablast are good for homology searches. However, they report local hits only, which is not quite suitable for long read, contig or BAC alignment. The bwa-sw paper gives an example:

            Say we have a 1kb read. The first 800bp, which contains an Alu, is unambiguously mapped to chr1 and the remaining 200bp unambiguously to chr2. When you align this read with bwa-sw/bwa-mem, you will get two hits, as we would want to see. If you align with blast/blat/ssaha2, you will get the 800bp hit and a long list of Alu hits before seeing the 200bp hit, but for read alignment we do not care about these Alu hits, as they are contained in the 800bp hit.
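To make the "contained hit" idea in the example above concrete, here is a minimal sketch. It is not how bwa-sw or bwa-mem is actually implemented; the `Hit` tuple, the coordinates, and the scores are all made up for illustration. It keeps a hit only if its interval on the read is not fully covered by a better-scoring hit that was already kept.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A local hit of the read; coordinates are on the read (hypothetical layout)."""
    query_start: int  # 0-based, inclusive
    query_end: int    # exclusive
    ref_name: str
    score: int


def drop_contained_hits(hits: List[Hit]) -> List[Hit]:
    """Drop hits whose read interval lies inside a better-scoring kept hit."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        contained = any(
            k.query_start <= hit.query_start and hit.query_end <= k.query_end
            for k in kept
        )
        if not contained:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.query_start)


# Made-up hits for the 1kb read: 800bp on chr1, 200bp on chr2, plus Alu copies.
hits = [
    Hit(0, 800, "chr1", 780),
    Hit(300, 600, "chr5", 250),   # Alu copy, inside the 800bp hit
    Hit(310, 590, "chr9", 240),   # another Alu copy
    Hit(800, 1000, "chr2", 195),
]
primary = drop_contained_hits(hits)
print([h.ref_name for h in primary])
```

With these invented numbers only the chr1 and chr2 hits survive, which is the two-hit picture described above.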

            In addition to this problem, blast does not compute mapping quality. For resequencing, contig alignment etc., this is a more practical metric than the e-value, which is only useful for homology searching. Furthermore, bwa-sw/bwa-mem are far faster than megablast. (PS: another problem with blast, I am not sure about megablast, is its X-dropoff heuristic. This at times leads to fragmented alignments which are hard to process.)

            In what way bwt-sw could be different than BLAST as though the latter uses heuristics, it still comes up with all the relevant local alignments if you tune the -e parameter accordingly, such as increasing it to 10. The website of bwt-sw states to use it at your own risk, implying that the software is not well tested!
            Blast uses heuristics, which means it may miss optimal hits regardless of the -e in use. If there are two mismatches at positions 11 and 22 on a 32bp read, blast will miss it as there are no seed hits. Bwt-sw does not have this problem. Bwt-sw instead has problems with repeats. Internally, bwt-sw switches to the standard SW when the score is high enough. When the input is a repeat, it may have millions of candidates, which will cause bwt-sw to fail.
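The seed argument above can be checked with a few lines of arithmetic. The sketch below assumes an exact-match seed (word) length of 11, which is an assumption on my part and not stated in the post, and 1-based mismatch positions. It computes the longest mismatch-free stretch of the read; if that stretch is shorter than the word length, no exact seed can exist.

```python
def longest_exact_run(read_length, mismatch_positions):
    """Length of the longest stretch without a mismatch (positions are 1-based)."""
    longest = 0
    previous = 0  # position of the previous mismatch; 0 means "before the read"
    # A sentinel just past the read end closes the final stretch.
    for pos in sorted(mismatch_positions) + [read_length + 1]:
        longest = max(longest, pos - previous - 1)
        previous = pos
    return longest


WORD_SIZE = 11  # assumed seed length, not taken from the post

run = longest_exact_run(32, [11, 22])
seedable = run >= WORD_SIZE
print(run, seedable)
```

Under that assumption the three exact stretches are positions 1-10, 12-21 and 23-32, each 10bp long, so a seed of 11 never fits.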

            On GCAT, the developers have contacted me off the forum. I have one concern with the evaluation and they have acknowledged it. Until the concern is resolved, I would take the results with a little caution. In general, though, GCAT is great. I really like the idea and its implementation, and appreciate the great efforts.
            Last edited by lh3; 04-25-2013, 10:37 AM.


            • #66
              Dear @lh3 and @adaptivegenome

              Thank you for addressing my concerns with regard to applicability and merits of various applications. I can now use the tools wisely. Is there a paper due, to explain the improvements in bwa-mem over its predecessor bwa-sw ?

              Narain


              • #67
                Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date. Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at http://github.com/lh3/bwa. Contact: [email protected]


                • #68
                  An interesting paper. The results included in it seemed to agree with our evaluation results in which we found Novoalign and Subread have the best accuracy -- http://www.ncbi.nlm.nih.gov/pubmed/23558742


                  • #69
                    Well, in another thread, I have told you that I reviewed your manuscript. In fact, part of my review is still true. Subread is really fast, but this comes at the cost of accuracy. On the data set in my manuscript, subread reports 1455 wrong alignments out of its top 1704k alignments at mapQ threshold 101 and 41770/1803k at mapQ=1 (note that I am trying to choose favorable thresholds for subread). In comparison, bwa-mem makes only 3 mistakes out of 1735k alignments at mapQ=47 and 1797/1910k at mapQ=1. As to novoalign, 2/1843k at mapQ=65 and 1010/1922k at mapQ=4. Bowtie2: 74/1713k at mapQ=24; 4701/1884k at mapQ=4. These aligners are all more accurate than subread: they report fewer wrong alignments at similar or better sensitivity.
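For readers who want the counts above as rates, the short script below just divides wrong alignments by total alignments. The only assumption is that "k" means thousands, so 1704k is read as 1,704,000; nothing else is added to the figures quoted in the post.

```python
# (aligner, mapQ threshold, wrong alignments, total alignments) as quoted above.
reported = [
    ("subread", 101, 1455, 1_704_000),
    ("subread", 1, 41_770, 1_803_000),
    ("bwa-mem", 47, 3, 1_735_000),
    ("bwa-mem", 1, 1797, 1_910_000),
    ("novoalign", 65, 2, 1_843_000),
    ("novoalign", 4, 1010, 1_922_000),
    ("bowtie2", 24, 74, 1_713_000),
    ("bowtie2", 4, 4701, 1_884_000),
]


def error_rate(wrong, total):
    """Fraction of reported alignments that are wrong."""
    return wrong / total


rates = {
    (name, mapq): error_rate(wrong, total)
    for name, mapq, wrong, total in reported
}

for (name, mapq), rate in rates.items():
    print(f"{name:10s} mapQ threshold {mapq:<3d} error rate = {rate:.2e}")
```

This puts the numbers on the same scale as the 1e-5 figure mentioned elsewhere in the thread.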

                    In addition, Figure 4 in your paper is surprising. It shows that novoalign made 40k wrong alignments out of 10 million reads even in unique regions with repeats removed (Fig 4a), while bwa wrongly aligned 2k alignments out of 100k reads (Fig 4c; 2% error rate!). These error rates are exceedingly high. If you think my plots are biased, you can have a look at the ROC curves in the Bowtie2 paper or more clearly the recent LAST paper [PMID:23413433]. All these ROC curves, including mine, are talking about 1e-5 or even lower error rate. That is the accuracy a mapper should achieve.
                    Last edited by lh3; 04-25-2013, 06:51 PM.


                    • #70
                      Firstly let me point out what I received in my inbox about your reply --- "Well, in another thread, I have told you that I reviewed your manuscript. I meant to let you know my opinion without making it public. In fact, ...". I think here and before you broke the trust the prestigious journal PNAS gave to you for reviewing our paper.

                      I would also like to point out that Figure 4 was NOT included in the version of our paper you reviewed.

                      Let me tell you the reason why Subread had a high error rate in your evaluation. It is because your simulation data were generated using wgsim, which assumes that sequencing errors are uniformly distributed along the read sequence. This assumption is invalid because, for example, Illumina sequencing data have high sequencing error rates at the start and end positions of the reads. You agreed with this in your own post from a few days ago:

                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                      The parameters of our Subread aligner were tuned using the real sequencing error model. We used the sequencing error information provided in the SEQC (MAQC III) data to help generate simulation data very close to the real data, and tuned the Subread parameters to work best for real data. The fundamental difference in how the simulated reads were generated gives rise to the remarkable differences between our evaluation and your evaluation. This difference may not change the evaluation results much when you evaluate the family of seed-and-extend aligners, but it changes a lot for the evaluation of the Subread aligner, which uses an entirely different paradigm -- seed-and-vote.

                      The ERCC spike-in sequencing data included in the SEQC study are very useful in evaluating the accuracy of alternative aligners, because they are real sequencing data with known truth. Table 3 in our paper clearly shows the superiority of the Subread aligner over other aligners.

                      Figure 4 is not surprising to us at all. It is simply because we used the real sequencing errors (extracted from SEQC reads), which were not used in other evaluations at all.


                      • #71
                        In my understanding, the review process is kept confidential before publication to prevent others from stealing your ideas. As your paper has already been published, I am not sure why I cannot disclose that I was a reviewer of an earlier version of your manuscript without saying which version. On the contrary, I think once the paper is published, it would be better to make the review process transparent. I am currently signing all my reviews. I am responsible for my words in the review, even if I, admittedly, give bad suggestions sometimes. On the other hand, I could be wrong about the journal policy. If I have violated it, I deeply apologize and promise not to do that again.

                        You should read the bowtie2 and the recent LAST papers. Bowtie2 uses mason, the same one you use in Fig 4a. LAST is simulating sequencing errors based on arguably realistic error profiles, very similar to yours in Fig 4a. You can also ask the developers of those popular mappers if 2% error rate for mapQ>~20 alignments represents the accuracy of their mappers on real data.

                        Simulating data based on empirical quality is quite common. I did that 5 years ago (albeit with a simpler version of the simulation) and many others have done similar or better. As to seed-and-vote, both bwa-sw and bwa-mem use it. I did not emphasize it because this idea is frequently seen in long-read and whole-genome alignment. Among more recent papers, GSNAP and YABOS essentially use the same idea, and they apply the strategy very carefully to avoid wrong alignments.


                        • #72
                          Figure 2(a) of the Bowtie2 paper shows that the false discovery rates of the aligners were around 3 percent, which is similar to what was shown in our paper.

                          The major difference between the seed-and-vote paradigm used by Subread and the one you referred to is that, in its finalization step, Subread does not perform alignments for the seed sequences that have made successful votes, whereas seed-and-extend needs to perform the final alignment for almost every base in the read using dynamic programming or backtracking. Therefore Subread has a very small computational cost in its final alignment. For example, if 8 out of 10 subreads (16mers) made successful votes when mapping a 100bp read, there will be at most 20 bases which need to be aligned in the finalization step. This is one of the reasons why Subread is very fast. Another reason is that the mapping location is determined by the voting process, i.e. the genomic location receiving the largest number of votes (supported by the majority of the extracted subreads) is chosen as the final mapping location. The size of the genomic span of the voting subreads is used to break ties when multiple best locations are found.
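The voting step described above can be sketched as follows. This is a toy illustration and not Subread's code: the hash index, the evenly spaced subread offsets, the random 2000bp reference, and all function names are assumptions made for the example, and the span-based tie-breaking is left out. Each 16-mer subread that occurs exactly in the reference votes for the read start it implies, and the location with the most votes wins.

```python
import random
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def subread_offsets(read_length, k, n_subreads):
    """Evenly spaced subread start offsets (assumes n_subreads >= 2)."""
    last = read_length - k
    return sorted({round(j * last / (n_subreads - 1)) for j in range(n_subreads)})


def vote(read, index, k=16, n_subreads=10):
    """Each exactly matching subread votes for the read start it implies."""
    votes = Counter()
    for offset in subread_offsets(len(read), k, n_subreads):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    return votes


# Toy data: a random reference and a 100bp read from position 200
# carrying a single substitution at read offset 50.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
index = build_index(reference, 16)

bases = list(reference[200:300])
bases[50] = "A" if bases[50] != "A" else "C"
read = "".join(bases)

votes = vote(read, index)
best_location, best_votes = votes.most_common(1)[0]
print(best_location, best_votes)
```

In this toy case the substitution falls inside two of the ten subreads, so the true location collects 8 of 10 votes, mirroring the "8 out of 10" example in the post.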

                          Given the decreasing error rate of the sequencers, seed-and-vote can quickly and accurately map most reads, because a sufficient number of good subreads can be extracted from them. Subread spends most of its time mapping those reads which have more errors, indels, and junctions.


                          • #73
                            In Bowtie 2 Fig 2a, bowtie 2 and bwa essentially made no mistakes at sensitivity 90%. In your figure 4c, at the highest mapQ (the leftmost dot), bowtie 2 has a ~1500/8e4 error rate and bwa has a 2000/9e4 error rate. These are very different results from that in the bowtie 2 paper, where the error rate for high mapping quality is vanishingly small.

                            I have explained that not performing alignments for suboptimal loci is exactly why subread is fast but not as accurate as others. Due to repeats and sequencing errors/variants, a locus with fewer votes may turn out to be optimal. To achieve high accuracy, you need to see enough suboptimal loci and perform alignment around them to finally determine which has the optimal score. GSNAP is taking the right strategy. If I remember correctly, it uses votes (GSNAP does not use the term "vote" though) to filter bad alignments and then performs extension for the rest even if they are not optimal. YABOS, as I remember, is doing something similar. Both bwa-sw and bwa-mem also collect votes to filter out very bad loci but may perform DP extension for many suboptimal loci in repetitive regions. In some way, voting is just the starting point of these mappers. They do not stop at voting because they want to achieve higher accuracy. This is why they are more accurate.
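A small constructed case shows how a locus with fewer votes can still have the better alignment. This is only a sketch under simplifying assumptions: the read and both candidate windows are invented, subreads are non-overlapping 5-mers, and a plain mismatch count stands in for the DP alignment score that real mappers would use. Three scattered mismatches each break a different subread, while four clustered mismatches break only one.

```python
SUBSTITUTE = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate(sequence, positions):
    """Return the sequence with a different base at each listed position."""
    bases = list(sequence)
    for pos in positions:
        bases[pos] = SUBSTITUTE[bases[pos]]
    return "".join(bases)


def count_votes(read, window, k):
    """Number of non-overlapping k-mer subreads that match the window exactly."""
    return sum(
        read[i:i + k] == window[i:i + k]
        for i in range(0, len(read) - k + 1, k)
    )


def mismatches(read, window):
    """Mismatch count, used here as a stand-in for an alignment score."""
    return sum(a != b for a, b in zip(read, window))


read = "ACGTACGGTCAATGCCTGAA"
candidates = {
    "locus_a": mutate(read, [2, 7, 12]),    # 3 scattered mismatches
    "locus_b": mutate(read, [0, 1, 2, 3]),  # 4 clustered mismatches
}

by_votes = max(candidates, key=lambda name: count_votes(read, candidates[name], 5))
by_mismatches = min(candidates, key=lambda name: mismatches(read, candidates[name]))
print(by_votes, by_mismatches)
```

Here voting alone prefers locus_b (3 votes against 1), while scoring both candidates prefers locus_a (3 mismatches against 4), which is the situation the paragraph above describes.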


                            • #74
                              If you read our paper closely, you will find that, when we called correctly aligned reads in our simulation, the CIGAR strings of mapped reads were required to be correct in addition to the mapping locations. This may explain why aligners in our simulation had higher error rates. Parameter differences when using Mason could cause this difference as well. But the overall error rates of the two studies were comparable. I wouldn't expect the results from two different studies to be identical.

                              I'm not convinced by your explanation. We get around this suboptimal alignment issue by using more subreads in the voting. This gives the Subread aligner a lot more power in mapping reads correctly. You will find this very helpful if you implement it in your aligner, if you truly use the seed-and-vote approach as proposed in our Subread paper.

                              Subread does not stop at voting; it employs a powerful in-fill algorithm to fill in the gaps in the read sequence after voting to finalize the alignment.


                              • #75
                                You should not require CIGAR to be the same. When there is an indel in the last couple of bases of a read, there is no way to find the correct CIGAR. In a tandem repeat, multiple CIGARs are equivalent. CIGAR is also greatly affected by the scoring matrix. A mapper that uses scoring similar to the simulation will perform much better (PS: mason, as I remember, does not simulate gaps under the affine-gap model, which will bias against the more biologically meaningful affine-gap models), while in fact the true scoring matrix varies greatly with loci - you cannot simulate that realistically. In practice, RNA-seq/chip-seq and the discovery of structural variations do not care about CIGAR. The mainstream indel calling pipelines, including samtools, gatk and pindel, all use realignment. They do not trust CIGAR. For SNP calling, BAQ and GATK realignment are specifically designed to tackle this problem, which is not solvable from single read alignment. In addition, if we worry about CIGAR, we can do a global alignment to reconstruct it, which is very fast.
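The tandem-repeat point above is easy to verify on a toy sequence. In the sketch below the reference, the deletion positions, and the helper names are all invented for illustration. A 2bp deletion is placed at three different positions inside a CA repeat; every placement produces the same read, yet each corresponds to a different CIGAR string, so an exact-CIGAR check would count two of the three equally valid alignments as wrong.

```python
def apply_deletion(reference, start, length):
    """Return the sequence left after deleting `length` bases at `start` (0-based)."""
    return reference[:start] + reference[start + length:]


def deletion_cigar(start, length, read_length):
    """CIGAR of a read aligned at reference position 0 with one deletion."""
    return f"{start}M{length}D{read_length - start}M"


reference = "TTG" + "CA" * 4 + "GTT"   # toy reference with a (CA)4 tandem repeat
deletion_starts = [3, 5, 9]            # three placements inside the repeat

reads = [apply_deletion(reference, start, 2) for start in deletion_starts]
cigars = [deletion_cigar(start, 2, len(reads[0])) for start in deletion_starts]

print(set(reads))
print(cigars)
```

All three placements give one and the same 12bp read with the same mapping position, which is why comparing positions is a more robust correctness criterion than comparing CIGARs.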

                                The primary goal of a mapper is to find correct mappings or positions. Without correct mapping positions, none of the above (realignment etc.) will work. CIGAR is a secondary target. By requiring exact CIGAR, you are claiming massive correct mappings as wrong ones, but these have little effect on downstream analyses. You are evaluating an accuracy that does not have much to do with practical analyses. At the same time, the massive incorrect CIGARs completely hide the true mapping accuracy for those accurate mappers such as bowtie 2 and novoalign. The last and bowtie 2 papers are talking about 1e-5 error rate for high mapQ mappings, while you are showing 2% error rate. These are not comparable.
                                Last edited by lh3; 04-26-2013, 05:45 AM.
