Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • adaptivegenome
    Super Moderator
    • Nov 2009
    • 436

    New way to compare mappers and variant callers

    Guys,

    I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



    You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

    Would love eveyone's input. It's a cool project but brand new and it will need lots of work!
  • oiiio
    Senior Member
    • Jan 2011
    • 105

    #2
    In response to @lh3 from http://seqanswers.com/forums/showthr...t=15200&page=4

    Although the metrics used on your website's comparison are similar, it looks like you used a wiggle of 50bp, which could account for differences. The curve axes are incorrect reads / mapped reads and correct reads / total in FASTQ

    From other reports with different read lengths and mutations I see pretty consistently that BWA MEM does indeed deprecate BWASW, and MEM is also the most accurate open source aligner i have seen so far!

    Comment

    • shi
      Wei Shi
      • Feb 2010
      • 236

      #3
      Originally posted by adaptivegenome View Post
      Guys,

      I'm starting a new thread for GCAT which is a completely free and collaborative way to compare genome analysis pipelines:



      You can link to specific comparisons and they are fully interactive so this should make discussion here better because we can link to real data!

      Would love eveyone's input. It's a cool project but brand new and it will need lots of work!
      This looks an interesting project and I'm interested in running Subread aligner on the test datasets. But could you please provide a bit more information about the datasets such as the fragment length of PE data and the indel lengths in the small indel datasets and large indel datasets?

      Thanks,
      Wei

      Comment

      • oiiio
        Senior Member
        • Jan 2011
        • 105

        #4
        The inner distance for paired end is always 300bp (so a template size of 500bp for 100bp reads)

        The small indel rate has sizes 1-10bp, and the large indel rate is 10-24bp

        Looking forward to seeing the Subread results!

        Comment

        • shi
          Wei Shi
          • Feb 2010
          • 236

          #5
          Thanks for the info, oiiio. I will try to upload the Subread results soon.

          I would also like to comment on the simulation data generated in your project. It looks like you used wgsim simulator to generate reads. This simulator seems to assume that the sequencing errors are uniformly distributed in the read sequences, but we know that this is not true because for example Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulator/s in your project as well.

          The simulator 'Mason' was used in both bowtie2 paper and Subread paper, so you may consider using that as well. The simulator 'Art' was used in Subread paper as well.

          In the simulation data you have generated using wgsim, all read bases have exactly the same quality scores ("2"), which correspond to the phred score of 17 which is quite low. This makes your datasets actually be quite different from real datasets and this will particularly affect those aligners which makes use of quality score information during their alignments. Art and Mason seem to do better in this regard.

          Cheers,
          Wei

          Comment

          • adaptivegenome
            Super Moderator
            • Nov 2009
            • 436

            #6
            GCAT actually uses DWGSIM, but you are right about the assumptions. GCAT also includes real data for variant calling benchmarks but I think you are right, we should try some other simulators.

            We will give MASON a try and it would be awesome if you were to upload some results in the meantime.

            Thanks for the input!


            Originally posted by shi View Post
            Thanks for the info, oiiio. I will try to upload the Subread results soon.

            I would also like to comment on the simulation data generated in your project. It looks like you used wgsim simulator to generate reads. This simulator seems to assume that the sequencing errors are uniformly distributed in the read sequences, but we know that this is not true because for example Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulator/s in your project as well.

            The simulator 'Mason' was used in both bowtie2 paper and Subread paper, so you may consider using that as well. The simulator 'Art' was used in Subread paper as well.

            In the simulation data you have generated using wgsim, all read bases have exactly the same quality scores ("2"), which correspond to the phred score of 17 which is quite low. This makes your datasets actually be quite different from real datasets and this will particularly affect those aligners which makes use of quality score information during their alignments. Art and Mason seem to do better in this regard.

            Cheers,
            Wei

            Comment

            • oiiio
              Senior Member
              • Jan 2011
              • 105

              #7
              This is a good point, and I think it would be very interesting to see how the alignment reports change with simulator as well. Comparing simulators that use and don't use that kind of error model could test how reliant base quality-sensitive aligners are these qualities, and also on expected error position, which would be very cool!

              In the meantime you will find the variant calling test interesting also, as it is performed on real whole exome data (NA12878) on multiple platforms. With this you can compare a VCF generated by an aligner+variant caller pipeline.

              For example, here is the variant calling report for the 100bp 30x coverage Illumina exome :




              Originally posted by shi View Post
              Thanks for the info, oiiio. I will try to upload the Subread results soon.

              I would also like to comment on the simulation data generated in your project. It looks like you used wgsim simulator to generate reads. This simulator seems to assume that the sequencing errors are uniformly distributed in the read sequences, but we know that this is not true because for example Illumina HiSeq sequencers produce more errors at the start/end of the reads. So you may need to add other read simulator/s in your project as well.

              The simulator 'Mason' was used in both bowtie2 paper and Subread paper, so you may consider using that as well. The simulator 'Art' was used in Subread paper as well.

              In the simulation data you have generated using wgsim, all read bases have exactly the same quality scores ("2"), which correspond to the phred score of 17 which is quite low. This makes your datasets actually be quite different from real datasets and this will particularly affect those aligners which makes use of quality score information during their alignments. Art and Mason seem to do better in this regard.

              Cheers,
              Wei
              Last edited by oiiio; 04-15-2013, 06:50 AM.

              Comment

              • lh3
                Senior Member
                • Feb 2008
                • 686

                #8
                No read simulators simulate realistic data. ART and mason infer quality from empirical alignment, which will: 1) bias towards mappers using similar algorithms; 2) miss reads with excessive mismatches/gaps; 3) be dependent of the reference and sequences in use. I don't know much about ACANA used by ART, but RazorS used by mason does not do affine-gap alignment, which will affect its gap modeling. My understanding is both of them assume position independence. This is clearly not true for Illumina reads: some reads are systematically worse at every base than others (for example, if there is a Q2 base, the following bases are also Q2). If I am right, both ART and mason assume no variations between reads and the reference genome. This is rarely the case in practice. In an Illumina alignment, the majority of gaps, especially long gaps, are variations instead of sequencing errors. Furthermore, when we align reads to a related species ~1% divergence away from the sequenced sample, the reads will apparently have 1% uniform errors. 1% divergence is not rare when we study non human organisms. To this end, ART and mason do something better but something worse.

                Wgsim uses a default sequencing error rate as high as 2% because I want to stress aligners. All mappers make no mistake for a perfectly matched read. However, with 100bp or longer reads, we should be able to access 2% divergence (for human, quite a lot of reads span two SNPs). It is these regions that show and need the capability of an aligner. Longer indels (say >5bp), which I guess are very rare from mason/art simulation, also stress the mappers a lot.

                Finally, my experience is that the relative performance of mappers stay the same within a reasonably large range error configurations. The maq read simulator generates more realistic errors. I stopped using that in wgsim mainly because as I tested that time, the maq error model added little to the evaluation of mappers.


                I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate it to the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw a the quality of a real read, generate errors from the recalibrated quality, but output raw base quality. Such a simulation will be much more realistic than anything at present. Nonetheless, for mapper evaluation, probably this would not lead to much difference, and, life is too short to do everything perfectly.
                Last edited by lh3; 04-15-2013, 07:27 AM.

                Comment

                • adaptivegenome
                  Super Moderator
                  • Nov 2009
                  • 436

                  #9
                  Originally posted by lh3 View Post
                  I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate it to the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw a the quality of a real read, generate errors from the recalibrated quality, but output raw base quality. Such a simulation will be much more realistic than anything at present. Nonetheless, for mapper evaluation, probably this would not lead to much difference, and, life is too short to do everything perfectly.
                  Another great quote from Heng

                  Comment

                  • shi
                    Wei Shi
                    • Feb 2010
                    • 236

                    #10
                    Mason and Art do allow you to specify the rates of indels and SNPs introduced to simulation reads.

                    It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome and then mutated read bases according to their quality scores (the lower the quality scores, the more likely they will be changed to different bases). These made the simulation reads very similar to the real reads. We also included simulation reads generated from Mason and Art for the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742

                    Comment

                    • lh3
                      Senior Member
                      • Feb 2008
                      • 686

                      #11
                      Originally posted by shi View Post
                      Mason and Art do allow you to specify the rates of indels and SNPs introduced to simulation reads.
                      Thanks for the clarification. I was reading their readme only. I can see more options from the command-line helps. ART is unable to simulate variants. Its indel error model is unclear. Mason is quite complete in features (though I have not figured out how to feed error profiles to mason). I may recommend it over wgsim/dwgsim. However, to use mason, we should still use high variant probability or high sequencing error rate to stress the mapper. The default setting is much easier than real data - due to the coalescent process, there are far more hard SNP-dense regions in real data than most simulations.

                      It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome and then mutated read bases according to their quality scores (the lower the quality scores, the more likely they will be changed to different bases). These made the simulation reads very similar to the real reads.
                      Also thanks for this. I missed that bit. I assume you are extracting the full quality string of a read, not a single base. Is that right (it is not clear from the text)? Using real quality has other concerns, but anyway the key to using real quality properly is to sample the quality of a whole read, instead of sample the quality of each base.

                      We also included simulation reads generated from Mason and Art for the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742
                      Actually I reviewed a version of your manuscript. I kept anonymous that time (I sign most reviews nowadays). I suggested the ROC-like representation and recommended to focus on RNA-seq.

                      Comment

                      • shi
                        Wei Shi
                        • Feb 2010
                        • 236

                        #12
                        Yes, we extracted the entire string of quality scores from each read in a SEQC dataset and assigned them to simulation reads.

                        You can feed a quality profile to Art, but Mason does not allow you do that. When running Mason, we used the same Indel and SNP rates as used in the BWA paper.

                        It makes more sense to me to use the typical rates of indel, SNP and the sequencing error because this will make it easier to see the close to real performance of aligners. For example, 1 SNP out of 1000 bases, 1 indel out of 10000 bases and sequencing error rate of 1 to 5 percent.

                        Comment

                        • lh3
                          Senior Member
                          • Feb 2008
                          • 686

                          #13
                          We should use higher SNP/indel rate, because that is more real. If you plot the heterozygosity of a single individual along the genome, you will find the heterozygosity is far from uniform. Some 100kb regions may contain no SNPs but other regions may reach a heterozygosity 1-2%. That is caused primarily by coalescence and partly by selection, non-homologous gene conversions and variable recombination/mutation rates. Although the fraction of high heterozygosity regions is small, the variant error rate is much higher. If you simply simulate 1 SNP per 1000 base, you can rarely get regions with high SNP density, which deviates from real data and is less useful for evaluating the performance of a mapper in the context of variable calling and the discovery of structural variations that demands even higher mapping accuracy.

                          As to mason, I think it has a parameter to feed error profiles. I just do not know the format.

                          Comment

                          • shi
                            Wei Shi
                            • Feb 2010
                            • 236

                            #14
                            It will certainly be useful to incorporate into the generation of simulation reads the information of the distribution of positions of SNPs and indels on the reference genome. I think the dbSNP database might be useful for this, but it seems harder for indels.

                            Comment

                            • zee
                              NGS specialist
                              • Apr 2008
                              • 249

                              #15
                              Originally posted by lh3 View Post
                              I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate it to the human genome without shifting the coordinates, and simulate reads from that. As to sequencing errors, we should randomly draw a the quality of a real read, generate errors from the recalibrated quality, but output raw base quality. Such a simulation will be much more realistic than anything at present. Nonetheless, for mapper evaluation, probably this would not lead to much difference, and, life is too short to do everything perfectly.
                              I like this idea a lot as well. We should try it sometime.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                                Here are nine questions we think about, in roughly the order they matter, before...
                                Today, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 06:09 AM
                              0 responses
                              16 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              36 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              42 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              49 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...