SEQanswers

04-14-2013, 08:05 AM   #1
adaptivegenome (Super Moderator; Location: US; Join Date: Nov 2009; Posts: 437)

New way to compare mappers and variant callers

Guys,

I'm starting a new thread for GCAT, which is a completely free and collaborative way to compare genome analysis pipelines:

http://www.bioplanet.com/gcat

You can link to specific comparisons, and they are fully interactive, so this should make discussion here better: we can link to real data!

Would love everyone's input. It's a cool project, but it's brand new and will need lots of work!
04-14-2013, 08:43 AM   #2
oiiio (Senior Member; Location: USA; Join Date: Jan 2011; Posts: 105)

In response to @lh3 from http://seqanswers.com/forums/showthr...t=15200&page=4

Although the metrics used in your website's comparison are similar, it looks like you used a wiggle of 50bp, which could account for the differences. The curve axes are incorrect reads / mapped reads and correct reads / total reads in the FASTQ.
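As a minimal sketch (the function name is mine, and it assumes the simulator's true position is known, e.g. encoded in the read name as wgsim/dwgsim do), the wiggle criterion amounts to:

Code:
# Count a mapped read as "correct" if it lands within `wiggle` bp of the
# position it was simulated from (50bp here, matching the GCAT setting).
def is_correctly_mapped(true_pos, mapped_pos, wiggle=50):
    return abs(mapped_pos - true_pos) <= wiggle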

From other reports with different read lengths and mutation rates, I see pretty consistently that BWA-MEM does indeed supersede BWA-SW, and MEM is also the most accurate open-source aligner I have seen so far!
04-14-2013, 05:36 PM   #3
shi (Wei Shi; Location: Australia; Join Date: Feb 2010; Posts: 235)

Quote:
Originally Posted by adaptivegenome (post #1)
This looks like an interesting project, and I'm interested in running the Subread aligner on the test datasets. Could you please provide a bit more information about the datasets, such as the fragment length of the PE data and the indel lengths in the small- and large-indel datasets?

Thanks,
Wei
04-14-2013, 06:19 PM   #4
oiiio (Senior Member; Location: USA; Join Date: Jan 2011; Posts: 105)

The inner distance for paired-end reads is always 300bp (so a template size of 500bp for 100bp reads).
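A one-line sanity check of that arithmetic (a sketch; the function name is mine):

Code:
# Paired-end template = read1 + inner distance + read2.
def template_size(read_len, inner_dist):
    return 2 * read_len + inner_dist

assert template_size(100, 300) == 500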

The small-indel datasets contain indels of 1-10bp, and the large-indel datasets 10-24bp.

Looking forward to seeing the Subread results!
04-14-2013, 10:14 PM   #5
shi (Wei Shi; Location: Australia; Join Date: Feb 2010; Posts: 235)

Thanks for the info, oiiio. I will try to upload the Subread results soon.

I would also like to comment on the simulation data generated in your project. It looks like you used the wgsim simulator to generate the reads. This simulator seems to assume that sequencing errors are uniformly distributed along the read, but we know this is not true: Illumina HiSeq sequencers, for example, produce more errors at the start/end of reads. So you may want to add other read simulators to your project as well.

The simulator Mason was used in both the Bowtie2 paper and the Subread paper, so you may consider using it as well. The simulator ART was also used in the Subread paper.

In the simulation data you have generated with wgsim, all read bases have exactly the same quality character ("2"), which corresponds to a Phred score of 17 and is quite low. This makes your datasets quite different from real datasets, and it will particularly affect those aligners that make use of quality-score information during alignment. ART and Mason seem to do better in this regard.
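For readers following along, the conversion assumes the standard Phred+33 (Sanger) FASTQ encoding; a minimal check:

Code:
# The FASTQ quality character '2' under Phred+33 encoding.
q = ord('2') - 33           # 50 - 33 = 17
p_err = 10 ** (-q / 10)     # ~0.02, i.e. about a 2% error probability per base
print(q, p_err)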

Cheers,
Wei
04-15-2013, 06:11 AM   #6
adaptivegenome (Super Moderator; Location: US; Join Date: Nov 2009; Posts: 437)

GCAT actually uses DWGSIM, but you are right about the assumptions. GCAT also includes real data for the variant-calling benchmarks, but I agree: we should try some other simulators.

We will give Mason a try, and it would be awesome if you uploaded some results in the meantime.

Thanks for the input!


Quote:
Originally Posted by shi (post #5)
04-15-2013, 06:44 AM   #7
oiiio (Senior Member; Location: USA; Join Date: Jan 2011; Posts: 105)

This is a good point, and I think it would be very interesting to see how the alignment reports change with the simulator as well. Comparing simulators that do and do not use that kind of error model could test how reliant quality-sensitive aligners are on these base qualities, and also on the expected error positions, which would be very cool!

In the meantime, you may also find the variant-calling test interesting, as it is performed on real whole-exome data (NA12878) from multiple platforms. With it you can compare the VCFs generated by aligner + variant-caller pipelines.

For example, here is the variant-calling report for the 100bp, 30x-coverage Illumina exome:

http://www.bioplanet.com/gcat/report...0x/bwa-gatk_ug


Quote:
Originally Posted by shi (post #5)

04-15-2013, 07:15 AM   #8
lh3 (Senior Member; Location: Boston; Join Date: Feb 2008; Posts: 693)

No read simulator produces truly realistic data. ART and Mason infer quality from empirical alignments, which will: 1) bias towards mappers using similar algorithms; 2) miss reads with excessive mismatches/gaps; and 3) be dependent on the reference and the sequences in use. I don't know much about ACANA, used by ART, but RazerS, used by Mason, does not do affine-gap alignment, which will affect its gap modeling. My understanding is that both of them assume position independence. This is clearly not true for Illumina reads: some reads are systematically worse at every base than others (for example, if there is a Q2 base, the following bases are also Q2).

If I am right, both ART and Mason also assume no variation between the reads and the reference genome. This is rarely the case in practice. In an Illumina alignment, the majority of gaps, especially long gaps, are variants rather than sequencing errors. Furthermore, when we align reads to a related species ~1% divergence away from the sequenced sample, the reads will effectively carry 1% uniform errors, and 1% divergence is not rare when we study non-human organisms. In these respects, ART and Mason do some things better but other things worse.

Wgsim uses a default sequencing error rate as high as 2% because I want to stress the aligners. No mapper makes a mistake on a perfectly matched read. However, with 100bp or longer reads, we should be able to access regions of 2% divergence (for human, quite a lot of reads span two SNPs). It is these regions that show, and need, the capability of an aligner. Longer indels (say >5bp), which I guess are very rare in Mason/ART simulations, also stress the mappers a lot.
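To put a rough number on "quite a lot of reads span two SNPs", here is a back-of-the-envelope calculation under a simple binomial model (assumed for illustration):

Code:
# P(>= 2 mismatches) for a 100bp read at 2% per-base divergence.
n, p = 100, 0.02
p_ge2 = 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)
print(round(p_ge2, 3))  # ~0.597: most such reads carry two or more differences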

Finally, my experience is that the relative performance of mappers stays the same within a reasonably large range of error configurations. The maq read simulator generates more realistic errors; I stopped using that model in wgsim mainly because, when I tested it at the time, it added little to the evaluation of mappers.


I used to think about how to simulate more realistic data. I think we should take the variants from the Venter/HuRef genome, incorporate them into the human genome without shifting the coordinates, and simulate reads from that. As for sequencing errors, we should randomly draw the quality string of a real read, generate errors from the recalibrated quality, but output the raw base quality. Such a simulation would be much more realistic than anything at present. Nonetheless, for mapper evaluation, this probably would not lead to much difference, and life is too short to do everything perfectly.
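A minimal sketch of the first half of that recipe, substituting known SNP alleles into the reference without shifting coordinates (SNPs only, since indels would shift positions, which the proposal avoids; the variant-list format is made up for illustration):

Code:
# `snps` maps 0-based reference positions to alternate alleles (single bases).
def apply_snps(reference, snps):
    seq = bytearray(reference, "ascii")
    for pos, alt in snps.items():
        seq[pos] = ord(alt)
    return seq.decode("ascii")

# Toy example with two substitutions:
print(apply_snps("ACGTACGTACGT", {3: "A", 7: "C"}))  # ACGAACGCACGT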

04-15-2013, 08:04 AM   #9
adaptivegenome (Super Moderator; Location: US; Join Date: Nov 2009; Posts: 437)

Quote:
Originally Posted by lh3 (post #8, on simulating more realistic data)
Another great quote from Heng.
04-15-2013, 02:48 PM   #10
shi (Wei Shi; Location: Australia; Join Date: Feb 2010; Posts: 235)

Mason and ART do allow you to specify the rates of indels and SNPs introduced into the simulated reads.

It is a good idea to use base-calling quality scores to mutate read bases to generate errors. This is actually what we did in our evaluation in the Subread paper. We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome, and then mutated read bases according to their quality scores (the lower the quality score, the more likely the base is changed to a different one). This made the simulated reads very similar to real reads. We also included simulated reads generated by Mason and ART in the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742
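A minimal sketch of that mutation step (the function name is mine, and the Subread paper's actual implementation may differ); note that it takes the whole quality string of one real read, which matters for the point raised below:

Code:
import random

# Mutate each base with the error probability implied by its Phred+33 quality.
def mutate_by_quality(read, qual, offset=33):
    out = []
    for base, q in zip(read, qual):
        p_err = 10 ** (-(ord(q) - offset) / 10)
        if random.random() < p_err:
            base = random.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)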
04-15-2013, 04:17 PM   #11
lh3 (Senior Member; Location: Boston; Join Date: Feb 2008; Posts: 693)

Quote:
Originally Posted by shi
Mason and ART do allow you to specify the rates of indels and SNPs introduced into the simulated reads.
Thanks for the clarification. I was reading only their READMEs; I can see more options in the command-line help. ART is unable to simulate variants, and its indel error model is unclear. Mason is quite complete in features (though I have not figured out how to feed error profiles to Mason), so I may recommend it over wgsim/dwgsim. However, to use Mason, we should still use a high variant probability or a high sequencing error rate to stress the mapper. The default setting is much easier than real data: due to the coalescent process, there are far more hard, SNP-dense regions in real data than in most simulations.

Quote:
Originally Posted by shi
We took quality scores from a SEQC dataset, assigned them to reads extracted from the human genome, and then mutated read bases according to their quality scores [...]
Also thanks for this; I missed that bit. I assume you are extracting the full quality string of a read, not a single base at a time. Is that right (it is not clear from the text)? Using real qualities has other concerns, but the key to using them properly is to sample the quality string of a whole read, rather than sampling the quality of each base independently.

Quote:
Originally Posted by shi
We also included simulated reads generated by Mason and ART in the evaluation. For more details, see our paper: http://www.ncbi.nlm.nih.gov/pubmed/23558742
Actually, I reviewed a version of your manuscript. I stayed anonymous at the time (I sign most of my reviews nowadays). I suggested the ROC-like representation and recommended focusing on RNA-seq.
04-15-2013, 04:30 PM   #12
shi (Wei Shi; Location: Australia; Join Date: Feb 2010; Posts: 235)

Yes, we extracted the entire string of quality scores from each read in a SEQC dataset and assigned them to the simulated reads.

You can feed a quality profile to ART, but Mason does not allow you to do that. When running Mason, we used the same indel and SNP rates as used in the BWA paper.

It makes more sense to me to use typical rates of indels, SNPs and sequencing errors, because this makes it easier to see performance close to what aligners achieve on real data: for example, 1 SNP per 1,000 bases, 1 indel per 10,000 bases and a sequencing error rate of 1 to 5 percent.
04-15-2013, 05:44 PM   #13
lh3 (Senior Member; Location: Boston; Join Date: Feb 2008; Posts: 693)

We should use a higher SNP/indel rate, because that is more realistic. If you plot the heterozygosity of a single individual along the genome, you will find it is far from uniform: some 100kb regions may contain no SNPs, while other regions may reach a heterozygosity of 1-2%. That is caused primarily by coalescence, and partly by selection, non-homologous gene conversion and variable recombination/mutation rates. Although the fraction of high-heterozygosity regions is small, the variant error rate there is much higher. If you simply simulate 1 SNP per 1,000 bases, you rarely get regions with high SNP density, which deviates from real data and is less useful for evaluating the performance of a mapper in the context of variant calling and the discovery of structural variations, which demands even higher mapping accuracy.

As for Mason, I think it has a parameter for feeding in error profiles; I just do not know the format.
04-15-2013, 06:30 PM   #14
shi (Wei Shi; Location: Australia; Join Date: Feb 2010; Posts: 235)

It would certainly be useful to incorporate information about the distribution of SNP and indel positions on the reference genome into the generation of simulated reads. The dbSNP database might be useful for this, though it seems harder for indels.
04-16-2013, 10:25 AM   #15
zee (NGS specialist; Location: Malaysia; Join Date: Apr 2008; Posts: 249)

Quote:
Originally Posted by lh3 (post #8, on simulating more realistic data)
I like this idea a lot as well. We should try it sometime.