We have an application where we'd like to extract information directly from 454 reads without mapping them to a reference. The region we're looking at has conserved repeats interspersed with highly variable regions, and we want to pull both out of individual reads to evaluate repeat conservation and the diversity of the variable regions.
Here's the question: how should we use 454 quality scores to evaluate the accuracy of variants in individual reads without mapping? Can we use them the same way we use Sanger scores, e.g. by ignoring individual bases whose score falls below a cutoff of 20-30? My understanding of 454 quality scores, based on the 2008 Brockman paper in Genome Research, is that the algorithm used to generate them is based on validation against a Sanger dataset. In that paper they suggest that bases with quality scores >30 have miscall errors of approx. 50/Mb. We expect a polymorphism rate higher than this, so using 30 as a quality cutoff seems like it could be valid.
I'm curious to know, how do you handle analysis of base quality in individual, non-mapped reads?
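To make the question concrete, here is a minimal sketch of the kind of per-base filtering I have in mind. It assumes the 454 scores can be read on the nominal Phred scale (Q = -10 * log10(p)), which is exactly the assumption I'm unsure about. The function names, the cutoff default, and the toy read are all made up for illustration; nothing here comes from the 454 software.

```python
def phred_to_error_prob(q):
    """Nominal Phred conversion, Q = -10 * log10(p), solved for p.

    Assumption: the score is calibrated on the Phred scale.
    """
    return 10 ** (-q / 10)


def mask_low_quality(seq, quals, cutoff=30, mask_char="N"):
    """Replace every base whose quality is below `cutoff` with `mask_char`.

    `seq` is the read as a string, `quals` a list of integer scores of the
    same length. Bases with a score equal to the cutoff are kept.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list differ in length")
    return "".join(
        base if q >= cutoff else mask_char
        for base, q in zip(seq, quals)
    )


def expected_errors(quals):
    """Sum of nominal per-base error probabilities over a read."""
    return sum(phred_to_error_prob(q) for q in quals)


if __name__ == "__main__":
    # Toy read, hypothetical values only.
    read = "ACGT"
    scores = [40, 12, 30, 29]
    print(mask_low_quality(read, scores))   # ANGN
    print(expected_errors(scores))
```

Masking rather than trimming keeps the coordinates of the read intact, so the repeats and variable regions can still be located afterwards; whether a masked position should then count as missing data or disqualify the read is part of what I'm asking.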