whereisshe 11-18-2010 12:40 PM

Illumina quality score

Hi everyone,

I have recently started studying DNA assembly, and I found something I could not understand in the following paper:

http://www.sciencedirect.com/science...1&searchtype=a

Figure 1 in the paper shows the relationship between the quality score and the number of errors in reads. The authors say that a quality score of 40 means an error probability of 0.01%, which is true if the following equation is used:

Q = -10 * log10(p / (1 - p))
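As a quick check of the equation above, here is a minimal Python sketch that inverts it to recover p from Q. It assumes the logarithm is base 10, and it compares the result with the standard Phred definition Q = -10 * log10(p), which is not the formula quoted here. The function names are illustrative only and do not come from the paper or from any Illumina tool.

```python
def odds_scale_to_error_prob(q):
    """Invert Q = -10 * log10(p / (1 - p)), the odds-based form quoted above."""
    odds = 10 ** (-q / 10)          # odds = p / (1 - p)
    return odds / (1 + odds)        # solve the odds for p


def phred_to_error_prob(q):
    """Invert the standard Phred definition Q = -10 * log10(p) (for comparison)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Print the error probability implied by a few quality scores on both scales.
    for q in (10, 20, 30, 40):
        print(q, odds_scale_to_error_prob(q), phred_to_error_prob(q))
```

At Q = 40 both forms give an error probability very close to 0.0001, i.e. 0.01%; the two scales only differ noticeably at low scores.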

However, according to the graph, only about 65% of the bases with a score of 40 are correct. Moreover, the percentage of correct bases is almost 0 for every score below 40. I wonder whether this trend is usual or not.
Thank you.

 obig 11-18-2010 02:55 PM

I can see your confusion. There is something very strange about figure 1 in their paper. Panel (b) might make sense if what they were plotting was error rate relative to base position for a 40bp read. How can panel (a) even go up to 41 if the max Illumina quality score is 40 as they themselves state in the text?

Look at these papers for better treatment of the question of quality scores vs error rates:
http://genome.cshlp.org/content/18/5/763.long
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2577856/
http://genomebiology.com/2009/10/8/R83

 whereisshe 11-18-2010 04:00 PM

Hi obig,

The largest value in graph (a) may actually be 40, and 41 may only be displayed because of a wrong axis setting in Excel. In any case, they did not explain the number.
Thank you for recommending the papers. I appreciate it.

Yun

 Gerard te Meerman 11-26-2010 06:45 AM

Illumina quality scores and deviations from ref sequence

I have collected some data on the distribution of Illumina quality scores as a function of base position, together with the actual number of deviations from the reference sequence, for an exome-enriched sample (read length 101). At base position 2, Illumina assigns the lowest quality score in 0.04% of cases. At base position 75 this is already 19%, and at base position 100 it is 49%. This does not correlate well with the observed rates of difference from the human ref37 genome, with 80% of reads mapped using an exact model that allows a maximum of four single-base errors. At base position 2 there are 0.5% deviations, at base position 75 there are 1%, and at base position 100 there are 3.5%. You may interpolate the intermediate positions for a reasonable fit. My conclusion is that the Illumina quality score has only a very limited relation to the observed deviations from the reference sequence. Most deviations are actually sequencing errors, because the mutational load in the human exome is much lower than the deviation rate observed in exome sequencing. To be useful for base calling, a quality score should differentiate much better in the lower quality range.
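The interpolation suggested above can be sketched as follows. This is only an illustration under stated assumptions: it uses straight-line interpolation between the three reported positions (the post does not say which kind of fit was meant), and it converts each rate to a Phred-style score with Q = -10 * log10(rate), treating every deviation as an error. The names `OBSERVED`, `interpolate_rate` and `empirical_phred` are hypothetical.

```python
import math

# (base position, fraction of bases deviating from the reference), from the post
OBSERVED = [(2, 0.005), (75, 0.01), (100, 0.035)]


def interpolate_rate(position, points=OBSERVED):
    """Piecewise-linear interpolation of the deviation rate (an assumed fit)."""
    if position <= points[0][0]:
        return points[0][1]         # clamp below the first reported position
    if position >= points[-1][0]:
        return points[-1][1]        # clamp above the last reported position
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= position <= x1:
            return y0 + (y1 - y0) * (position - x0) / (x1 - x0)


def empirical_phred(rate):
    """Phred-style score implied by an observed error rate."""
    return -10 * math.log10(rate)


if __name__ == "__main__":
    # Show the interpolated rate and implied score at a few positions.
    for pos in (2, 40, 75, 90, 100):
        rate = interpolate_rate(pos)
        print(pos, round(rate, 4), round(empirical_phred(rate), 1))
```

Under these assumptions the reported rates of 0.5%, 1% and 3.5% correspond to roughly Q23, Q20 and Q15, which can then be compared with the scores Illumina actually assigns at those positions.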
