SEQanswers

Old 11-30-2015, 10:11 AM   #1
ClemBuntu
Member
 
Location: Lyon

Join Date: Dec 2014
Posts: 37
Illumina Error rate

Hello everyone,

This must be a silly question, but there is something I don't understand: the error rate of Illumina sequencing (calculated using the phiX genome) is around 1%. But most of the reads have a mean quality >= Q30, which corresponds to an error rate of 0.1%.

So what does the error rate actually mean?
Old 11-30-2015, 04:33 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can't estimate error rates with a linear average of log-transformed values. If you map to phiX with BBMap, like this:

bbmap.sh in=reads.fq ref=phix.fa mhist=mhist.txt qhist=qhist.txt qahist=qahist.txt

...then graph qhist with something like Excel or R, you will see the difference between the linear and logarithmic averages, as well as the actual error rate. The actual error rate will much more closely resemble the lines of the logarithmic averages.

qahist.txt (quality accuracy histogram), on the other hand, will show the actual measured quality for each stated quality score, which tells you how accurate the quality scores are. Normally they're not too far off, but it depends heavily on the specific platform and software version.
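To make the averaging point concrete, here is a minimal Python sketch. It is not BBMap's code: the helper names are hypothetical and the ten quality values are made up for illustration. It compares the plain (linear) mean of the Phred scores with the quality that corresponds to the mean error probability.

```python
import math

def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

# Made-up example: nine good bases (Q38) and one poor base (Q2).
quals = [38] * 9 + [2]

# Linear average of the log-scaled scores (the misleading number).
linear_mean_q = sum(quals) / len(quals)

# Average the error probabilities first, then convert back to Phred.
mean_prob = sum(phred_to_prob(q) for q in quals) / len(quals)
effective_q = prob_to_phred(mean_prob)

print(f"linear mean quality: Q{linear_mean_q:.1f}")        # Q34.4
print(f"mean error probability: {mean_prob:.4f}")          # about 0.0632
print(f"quality of the mean error rate: Q{effective_q:.1f}")  # about Q12.0
```

A single low-quality base dominates the expected error rate, which is why a read can have a mean quality above Q30 and still carry an expected error rate well above 0.1%.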
Old 12-08-2015, 07:23 AM   #3
ClemBuntu
Member
 
Location: Lyon

Join Date: Dec 2014
Posts: 37
Default

Sorry, but I don't understand.

What's the point of the qhist histogram?

And about qahist, am I supposed to see that most of my reads have a quality greater than or equal to 35?
Attached Images
File Type: jpg qhist.JPG (30.4 KB, 35 views)
File Type: jpg qahist.JPG (29.6 KB, 23 views)
Old 12-08-2015, 01:09 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The qhist shows you, per position in the read, what the expected error rate is (that's read1_log) and what the actual average error rate is (that's read1_measured). As you can see, those completely disagree, so the quality scores are very inaccurate (almost meaningless) for this library, and the error rates are high: the average measured quality starts at Q20 (1% error) and drops to Q11 (about 8% error) by the end. The X axis is read position.

The qahist is plotted incorrectly - you need to plot the "Quality" column as the X-axis, and the "TrueQuality" column as the Y-axis; discard the other columns.
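As a rough illustration of keeping only those two columns, the Python sketch below pulls (Quality, TrueQuality) pairs out of tab-delimited text. The exact file layout is an assumption here: only the "Quality" and "TrueQuality" column names come from this thread, while the "#" header prefix, the "Match" and "Sub" placeholder columns, the sample rows, and the function name are hypothetical.

```python
# Hypothetical stand-in for the contents of qahist.txt; the numbers are made up.
SAMPLE = (
    "#Quality\tMatch\tSub\tTrueQuality\n"
    "2\t900\t100\t10.00\n"
    "22\t950\t50\t13.01\n"
    "35\t9990\t10\t30.00\n"
)

def read_quality_pairs(text):
    """Return a list of (Quality, TrueQuality) pairs, ignoring other columns."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Locate the header line that names the columns (assumed to start with '#Quality').
    header_idx = next(i for i, ln in enumerate(lines) if ln.startswith("#Quality"))
    header = lines[header_idx].lstrip("#").split("\t")
    q_col = header.index("Quality")
    t_col = header.index("TrueQuality")
    pairs = []
    for ln in lines[header_idx + 1:]:
        fields = ln.split("\t")
        pairs.append((int(fields[q_col]), float(fields[t_col])))
    return pairs

pairs = read_quality_pairs(SAMPLE)
for stated, measured in pairs:
    print(f"stated Q{stated} -> measured Q{measured:.2f}")
```

The resulting pairs can be handed to any plotting tool with the first value on the X axis and the second on the Y axis. If the quality scores were accurate, the points would lie close to the diagonal.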
Old 12-09-2015, 12:10 AM   #5
ClemBuntu
Member
 
Location: Lyon

Join Date: Dec 2014
Posts: 37
Default

Quote:
Originally Posted by Brian Bushnell View Post
The qhist shows you, per position in the read, what the expected error rate is (that's read1_log) and what the actual average error rate is (that's read1_measured). As you can see, those completely disagree, so the quality scores are very inaccurate (almost meaningless) for this library, and the error rates are high -
What do you mean by 'expected'?


I'm not sure I understand the scale.
Quote:
Originally Posted by Brian Bushnell View Post
the average measured quality starts at Q20 (1% error) and drops to Q11 (about 8% error) by the end.
So I guess the Y axis is the log quality value, 20,000 is Q20, and you are referring to the orange curve, right?



Quote:
Originally Posted by Brian Bushnell View Post

The qahist is plotted incorrectly - you need to plot the "Quality" column as the X-axis, and the "TrueQuality" column as the Y-axis; discard the other columns.
I have redrawn the qahist. Does it also show that the quality ranges from Q8 to Q20?
Attached Images
File Type: jpg quality.JPG (27.3 KB, 27 views)
Old 12-09-2015, 05:34 AM   #6
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

Quote:
Originally Posted by ClemBuntu View Post
What do you mean by 'expected'?
The quality scores produced by the sequencer are expected (i.e., predicted or calculated) scores; see http://www.illumina.com/documents/pr...ity_scores.pdf for an explanation.

Quote:
Originally Posted by ClemBuntu View Post
I'm not sure I understand the scale.

So I guess the Y axis is the log quality value, 20,000 is Q20, and you are referring to the orange curve, right?
Yes, 20,000 is Q20. The relevant curves are the orange one (expected scores) vs. the gray one (actual scores based on your phiX data). These two curves should overlap if the expected scores accurately reflect the true error rate. They do not. As Brian indicated, the true quality begins at Q20 (1% error) and declines to Q11 (about 8% error) at the end of the read. These quality scores are very low.
Old 12-09-2015, 10:04 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by ClemBuntu View Post
I have redrawn the qahist. Does it also show that the quality ranges from Q8 to Q20?
Each point on the qahist indicates the claimed quality (from the quality scores) versus the measured quality (based on the alignment match/mismatch rate). So, for example, you have a point at X=22, Y=13. That means that if you take all bases with a stated quality score of Q22 (roughly 99.3% accuracy), on average, they have an error rate indicating Q13 (roughly 95% accuracy).
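The binning described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how BBMap computes the histogram: the function name is hypothetical and the observations (stated quality plus whether the aligned base matched the reference) are made up so that the Q22 bin lands near the X=22, Y=13 point mentioned above.

```python
import math
from collections import defaultdict

def quality_accuracy(observations):
    """Group bases by stated quality and compute the measured quality per bin.

    observations: iterable of (stated_q, is_match) pairs.
    Returns {stated_q: measured_q}, with None where no mismatch was observed.
    """
    counts = defaultdict(lambda: [0, 0])  # stated_q -> [total bases, mismatches]
    for stated_q, is_match in observations:
        counts[stated_q][0] += 1
        if not is_match:
            counts[stated_q][1] += 1
    table = {}
    for stated_q, (total, errors) in sorted(counts.items()):
        if errors:
            table[stated_q] = -10 * math.log10(errors / total)
        else:
            table[stated_q] = None  # zero observed errors: quality is unbounded
    return table

# Made-up data: Q22 bases with 5% mismatches, Q35 bases with 0.1% mismatches.
obs = [(22, True)] * 95 + [(22, False)] * 5 + [(35, True)] * 999 + [(35, False)]
table = quality_accuracy(obs)
for stated_q, measured_q in table.items():
    print(f"stated Q{stated_q}: measured Q{measured_q:.2f}")
```

With these numbers, the bases labeled Q22 behave like Q13 bases, so that bin would plot well below the diagonal.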
Old 12-09-2015, 12:29 PM   #8
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

Hey Brian, what's the point at (0,60)? It looks like all the Q0s are 99.9999% accurate!

[ClemBuntu, ignore this message. It's a joke. The lowest quality produced by Illumina is Q2]
Old 12-09-2015, 12:37 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That's actually a valid point! The quality scores vary by platform and software version (I'm guessing this is NextSeq V1 chemistry, or HiSeq 4000). Normally, for non-binned quality scores (like HiSeq 2000), there is 0 (for N), 2, then 5-41. Quite often, Q2 bases are more accurate than Q5 bases, as 2 has a special meaning. But sometimes called bases (A, C, G, T) are produced with Q0 assigned. It's normally very few, under 100. I suspect it's a bug in CASAVA. Because there are so few, it's not uncommon for them to all match the reference. To keep the axes finite, I cap all quality values at 60, but technically, in this case, the Q0 bases are 100% accurate (Q infinity). I wouldn't count on it in general, though.
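The effect of the cap can be shown with a small Python sketch. The cap of 60 comes from the post above; everything else (the function name, the counts, and the exact way the cap is applied) is an assumption made for illustration and is not taken from BBMap's source.

```python
import math

MAX_Q = 60  # cap mentioned above, used to keep the plot axes finite

def capped_quality(total, errors, cap=MAX_Q):
    """Measured Phred quality from base counts, limited to `cap`."""
    if total <= 0:
        raise ValueError("total must be positive")
    if errors == 0:
        # No observed errors would mean infinite quality; report the cap instead.
        return float(cap)
    return min(float(cap), -10 * math.log10(errors / total))

# Made-up counts: a handful of Q0 bases that all happen to match the reference.
print(capped_quality(80, 0))          # 60.0, the (0, 60) point on the plot
# A bin with an ordinary error rate is unaffected by the cap.
print(round(capped_quality(100, 5), 2))   # 13.01
```

This also shows why a point at Y=60 says little when it is backed by very few bases: with under 100 observations, zero mismatches is easy to get by chance.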
Old 12-09-2015, 12:52 PM   #10
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

Thanks, Brian. As always, you're a fount of knowledge.