SEQanswers

08-12-2015, 09:20 PM   #1
NYGen
Member
 
Location: Practically Canada, NY

Join Date: Aug 2014
Posts: 20
WHY are the first few bases of Illumina HiSeq reads of lower quality?

Hello Seqanswers,

I'm trying to learn: why are the first few bases of Illumina HiSeq reads of lower quality? The general nature of this question makes it feel like it has been asked numerous times before, but googling (+scholaring), seqanswer-ing, and biostar-ing have left me with no satisfactory answer to my question, which is itself based on this FastQC-generated quality score plot:



This plot is generated from a full lane of high-output Illumina V4 HiSeq 125x2 genomic sequences prior to any read QC (sequences are completely unedited). Note that in the first five bases, the average base-call quality score is notably lower than in the next hundred bases. Now, what stumps me about the phenomenon emerges from this FastQC plot of the quality-trimmed reads, where no base-call with quality <25 remains in the data set:



This gives a pristine, cut-and-dried view of a dataset containing many millions of sequences, where the first 5 bp have an average quality of 33 and the rest of the data has 36. What stumps me is that the very-low-quality bases have been removed, meaning that the first five bases are (given the negative-log nature of quality scores) really only slightly lower in quality than the bases that follow, which implies (to me) that they are useful data, just slightly less reliable. So, how can this be?
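To put numbers on the negative-log scale: a Phred score Q corresponds to an estimated error probability p = 10^(-Q/10), so Q33 versus Q36 is roughly 1 error in 2,000 bases versus 1 in 4,000. A minimal Python sketch of that conversion (the function name `phred_to_error_prob` is just for illustration):

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an estimated error probability.

    Uses the standard Phred definition: Q = -10 * log10(p).
    """
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # 25 is the trimming threshold from the post; 33 and 36 are the two averages.
    for q in (25, 33, 36):
        p = phred_to_error_prob(q)
        print(f"Q{q}: p = {p:.2e}")
```

A 3-point drop in Q is about a 2x increase in expected error rate, which is small in absolute terms at this quality level.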

Explanations and threads I've considered:

- Illumina's FAQ says about mRNA-seq:

Quote:
Why is there a higher error rate for the first few bases?
The first two or three bases in mRNA-Seq reads have slightly elevated error rates compared to genomic DNA samples. We believe that this is an effect of the random priming process. The bases at the beginning of each read were likely at the back end of the random primer, away from the extending polymerase, during the priming process. It appears that this observation is a measurement of the mismatch pairing that is tolerated on the other end of the primer during the extension process by the polymerase.
This is the only place where Illumina's FAQ discusses reduced quality at the beginning of reads, and they chalk it up, unsatisfyingly bluntly imho, to primer/template disagreement. Also, my reads are genomic, and not from an mRNA library. Does Illumina mean to imply that this happens for genomic reads AND also happens more often and more strongly in mRNA-seq libraries?

- The only other seqanswers thread I found discussing this point explicitly:

Quote user kmcarr:
Quote:
According to our FAS Illumina determined that their earlier error model was in fact over estimating the quality of the base calls at the beginning of the read. This was determined by plotting expected vs. observed error rates. They have adjusted the error model so that the called Q-scores more closely match observed error rates. He did not explain why the error rate for the first 10-15 bases was higher than later bases. He also mentioned that their studies indicated they had previously been under estimating Q-scores for later cycles so those have been adjusted upward in the current error model.
Does this mean that there's likely no physical difference between the first few base calls' quality, and that Illumina intentionally drags down the quality of the first few bases to fit an error model (not passing judgment on the model at all; just trying to build intuitive comprehension)?

- This thread mentions an enigma re the first five bases:

Of interest is the statement
Quote:
The reps in the [Illumina] webinar just said a non-G is required in the first five bases of the read, but for simple statistical reasons that's not likely to be much of a problem.
by user jwfoley, but it could be nothing.

Any insights are more than welcome; I'd simply like to know why the first bases have lower quality!
08-13-2015, 05:30 AM   #2
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 505

The first four cycles of data are used for cluster calling, and also for establishing metrics (e.g., signal thresholds) for base-calling. Base-calls for the first cycles rely on a standard set of parameters, but for subsequent cycles have been calibrated to the actual data (and are therefore more accurate).

Illumina also corrects the signal for phasing. Each cluster contains ~1000 copies, and imperfect chemistry means that some molecules are +1 or -1 relative to the actual cycle. Base-calling accuracy is improved by correcting the measurement (filtering the signal based on the preceding and subsequent cycles). Phase correction for cycle five is partially dependent upon the lower-quality preceding cycle, so its quality is also lower (either that, or the algorithm doesn't use cycle four for phase correction). This is the same reason why quality of the last base is always significantly lower, since there's no subsequent cycle for phase correction.
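As a rough illustration of the phasing idea, here is a toy Python model. It is not Illumina's algorithm: the 5% phasing and prephasing fractions, the single-channel signal, and the first-order correction are all assumptions made up for this sketch. It only shows why a correction that needs the next cycle's signal has nothing to work with at the last cycle.

```python
def observe(template, n_cycles, phasing=0.05, prephasing=0.05):
    """Simulate per-cycle signal in one channel for a partly out-of-phase cluster.

    template: true per-position signal (e.g. 1.0 where the base matches the
              channel), at least n_cycles + 1 long.
    phasing / prephasing: hypothetical fractions of molecules lagging one
              cycle behind / running one cycle ahead.
    """
    in_phase = 1.0 - phasing - prephasing
    observed = []
    for c in range(n_cycles):
        behind = template[c - 1] if c > 0 else 0.0  # laggards show the previous base
        ahead = template[c + 1]                     # leaders show the next base
        observed.append(in_phase * template[c] + phasing * behind + prephasing * ahead)
    return observed


def correct(observed, phasing=0.05, prephasing=0.05):
    """First-order correction: subtract the neighbouring cycles' contribution."""
    in_phase = 1.0 - phasing - prephasing
    n = len(observed)
    corrected = []
    for c in range(n):
        prev = observed[c - 1] if c > 0 else 0.0
        # At the last cycle there is no following measurement to subtract.
        nxt = observed[c + 1] if c < n - 1 else 0.0
        corrected.append((observed[c] - phasing * prev - prephasing * nxt) / in_phase)
    return corrected
```

In this toy setup an interior cycle is cleaned up by the correction, while leakage from the unread base after the final cycle stays in the last cycle's signal.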
08-13-2015, 10:37 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

In my tests, the low quality values at the beginning of the read are generally erroneous, just an artifact of the software. Here's an example of a HiSeq run:



The first 13bp have artificially lowered quality values. The first base really is inaccurate, but the quality seems to peak by around the 5th base. The lines for "measured" reflect the actual quality as measured by mapping and counting mismatches.
Attached Images
File Type: png Quality.png (41.1 KB, 337 views)
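For anyone wanting to reproduce a "measured" curve like this, the idea can be sketched in Python as below. This is a minimal illustration, not the tool used for the plot: it assumes you already have, for each read position, a count of aligned bases and a count of mismatches against the reference (how you obtain those from your aligner is up to you), and the function names are hypothetical. Note that mismatches also include true variants, so the result is a slightly pessimistic estimate.

```python
import math


def measured_quality(mismatches, total):
    """Empirical Phred-scaled quality for one read position.

    mismatches: aligned bases at this position that disagree with the reference.
    total: all aligned bases at this position.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if mismatches == 0:
        return float("inf")  # no observed errors; quality is unbounded
    return -10 * math.log10(mismatches / total)


def per_position_measured_quality(mismatch_counts, base_counts):
    """Apply measured_quality to parallel per-position count lists."""
    return [measured_quality(m, t) for m, t in zip(mismatch_counts, base_counts)]


if __name__ == "__main__":
    # Made-up counts for three read positions, purely for illustration.
    print(per_position_measured_quality([500, 50, 25], [100000, 100000, 100000]))
```

Plotting these values next to the mean reported Q-score at each position shows where the reported scores run above or below the observed error rate.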
08-13-2015, 10:54 AM   #4
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 505

Quote:
Originally Posted by HESmith
Base-calls for the first cycles rely on a standard set of parameters, but for subsequent cycles have been calibrated to the actual data (and are therefore more accurate).
Brian's analysis is correct (as usual). I should have said that Illumina assigns a conservative (lower) quality score to those early cycles. And you can clearly see the reduced quality of the last cycle in his graph.
08-14-2015, 07:02 AM   #5
MU Core
Member
 
Location: Columbia, Missouri

Join Date: Apr 2008
Posts: 55

There was a technote released by Illumina discussing the changes to the quality predictor modes used by RTA. Technote has been attached.
Attached Files
File Type: pdf RTA_Quality_Predictors_TechNote.pdf (491.6 KB, 242 views)
08-14-2015, 09:04 PM   #6
NYGen
Member
 
Location: Practically Canada, NY

Join Date: Aug 2014
Posts: 20

Wow, this was extremely helpful. Thanks very much for the explanation HESmith and Brian, and also MU Core for the technote which I've saved for future reference. Cheers
Tags
error, hiseq, illumina, sequencing