**rskr** · 08-02-2011, 08:26 AM

Originally posted by rkusko:

Hey all,
We're doing mRNA-seq on the Illumina Hi-Seq, and when we look at the quality score as a function of read position, we see that the quality initially increases, then decreases. We also see this when we do miRNA-seq on the Hi-Seq. With the GAII we see the quality decreases as you travel further out on the read, which is what I would expect. Has anyone else seen this on their Hi-Seq?

What we see on the Hi-Seq:

What we see on the GAII:

(Plots generated using FastQC)

Hypothesis: The Hi-Seq seems to use an algorithm that takes several iterations of the chemistry to determine the precise dimensions of the clusters, then continues to refine them as it goes along, since by default it cannot afford to store all of the images on the local drive from beginning to end. The worst case is when the bases are homogeneous at the start: no clusters can be identified at first, although it can recover somewhat. I am guessing that even with non-homogeneous sequences, it can define the clusters better with more iterations, until the chemistry starts to deteriorate.

One way to test this hypothesis is to store the images and do an analysis with the complete set of images.
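
For anyone who wants to reproduce the quality-versus-position profile discussed here without the original plots, the following is a minimal sketch of the per-position mean that a FastQC-style plot summarizes. The function name `mean_quality_by_position` and the sample strings are made up for illustration, and the ASCII offset of 33 is an assumption: some Illumina pipelines of this era wrote Phred+64 instead, so check your own files.

```python
from collections import defaultdict


def mean_quality_by_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position (index 0 = first base).

    `offset` is the ASCII offset of the quality encoding (assumed 33 here;
    use 64 for older Illumina-encoded FASTQ files).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [totals[pos] / counts[pos] for pos in sorted(totals)]


# Two made-up quality strings of different lengths ('?' = Q30, 'I' = Q40, '5' = Q20)
example = ["?II5", "?I5"]
profile = mean_quality_by_position(example)
print(profile)
```

A rise followed by a fall in this list is the pattern described for the Hi-Seq; a steady decline is the pattern described for the GAII.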

**pmiguel** · 08-02-2011, 10:02 AM

The most recent version of the software that came with the v3 chemistry downgrades the quality of the early bases of a read. I'm not sure what the basis for this is.

Here is a plot from a v3 run we are in the midst of:

(Color denotes raw cluster density)

Looks like there is a population of tiles in the early cycles that have low signal-to-noise. Perhaps these are the source of the lower quality values?

--
Phillip

**PeteH** · 08-02-2011, 03:30 PM

Thanks for the explanation, Phillip. I've also been seeing this pattern of quality scores in my latest Hi-Seq data (exome sequencing).

**greigite** · 08-03-2011, 07:25 AM

We are seeing the same thing with exactly the same step pattern of quality. I'm not sure how to handle this downstream since it essentially means that the quality scores in the first three bases, and possibly those in the next 2 blocks of five bases, cannot be relied upon to be accurate. Should I just be trimming off the first three bases as a matter of course?
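
If trimming does turn out to be the chosen route, the operation itself is simple. This is a minimal sketch, not a recommendation to trim: the helper name `trim_five_prime` and the example record are hypothetical, and a record is assumed to be a plain `(header, sequence, quality)` tuple already parsed from the FASTQ file.

```python
def trim_five_prime(record, n=3):
    """Drop the first n bases and their quality scores from one FASTQ record.

    `record` is assumed to be a (header, sequence, quality) tuple.
    """
    header, seq, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return header, seq[n:], qual[n:]


# Made-up record: the first three bases carry the low '#' scores
trimmed = trim_five_prime(("@read1", "ACGTACGT", "###IIIII"), n=3)
print(trimmed)
```

Keeping `n` as a parameter makes it easy to compare downstream results with and without the trim before committing to either.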

Attached file: per_base_quality.jpg (53.8 KB)

**pmiguel** · 08-03-2011, 07:53 AM

Reminds me of the old Steve Martin routine on his stereo system...

Q30 is still very high quality (1 error per 1000 base calls). Sure we would love to have more of those Q40 (1 error per 10,000 base calls), but have some perspective. Mean quality values on an Ion Torrent don't tend to go above Q20 (1 error per 100 base calls).

--
Phillip
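
The figures quoted in this post follow from the standard Phred definition, Q = -10 * log10(p), where p is the probability that a base call is wrong. A short sketch of the conversion (the function name is just illustrative):

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


for q in (20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error per {round(1 / p)} base calls")
```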

**greigite** · 08-03-2011, 07:58 AM

Originally posted by pmiguel:

Reminds me of the old Steve Martin routine on his stereo system...

Q30 is still very high quality (1 error per 1000 base calls). Sure we would love to have more of those Q40 (1 error per 10,000 base calls), but have some perspective. Mean quality values on an Ion Torrent don't tend to go above Q20 (1 error per 100 base calls).

--
Phillip

My concern is whether those Q30-32 scores for the first 3 bases have anything to do with reality. Another user here has some data from amplicon sequencing that suggests they do not (variation from the known primer sequence is much more than 1 in 1000).

**pmiguel** · 08-03-2011, 08:17 AM

Originally posted by greigite:

My concern is whether those Q30-32 scores for the first 3 bases have anything to do with reality. Another user here has some data from amplicon sequencing that suggests they do not (variation from the known primer sequence is much more than 1 in 1000).

Reasonable concern. But that is not a good test of the quality values. It probably reflects the oligo synthesis error rate.

Maybe look at the phiX error rate vs quality values?

--
Phillip
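
A phiX check of this kind boils down to grouping base calls by their reported Q value and comparing each group's observed mismatch rate with what that Q value promises. The sketch below assumes the alignment to the phiX reference has already been done and reduced to `(reported_q, is_error)` pairs; the function name and the sample counts are invented for illustration, not real data.

```python
import math
from collections import defaultdict


def empirical_quality(calls):
    """Summarize observed error per reported Q value.

    `calls` is an iterable of (reported_q, is_error) pairs.
    Returns {reported_q: (n_calls, n_errors, empirical_q)}, where
    empirical_q is None when no errors were observed in that bin.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in calls:
        totals[q] += 1
        if is_error:
            errors[q] += 1
    summary = {}
    for q in sorted(totals):
        n, e = totals[q], errors[q]
        empirical_q = -10 * math.log10(e / n) if e else None
        summary[q] = (n, e, empirical_q)
    return summary


# Invented example: 1000 calls reported as Q30 with 10 mismatches,
# and 1000 calls reported as Q40 with none.
calls = [(30, False)] * 990 + [(30, True)] * 10 + [(40, False)] * 1000
summary = empirical_quality(calls)
print(summary)
```

In this invented example the Q30 bin has an observed error rate of 1 in 100, i.e. an empirical Q of about 20, which is the kind of gap that would indicate over-reported quality. Bins with few calls or zero errors say little either way.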

**epistatic** · 08-09-2011, 03:05 PM

I have this pattern in all of my HiSeq runs, even in the phiX lanes. I was initially worried and sent the Qscore by Cycle plots to Illumina. They responded:

"Your Illumina FAS asked me to follow up with you on some questions you had about the Qscore pattern you are seeing in your sequencing data, particularly at the very start of the run. I can confirm that this pattern is normal, and the lower scores at the beginning 15 cycles or so is typical. I've attached an example of a typical Qscore heat map from a run with the same software versions as those you are using."

http://dl.dropbox.com/u/30955182/Typ...0HCS%2014x.pdf

**pmiguel** · 08-10-2011, 02:39 AM

So the question is whether runs prior to the v3 chemistry software change have quality values that are inaccurate during the 1st 15 cycles -- too high.

Partially related: For some reason we add an extra cycle to our reads; 101 bases instead of 100. The quality values for base 101 are always substantially lower than for base 100. My suspicion is that this is a bogus downgrading of the quality of that base.

--
Phillip

**kmcarr** · 08-10-2011, 05:18 AM

Originally posted by pmiguel:

So the question is whether runs prior to the v3 chemistry software change have quality values that are inaccurate during the 1st 15 cycles -- too high.

Partially related: For some reason we add an extra cycle to our reads; 101 bases instead of 100. The quality values for base 101 are always substantially lower than for base 100. My suspicion is that this is a bogus downgrading of the quality of that base.

--
Phillip

Phillip,

According to our FAS, Illumina determined that their earlier error model was in fact overestimating the quality of the base calls at the beginning of the read. This was determined by plotting expected vs. observed error rates. They have adjusted the error model so that the called Q-scores more closely match observed error rates. He did not explain why the error rate for the first 10-15 bases was higher than for later bases. He also mentioned that their studies indicated they had previously been underestimating Q-scores for later cycles, so those have been adjusted upward in the current error model.

Regarding the last cycle: phasing/prephasing numbers cannot be included in the error calculation for the last cycle of a read, since for any cycle n you need data from cycle n+1 to estimate (pre)phasing. This is the rationale for adding the extra cycle to the run and trimming it from the final output. The Q-score is lower because there is incomplete data to fit to the error model, thus lower confidence in the result.

**pmiguel** · 08-10-2011, 05:52 AM

Ah, that makes sense.

As for error rate vs. quality scores: the SAV (Illumina's "Sequencing Analysis Viewer") allows some possibly informative plots, like cycle vs. "error rate" on a phiX lane. It seems like the error rate bottoms out around cycle 5 or so, just below 0.1%. To my mind that would be Q30 average quality, whereas for that lane it looks like the median Q value at that cycle is around 35.

Is that what you see?

--
Phillip
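
The comparison in this post can be made explicit by converting the observed error rate back to a Phred value. This is a rough sketch using the approximate numbers quoted above (0.1% error, median Q of 35); the function name is illustrative, and a median Q is not strictly comparable to a mean error rate, so treat the result as a ballpark figure only.

```python
import math


def error_rate_to_phred(p):
    """Convert an observed error rate (a fraction, not a percentage) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("error rate must be in the interval (0, 1]")
    return -10 * math.log10(p)


observed_q = error_rate_to_phred(0.001)  # 0.1% observed error on the phiX lane
reported_q = 35                          # approximate median Q read off the plot

# How many times lower the reported score implies the error rate is
overstatement = 10 ** ((reported_q - observed_q) / 10)
print(f"observed ~Q{observed_q:.1f}, reported ~Q{reported_q}, ratio ~{overstatement:.1f}x")
```

A five-point gap in Q corresponds to roughly a threefold difference in implied error rate.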

**kmcarr** · 08-17-2011, 05:56 AM

Originally posted by pmiguel:

Ah, that makes sense.

As for error rate vs. quality scores: the SAV (Illumina's "Sequencing Analysis Viewer") allows some possibly informative plots, like cycle vs. "error rate" on a phiX lane. It seems like the error rate bottoms out around cycle 5 or so, just below 0.1%. To my mind that would be Q30 average quality, whereas for that lane it looks like the median Q value at that cycle is around 35.

Is that what you see?

--
Phillip

Phillip,

I'm dealing with a small sample size, our HiSeq was just installed and it's still in the middle of read 2 of its setup run (2 x 101 cycles) with the PhiX flowcell. That said, I would concur with your assessment that the error rate bottoms out @ ~cycle 4-5 but I would eyeball it at ~0.05%. It stays about at this level up through cycle 25 and then slowly climbs. At cycle 101 the median error rate was ~0.8% overall, the lowest lane was ~0.6% and the highest ~1.2%.

Kevin

Hi-Seq quality score behavior

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News