11-10-2014, 10:33 AM   #1
Junior Member
Location: Salt Lake City

Join Date: Oct 2014
Posts: 5
illumina quality scores first 2-3 bp reliability?

Hi All-
I apologize if this has been asked. I searched and could not find an answer that addresses this question.

I was speaking to someone who is quite familiar with actually running HiSeq machines and asked about the origin of the lower quality scores for the first few bp of every read (things seem to settle down after 3-4 bp). At least for my data set, FastQC clearly shows (relatively) lower quality scores for the first few bp: 34 for positions 1-4 and 38 for the rest of the read.
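For anyone who wants to check the per-position numbers without FastQC, here is a minimal Python sketch. It assumes Phred+33 encoding and plain 4-line FASTQ records (no wrapped lines); the function name and the toy reads are made up for illustration and are not my actual data.

```python
from collections import defaultdict

def mean_quality_by_position(fastq_lines, offset=33):
    """Return the mean Phred score at each read position.

    Assumes 4-line FASTQ records and Phred+33 encoding (both assumptions).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # only the 4th line of each record holds qualities
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]

# Toy records (hypothetical): 'C' encodes Q34 and 'G' encodes Q38 in Phred+33.
toy = [
    "@read1\n", "ACGTACGT\n", "+\n", "CCCGGGGG\n",
    "@read2\n", "ACGTACGT\n", "+\n", "CCCCGGGG\n",
]
for pos, q in enumerate(mean_quality_by_position(toy), start=1):
    print(pos, q)
```

With a real file you would pass an open file handle instead of the toy list.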

I got an answer I did not expect and cannot find any discussion of.

Basically (and this is a rough summary, as I did not take notes), I was told that the quality scores are calculated during the run by an algorithm that uses information about the read quality of a few preceding bases. I am unsure exactly what it looks at, but I think it is not simply the Q score; it is probably something about the ratio of the signal intensities for the various colors relative to each other. The first few bases (of course) cannot fulfill this requirement, so they are artificially assigned lower scores, but they should not be considered to be of low(er) quality.

This is really not a serious concern, as the FastQC report shows Q scores of 34 for the first 3-4 bp, then jumps to 38 for basically the rest of the read, so I don't think I am going to lose anything by quality trimming, as 34 is quite good. But I was curious whether the general reasoning behind the "low" quality score for the first few bp is correct, and whether I should ever think about this again with respect to quality trimming reads.
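For scale, a Phred score Q corresponds to an estimated error probability of 10^(-Q/10). A quick Python sketch of that standard conversion (the function names are just for illustration):

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse conversion: error probability back to a Phred score."""
    return -10 * math.log10(p)

for q in (20, 34, 38):
    p = phred_to_error_prob(q)
    print(f"Q{q}: p = {p:.6f}, about 1 error in {round(1 / p):,} calls")
```

So Q34 is roughly one expected error in 2,500 calls and Q38 roughly one in 6,300; both are far better than Q20 (one in 100).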

Last edited by rufessor; 11-10-2014 at 10:41 AM.
11-10-2014, 11:52 AM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quality trimming to such high levels is generally not a good idea. For most purposes (mapping, assembly, merging, etc.) you will get better results using a lower threshold, below 20. I tend to use 6-10 in general and 15 at the most.
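To illustrate why a high threshold hurts, here is a deliberately simple end-trimming sketch in Python. It is not the algorithm of any particular trimming tool; the function name, the toy read, and the Phred+33 encoding are all assumptions for illustration.

```python
def trim_ends(seq, qual, threshold, offset=33):
    """Strip bases from both ends while their Phred score is below threshold.

    A naive sketch: real trimmers use more robust criteria than this.
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < threshold:
        start += 1
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]

# Hypothetical read with scores 34,34,34,38,38,38,38,12,8,2 (Phred+33).
seq = "ACGTACGTAC"
qual = "CCCGGGG-)#"

print(trim_ends(seq, qual, threshold=10))  # drops only the two worst tail bases
print(trim_ends(seq, qual, threshold=36))  # also drops the Q34 bases at the start
```

At a threshold of 10 the toy read keeps 8 of its 10 bases; at 36 it keeps only 4, because the perfectly usable Q34 bases at the start are discarded as well.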

The quality scores of the first few bases are not accurate. As you said, I believe they are artificially lowered to compensate for the fact that the base caller has not been trained yet, or the cluster locations have not been nailed down precisely. The true accuracy is substantially higher, in general - though that might not be the case when sequencing low-diversity libraries where you selectively amplified some specific gene sequence.

If you have a reference (or any assembly), you can run BBMap with the flag "mhist=mhist.txt" to produce a histogram of the match/substitution/deletion/insertion rates at every base location in the reads - this is the most accurate way that I know of to determine whether trimming is needed.
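Once you have a per-position histogram like that, turning it into an empirical error rate is simple. A rough Python sketch, assuming a tab-separated table whose first line is a header (possibly starting with "#") and whose first column is the base position. The column names "Match", "Sub", "Del", and "Ins" in the toy data are placeholders and not necessarily what BBMap writes, so check the header of your own file and pass the names it actually uses.

```python
import math

def per_position_error(table_text, match_col, error_cols):
    """Return [(position, error_rate), ...] from a tab-separated histogram.

    Column names are supplied by the caller; the layout is an assumption.
    """
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].lstrip("#").split("\t")
    result = []
    for ln in lines[1:]:
        fields = dict(zip(header, ln.split("\t")))
        match = float(fields[match_col])
        errors = sum(float(fields[c]) for c in error_cols)
        total = match + errors
        rate = errors / total if total else 0.0
        result.append((fields[header[0]], rate))
    return result

def empirical_phred(rate):
    """Phred-scaled version of an observed error rate (inf if no errors seen)."""
    return float("inf") if rate == 0 else -10 * math.log10(rate)

# Made-up numbers, purely to show the calculation.
toy = "#Pos\tMatch\tSub\tDel\tIns\n1\t0.999\t0.001\t0\t0\n2\t0.99\t0.01\t0\t0\n"
for pos, rate in per_position_error(toy, "Match", ["Sub", "Del", "Ins"]):
    print(pos, rate, round(empirical_phred(rate), 1))
```

If the empirical quality of the first few positions matches the rest of the read, the lower reported Q scores there can be ignored for trimming purposes.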

illumina, quality, rnaseq, trim

All times are GMT -8.