SEQanswers

12-06-2011, 03:36 PM   #1
ericguo
Junior Member
 
Location: new haven

Join Date: Sep 2011
Posts: 9
Quality Score: FastQC vs Illumina

Hello,

I have a question regarding Illumina quality scores. Which quality report is more reliable: FastQC or the Illumina Sample Summary Information from the Illumina pipeline?

Here is why I ask:

I just got my sequencing data back (from a HiSeq 2000 machine, 50-base run). Based on the Illumina Sample Summary/Report, the quality of my dataset is decent. The Illumina Sample Summary Information tells me that the Mean Quality Score (PF) is 28.43 and %>Q30 bases (PF) is 69.53.

However, when I run my data through FastQC, it tells me that the quality of my data is really bad (please see the attached images). In the two attached plots, the mean quality score is much worse than 28.43.

Why is there a discrepancy between the two quality reports? Which one should I believe?

Also, this is the first time our high-throughput sequencing facility has used the new Illumina pipeline, CASAVA v1.8. I know the quality scores in the new pipeline are different from those in the old one. Could this change explain why FastQC (version 0.10.0, on Galaxy) thinks my data is poor quality?

Thank you in advance for your help!

-Eric
Attached Images
File Type: png per_base_quality.png (10.8 KB, 210 views)
File Type: png per_sequence_quality.png (20.4 KB, 133 views)
12-06-2011, 07:09 PM   #2
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124

Hi Eric,

The FASTQ files might contain all reads, not just the ones that passed quality filtering. In each description line of a read, there should be an N or a Y, indicating whether the read has been filtered (Y means the read failed the filter). There seem to be a large number of reads with a mean quality score of 2; those probably don't pass filtering and so aren't included in the stats Illumina reports. You might try restricting the data to the reads that pass filtering and see if you get similar results.

Justin
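
As a rough illustration of the filtering suggested above, here is a minimal Python sketch. It assumes CASAVA 1.8 style headers, where the part after the first space has the form read:is_filtered:control:index, and plain 4-line FASTQ records (no wrapped sequence lines). The function names are made up for this example.

```python
def is_filtered(header):
    """Return True if a CASAVA 1.8 style header flags the read as filtered (failed QC)."""
    # Assumed layout: @<instrument>:...:<x>:<y> <read>:<is_filtered>:<control>:<index>
    parts = header.rstrip().split(" ", 1)
    if len(parts) < 2:
        return False  # no second field, so there is no flag to read
    fields = parts[1].split(":")
    return len(fields) >= 2 and fields[1] == "Y"


def passing_records(lines):
    """Yield (header, sequence, separator, quality) for reads not flagged Y.

    Assumes every record is exactly four lines long.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if not is_filtered(record[0]):
                yield tuple(record)
            record = []


if __name__ == "__main__":
    demo = [
        "@HWI:1:FC:1:1:10:20 1:N:0:ATCACG", "ACGT", "+", "IIII",
        "@HWI:1:FC:1:1:11:21 1:Y:0:ATCACG", "ACGT", "+", "####",
    ]
    for header, seq, _, qual in passing_records(demo):
        print(header, seq, qual)
```

To use it on a real file, pass an open file handle (for example from `gzip.open(path, "rt")`) as `lines`.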
12-07-2011, 12:09 AM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

If you run FastQC with the --casava option set, it will exclude any reads that were flagged as failing the Illumina QC filter. If you're using the latest version of CASAVA (1.8.2), these reads are no longer reported in the FASTQ output.
12-07-2011, 06:10 AM   #4
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482

Both are bad: either your library is poor or their sequencing is.
12-07-2011, 10:02 AM   #5
ericguo
Junior Member
 
Location: new haven

Join Date: Sep 2011
Posts: 9

Thank you very much for your reply. I filtered my reads (using a 2% subset of my total data), keeping those with quality score > 3. This filtered dataset is about 0.65% of the input file and has a mean quality score of ~28 (see attachment), which is consistent with the Illumina report.

I realize that my data is poor. I am just wondering if it is usable. Some people I talk to say that even if a read has a poor quality score, it is OK to use as long as it is a perfect match to the genome. Is this true? What's your take on this?
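
For anyone wanting to reproduce this kind of check, the following is a minimal sketch and not the procedure actually used above. It assumes Phred+33 encoded quality strings (as written by CASAVA 1.8), interprets the filter as "mean per-read quality > 3", and counts Q30 bases as those with a score of 30 or more. The function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def summarize(quality_lines, min_mean=3.0, offset=33):
    """Summarize reads whose mean quality is strictly above min_mean."""
    reads_total = 0
    reads_kept = 0
    n_bases = 0
    score_sum = 0
    q30_bases = 0
    for qline in quality_lines:
        reads_total += 1
        scores = phred_scores(qline, offset)
        # Skip empty reads and reads at or below the threshold.
        if not scores or sum(scores) / len(scores) <= min_mean:
            continue
        reads_kept += 1
        n_bases += len(scores)
        score_sum += sum(scores)
        q30_bases += sum(1 for s in scores if s >= 30)
    return {
        "reads_total": reads_total,
        "reads_kept": reads_kept,
        "mean_quality": score_sum / n_bases if n_bases else 0.0,
        "pct_q30": 100.0 * q30_bases / n_bases if n_bases else 0.0,
    }


if __name__ == "__main__":
    # 'I' decodes to 40, '#' to 2, '?' to 30 and '5' to 20 under Phred+33.
    print(summarize(["IIII", "####", "??55"]))
```

Note that this averages over bases in the kept reads, which may not match exactly how the Illumina report computes its own summary figures.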
Attached Images
File Type: png per_base_quality_filtered.png (12.4 KB, 94 views)
File Type: png per_sequence_quality_filtered.png (20.6 KB, 50 views)
12-08-2011, 12:20 AM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by ericguo View Post
Thank you very much for your reply. I filtered my reads (using a 2% subset of my total data), keeping those with quality score > 3. This filtered dataset is about 0.65% of the input file and has a mean quality score of ~28 (see attachment), which is consistent with the Illumina report.

I realize that my data is poor. I am just wondering if it is usable. Some people I talk to say that even if a read has a poor quality score, it is OK to use as long as it is a perfect match to the genome. Is this true? What's your take on this?

If you only got decent results from less than 1% of your library then I'd not have huge confidence in those sequences. You could try mapping them and seeing if you get sensible results. We've had libraries which were 95% adapter where we got useful results from the remaining 5%.

One other possibility exists. If your library has biased composition then the Illumina base caller can sometimes get confused and produce poor base calls and quality assignments from what is actually good primary data. You'd be able to see this in the composition plots from FastQC. If this is the case then you can normally rescue these libraries by reanalysing with a fixed calibration matrix and fixed phasing parameters. May be a long shot, but we've seen it happen a few times.
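
The composition check mentioned above can be sketched in a few lines of Python. This only illustrates what a per-position composition plot measures; it is not FastQC's own code, and `base_composition` is a hypothetical helper.

```python
from collections import Counter


def base_composition(sequences):
    """Return, for each read position, the fraction of A, C, G, T and N calls."""
    counts = []  # one Counter per read position
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    fractions = []
    for position_counts in counts:
        total = sum(position_counts.values())
        # Any other symbols count toward the total but are not reported.
        fractions.append({b: position_counts[b] / total for b in "ACGTN"})
    return fractions


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "AAGT", "ATGC"]
    for pos, frac in enumerate(base_composition(reads), start=1):
        print(pos, frac)
```

In an unbiased library the four fractions stay roughly flat along the read; a position where one base dominates, as in the first and third positions of the toy input, is the kind of bias described above.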
12-08-2011, 04:24 AM   #7
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317

Quote:
Originally Posted by ericguo View Post
Thank you very much for your reply. I filtered my reads (using a 2% subset of my total data), keeping those with quality score > 3. This filtered dataset is about 0.65% of the input file and has a mean quality score of ~28 (see attachment), which is consistent with the Illumina report.

I realize that my data is poor. I am just wondering if it is usable. Some people I talk to say that even if a read has a poor quality score, it is OK to use as long as it is a perfect match to the genome. Is this true? What's your take on this?

I call this the "Bennetzen Dictum":

Quote:
Don't waste clean thoughts on dirty data.

It doesn't necessarily answer your question because you will want to calibrate what constitutes "dirty" for yourself. But I think it is worthwhile to consider whenever you have come to the point where you are considering investing some effort analyzing a questionable data set.

Anyone who has worked in science for a period of time has been there. You have some data -- usually you have invested some effort in obtaining it. But the results are marginal. Do you abandon this data (invoke the Bennetzen dictum), or persevere?

There is no correct answer. That isn't the point. The point is you are making a choice. Do that consciously. Don't let hours become days, days weeks, and weeks years without deliberation. Yeah, that will come across as officious and trite. But I have seen it happen many times.

--
Phillip
10-22-2015, 03:15 AM   #8
cproby
Junior Member
 
Location: Dundee, Scotland

Join Date: Oct 2015
Posts: 2

Hi Everyone

I want to do RNAseq from FFPE material. I know this is a big ask. If I can get FASTQC scores across all sequences of 38 with a nice tight peak, is that sufficient?

kind regards
Charlotte
10-22-2015, 04:08 AM   #9
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by cproby View Post
Hi Everyone

I want to do RNAseq from FFPE material. I know this is a big ask. If I can get FASTQC scores across all sequences of 38 with a nice tight peak, is that sufficient?

kind regards
Charlotte

Yes, Phred scores of 38 are plenty good enough. However, the problems you're likely to hit with FFPE material are unlikely to show up as poor sequencing scores; they tend to appear as high duplication levels or contamination, so there will be other bits of QC you'll need to do.
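
For context on what a Phred score of 38 means: the standard definition is Q = -10 * log10(p), where p is the estimated probability that the base call is wrong. A small sketch of the conversion in both directions (the function names are made up for this example):

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the estimated base-call error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) back to a Phred score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q30 corresponds to roughly one expected error per 1,000 base calls.
    for q in (2, 20, 30, 38):
        print(q, phred_to_error_probability(q))
```

Because the scale is logarithmic, averaging Phred scores is not the same as averaging error probabilities, which is worth keeping in mind when comparing mean quality figures between tools.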
All times are GMT -8. The time now is 06:25 AM.

