  • Quality Score: FastQC vs Illumina

    Hello,

    I have a question regarding Illumina quality scores. Which quality control is more reliable: FastQC or the Illumina Sample Summary Information from the Illumina pipeline?

    Here is why I ask:

    I just got my sequencing data back (from a HiSeq 2000 machine, 50-base run). Based on the Illumina Sample Summary/Report, the quality of my dataset is decent. The Illumina Sample Summary Information tells me that the Mean Quality Score (PF) is 28.43 and the %>Q30 bases (PF) is 69.53.

    However, when I run my data through FastQC, it tells me that the quality of my data is really, really bad (please see the attached images). If you look at the two plots attached, the mean quality score is much, much worse than 28.43.

    Why is there a discrepancy between the two quality reports? Which one should I believe?

    Also, this is the first time our high-throughput sequencing facility has used the new Illumina pipeline, CASAVA v1.8. I know that the quality scores in the new pipeline are different from those in the old one. Could this change explain why FastQC (on Galaxy, version 0.10.0) thinks my data is poor quality?

    Thank you in advance for your help!

    -Eric

  • #2
    Hi Eric,

    The FASTQ files might contain all reads, not just the ones that passed quality filtering. In each description line of a read, there should be an N or a Y, indicating if the read has been filtered. There seems to be a large number of reads with a mean quality score of 2, and those probably don't pass filtering and so aren't included in the stats Illumina reports. You might try filtering down to those that pass filtering and see if you get similar results.

    Justin
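
The filtering suggested above can be sketched in Python. This is a minimal sketch, assuming CASAVA 1.8 style headers in which the second space-separated field reads `read:is_filtered:control:index`, with `Y` meaning the read was filtered out and `N` meaning it passed. The function names and the two example reads are hypothetical, made up for illustration.

```python
import io

def parse_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def passed_filter(header):
    """Return True when the 'is filtered' field is N, i.e. the read passed filter."""
    # Assumed layout: @instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index
    parts = header.split(" ")
    if len(parts) < 2:
        raise ValueError("no CASAVA 1.8 comment field in header: " + header)
    fields = parts[1].split(":")
    if len(fields) < 2:
        raise ValueError("unexpected comment field in header: " + header)
    return fields[1] == "N"

def keep_passing(handle):
    """Collect only the records whose header says the read passed filter."""
    return [rec for rec in parse_fastq(handle) if passed_filter(rec[0])]

# Two made-up reads: the first passed filter (N), the second was filtered (Y).
example = io.StringIO(
    "@HWI-ST123:1:FC:1:1101:1000:2000 1:N:0:ACGT\n"
    "ACGTACGT\n+\nIIIIIIII\n"
    "@HWI-ST123:1:FC:1:1101:1001:2001 1:Y:0:ACGT\n"
    "NNNNNNNN\n+\n########\n"
)
kept = keep_passing(example)
print(len(kept), "read(s) passed filter")
```

Running FastQC on only the reads kept this way should give a picture closer to the PF statistics in the Illumina report, if failed reads are indeed what is dragging the plots down.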



    • #3
      If you run FastQC with the --casava option set, it will remove any reads that were flagged as failing the Illumina QC filter. If you're using the latest version of CASAVA (1.8.2), these reads are no longer reported in the FASTQ output.



      • #4
        Both are bad: either your library is poor or their sequencing is.



        • #5
          Thank you very much for your reply. I filtered my reads, keeping those with a quality score > 3 (I did this on 2% of my total data). This filtered dataset is about 0.65% of the input file and has a mean quality score of ~28 (see attachment), which is consistent with the Illumina report.

          I realize that my data is poor. I am just wondering if it is usable. Some people I talk to say that even if a read has a poor quality score, it is OK to use as long as it is a perfect match to the genome. Is this true? What's your take on this?
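
The kind of quality-score filter described in this post can be sketched as follows. This is only an illustration, assuming the Phred+33 encoding that CASAVA 1.8 writes and assuming the cutoff is applied to the per-read mean quality; the original post does not say exactly how the cutoff was applied. The function names and example quality strings are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores of one read (string must be non-empty)."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)

def split_by_mean_quality(quality_strings, threshold=3):
    """Return the quality strings whose mean exceeds the threshold, plus the kept fraction."""
    kept = [q for q in quality_strings if mean_quality(q) > threshold]
    return kept, len(kept) / len(quality_strings)

# Made-up reads: '#' decodes to Q2 and 'I' decodes to Q40 under Phred+33.
reads = ["########", "IIIIIIII", "########", "########"]
kept, fraction = split_by_mean_quality(reads)
print(len(kept), "of", len(reads), "reads kept; fraction =", fraction)
```

Under Phred+33, a read made up entirely of `#` characters has a mean quality of exactly 2, which would match the large spike at a mean score of 2 mentioned earlier in the thread.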


          • #6
            If you only got decent results from less than 1% of your library then I'd not have huge confidence in those sequences. You could try mapping them and seeing if you get sensible results. We've had libraries which were 95% adapter where we got useful results from the remaining 5%.

            One other possibility exists. If your library has biased composition then the Illumina base caller can sometimes get confused and produce poor base calls and quality assignments from what is actually good primary data. You'd be able to see this in the composition plots from FastQC. If this is the case then you can normally rescue these libraries by reanalysing with a fixed calibration matrix and fixed phasing parameters. May be a long shot, but we've seen it happen a few times.



            • #7
              I call this the "Bennetzen Dictum":

              Don't waste clean thoughts on dirty data.
              It doesn't necessarily answer your question because you will want to calibrate what constitutes "dirty" for yourself. But I think it is worthwhile to consider whenever you have come to the point where you are considering investing some effort analyzing a questionable data set.

              Anyone who has worked in science for a period of time has been there. You have some data -- usually you have invested some effort in obtaining it. But the results are marginal. Do you abandon this data (invoke the Bennetzen dictum), or persevere?

              There is no correct answer. That isn't the point. The point is you are making a choice. Do that consciously. Don't let hours become days, days weeks, and weeks years without deliberation. Yeah, that will come across as officious and trite. But I have seen it happen many times.

              --
              Phillip



              • #8
                Hi Everyone

                I want to do RNA-seq from FFPE material. I know this is a big ask. If I can get FastQC quality scores of 38 across all sequences, with a nice tight peak, is that sufficient?

                kind regards
                Charlotte



                • #9
                  Yes, Phred scores of 38 are plenty good enough. However, the problems you're likely to hit with FFPE material are unlikely to show up as poor quality scores; they are more likely to appear as high duplication levels or contamination, so there will be other bits of QC you're going to need to do.
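
For a sense of scale, a Phred score Q corresponds to an estimated per-base error probability of 10^(-Q/10). The short sketch below converts the scores mentioned in this thread; the function name is hypothetical.

```python
def error_probability(phred_score):
    """Estimated per-base error probability for a Phred score: p = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)

# Scores mentioned in the thread: the Q2 spike, the Q30 threshold, and Q38.
for q in (2, 30, 38):
    print("Q%d -> error probability %.6f" % (q, error_probability(q)))
```

By this formula Q30 is one expected error in 1,000 bases and Q38 is roughly one in 6,300, while Q2 means a base call is more likely wrong than right.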
