  • akds
    Junior Member
    • Feb 2010
    • 1

    Can my Phred quality score be MAX 35?

    Hi, I am a newbie. Apologies if this is a naive question:

    A month ago, I did some analysis on sequence data. I saw that my maximum Phred quality score for a position was 40. Yesterday, I got newer data and my maximum Phred quality score for a position is 35; that is the largest value I observe.

    Should I linearly scale the scores so that 35 = 40? Why/how would this happen?

    Best.
  • simonandrews
    Simon Andrews
    • May 2009
    • 870

    #2
    There is no theoretical limit to a Phred score since it's just a negative log probability that a call is incorrect. In practice the range you see will depend on the program you're using as they will tend to have different upper bounds.
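
As a rough illustration of the "negative log probability" point, here is a minimal sketch assuming the usual definition Q = -10 * log10(p), where p is the probability that the call is wrong. The function names are made up for this example and are not taken from any library.

```python
import math

def phred_from_error_prob(p):
    """Phred quality for an error probability p, with 0 < p <= 1."""
    return -10.0 * math.log10(p)

def error_prob_from_phred(q):
    """Error probability implied by a Phred quality q."""
    return 10.0 ** (-q / 10.0)

# Smaller error probabilities keep giving larger scores, so the scale
# itself has no upper bound; any cap comes from the software.
for p in (0.1, 0.01, 0.001, 0.0001):
    print(p, round(phred_from_error_prob(p), 2))
```

Under this definition a score of 35 versus 40 is a real difference in claimed error rate, so linearly rescaling one data set to match the other would misstate those probabilities.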

    It's also important to check that you're interpreting the encoding on the quality values correctly. There are at least 3 different schemes for encoding a quality value in a single character for use in FastQ files.
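
One quick sanity check on the encoding is to look at the lowest character in the quality strings. The sketch below assumes the two common ASCII offsets, 33 (Sanger) and 64 (Solexa and Illumina 1.3+), with a lowest possible Solexa score of -5. `guess_quality_offset` is a hypothetical helper, and when it returns 64 that is only a guess, because a Sanger file containing nothing but high scores would look the same.

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset used by a set of FASTQ quality strings.

    Returns 33 if any character is below ';' (ASCII 59), because an
    offset-64 file cannot contain one. Otherwise returns 64 as a guess.
    Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    # Offset 64 with a minimum Solexa score of -5 gives 64 - 5 = 59.
    return 33 if lowest < 59 else 64

print(guess_quality_offset(["IIII5+#"]))
print(guess_quality_offset(["hhhhbbZZ"]))
```

This cannot tell Solexa scores apart from Illumina 1.3+ PHRED scores unless characters in the range 59 to 63 are present.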

    Given that you're seeing a difference of 5 between your data sets I'd take a wild guess and say that you've got one file from an illumina pipeline >v1.3 and one from an illumina pipeline <v1.3. There was a change in the way Illumina encoded their quality values (which you can't easily detect from looking at the FastQ files), which caused all quality values to be offset by 5.

    More details about this mess can be found in the wikipedia article on FastQ files:

    • maubp
      Peter (Biopython etc)
      • Jul 2009
      • 1544

      #3
      Originally posted by akds:
      "Why/how would this happen?"

      It could just be that the second run wasn't as good data:

      Were both runs from the same biological sample? If not, it could be down to not-quite-as-good sample preparation in the second run.

      It could also have been a not-so-good run on the sequencing machine (e.g. the second batch of reagents may not have been as good).

      You could ask your sequencing center if they are aware of any general issues at the time your second sample was run.

      • sunnyvu
        Member
        • Mar 2010
        • 17

        #4
        I have a similar question. In my data, the maximum Phred quality score is 65. Is this reasonable? According to the Wikipedia article on FastQ files, Phred quality scores are in the range [0-40]. Did I misunderstand something?
        Thanks.

        • maubp
          Peter (Biopython etc)
          • Jul 2009
          • 1544

          #5
          Originally posted by sunnyvu:
          "I have a similar question. In my data, the maximum Phred quality score is 65. Is this reasonable? According to the Wikipedia article on FastQ files, Phred quality scores are in the range [0-40]. Did I misunderstand something?
          Thanks."

          That does sound unusually high for a raw read quality score (but fine for a consensus built by aligning multiple reads). What kind of data is it?

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Originally posted by simonandrews:
            "Given that you're seeing a difference of 5 between your data sets I'd take a wild guess and say that you've got one file from an illumina pipeline >v1.3 and one from an illumina pipeline <v1.3. There was a change in the way Illumina encoded their quality values (which you can't easily detect from looking at the FastQ files), which caused all quality values to be offset by 5."

            That's not quite right. The offset is the same (ASCII 64), but they switched from Solexa scores (minimum -5) to PHRED scores (minimum 0). For good quality reads this makes no difference. For poor reads it is important; note that Solexa -5 and PHRED 0 are about equivalent. The Wikipedia page you linked to, and the references within it, do cover this.
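
To see why the two scales only diverge for poor reads, here is a small sketch using the standard Solexa-to-PHRED conversion, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The function name is chosen for this example only.

```python
import math

def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality to the PHRED scale (as a float)."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)

# Low Solexa scores move noticeably; from about 10 upward the two
# scales differ by less than half a unit.
for q in (-5, 0, 5, 10, 20, 30, 40):
    print(q, round(solexa_to_phred(q), 2))
```

Because the offset is unchanged, the characters in the file look the same in both schemes; only the meaning of the low end of the scale differs.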

            • sunnyvu
              Member
              • Mar 2010
              • 17

              #7
              Originally posted by maubp:
              "That does sound unusually high for a raw read quality score (but fine for a consensus built by aligning multiple reads). What kind of data is it?"

              I used the R package ShortRead to look at the data from Illumina. It's likely that ShortRead converted the scores from phred64 to the Sanger encoding.

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Originally posted by sunnyvu:
                "I used the R package ShortRead to look at the data from Illumina. It's likely that ShortRead converted the scores from phred64 to the Sanger encoding."

                Double check this. Failing to do the conversion will inflate the scores by 31.

                It seems much more likely that the conversion wasn't done, i.e. I think you really have Illumina FASTQ files with a maximum PHRED score of 34 (could be better, but still pretty good), but they were read in as Sanger FASTQ files and thus wrongly interpreted as having PHRED scores up to 65 (far too high for raw reads).
                Last edited by maubp; 04-13-2010, 01:56 PM. Reason: typo
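
The inflation by 31 is just the gap between the two ASCII offsets, 64 (Illumina) and 33 (Sanger). A minimal sketch, where `decode_quality` is a hypothetical helper and not part of ShortRead or any other package:

```python
def decode_quality(quality_string, offset):
    """Turn a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality_string]

qual = "b"  # ASCII 98

print(decode_quality(qual, 64))  # read as Illumina (offset 64)
print(decode_quality(qual, 33))  # same character misread as Sanger (offset 33)
```

With offset 64 the character decodes to 34, and with offset 33 the same character decodes to 65, which matches the maximum reported above.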

                • sunnyvu
                  Member
                  • Mar 2010
                  • 17

                  #9
                  hi maubp,

                  Yes, I do have Illumina FASTQ files. I have now figured out my problem, and I would like to share the R commands I used.
                  ####
                  library(ShortRead)
                  reads <- readFastq("s_1_1_sequence.txt")
                  if (length(reads) >1000000){ reads <- sample(reads, 1000000) }
                  qual <- SFastqQuality(quality(quality(reads))) # 'S' standing for 'Solexa'.
                  readM <- as(qual, "matrix")
                  pdf(file="s_1_1.pdf")
                  boxplot(as.data.frame((readM)), outline = FALSE, main="Per Cycle Read Quality", xlab="Cycle", ylab="Phred Quality")
                  dev.off()

                  Thank you very much!

                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    That makes sense sunnyvu.

                    However there is a subtle difference between early Solexa/Illumina FASTQ files (using Solexa scores) and those from Illumina 1.3 or later (using PHRED scores). Which type of Illumina FASTQ files do you have? As previously noted, for good scores this doesn't really matter (above 10 the two quality scores are effectively interchangeable).

                    You should check the ShortRead documentation to see if it takes this distinction into account, or ask on the Bioconductor mailing list.

                    • kmcarr
                      Senior Member
                      • May 2008
                      • 1181

                      #11
                      Originally posted by maubp:
                      "I think you really have Illumina FASTQ files with a maximum PHRED score of 34 (could be better, but still pretty good)..."

                      It appears that the current versions of RTA and Bustard (1.6) cap the Q-score they will assign at 34.

                      • sunnyvu
                        Member
                        • Mar 2010
                        • 17

                        #12
                        Originally posted by maubp:
                        "That makes sense sunnyvu. [...] You should check the ShortRead documentation to see if they consider this, or ask on the BioConductor mailing list."

                        I looked around and did not find any parameter for Illumina 1.3+. I solved my problems using HTSeq.

                        Thank you very much!

                        Have a good weekend.
