  • lankage
    Member
    • Oct 2014
    • 20

    PacBio FASTQ file quality score encoding

    I am currently working with FASTQ files that originated from a PacBio instrument and were converted from the instrument's native output format to FASTQ by some process. This was done at a different site, and I haven't been able to find out what that process was yet.

    Specifically, I am interested in how the quality scores get converted to Phred-like scores within the quality string of the FASTQ files. There are tools we would like to use that expect quality scores with standard Illumina format offsets. For example, in a standard Illumina FASTQ output file like the length-truncated example below, the quality score encoding of "G" for the first base resolves to 38 with an ASCII offset of 33.

    @M01472:163:000000000-AA5WV:1:1101:10006:14422 1:N:0:217

    ACTCGGCCCA

    +

    GFFHGDFEE2

    ord("G") - 33 = 38
    This falls within the expected quality score range of 0 - 40
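The arithmetic above can be applied to the whole quality string. This is a minimal sketch; the function name `decode_phred` is made up for illustration, and offset 33 is assumed as in the example.

```python
def decode_phred(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to integer Phred scores."""
    # Each character encodes one score: ASCII code minus the offset.
    return [ord(ch) - offset for ch in quality]


# Quality string from the truncated Illumina record above.
print(decode_phred("GFFHGDFEE2"))
# [38, 37, 37, 39, 38, 35, 37, 36, 36, 17]
```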


    With the PacBio FASTQ data, I'm not seeing scores consistently within this range. Truncated example below:
    @m140929_224119_42136_c100670242550000001823127812201400_s1_p0/32918/ccs 1 28

    ATCTCAGTCC

    +

    qqqqqqqq=q

    offset 33: ord("q") - 33 = 80 ???

    offset 64: ord("q") - 64 = 49 ???

    Are these scores a combination of the bases of multiple reads, or is there something else about the formatting I am missing?
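One quick check on the ten quality characters shown above: a Phred score cannot be negative, so an offset is ruled out if the lowest character in the string falls below it. The sketch below uses a made-up helper, `plausible_offsets`, and is not how any particular tool performs its detection.

```python
def plausible_offsets(quality: str, candidates=(33, 64)) -> list[int]:
    """Return the candidate offsets under which no score is negative."""
    lowest = min(ord(ch) for ch in quality)
    return [off for off in candidates if lowest - off >= 0]


# "=" is ASCII 61, which would decode to -3 under offset 64,
# so only offset 33 remains for the PacBio example.
print(plausible_offsets("qqqqqqqq=q"))  # [33]
```

A longer sample of reads would make this check more reliable than a single ten-base fragment.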
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    PacBio uses ASCII-33 encoding, but for their reads of insert / CCS, which are a consensus of the same read sequenced multiple times, they routinely assign quality values way above the standard limit of 41. This breaks a lot of tools that try to auto-detect quality encoding, so it's important to manually specify the quality encoding, or else preprocess the PacBio reads to cap their quality at 41.

    For BBTools, for example, you can use the "qin=33" flag. Also, Reformat can accept the "fixquality" flag which will cap all incoming qualities at 41 and write the corrected output file.
    Last edited by Brian Bushnell; 11-05-2014, 09:30 AM.

    • lankage
      Member
      • Oct 2014
      • 20

      #3
      PacBio quality scores

      So a PacBio quality score of 80 (the character "q") is, for all intents and purposes, equivalent to a score of 41 as far as read quality filtering is concerned?

      The tool I want to use attempts to auto-detect an ASCII offset of 33 or 64, picks 64, then throws out half the reads. Would it be appropriate to preprocess the FASTQ files and replace any quality character whose score is > 41 with a "J"?
      ord("J") - 33 = 41
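The replacement described here could be sketched in Python as follows. The function name `cap_quality` and its parameters are hypothetical, and the sketch assumes offset 33 and operates on a single quality string rather than a whole file.

```python
def cap_quality(quality: str, max_q: int = 41, offset: int = 33) -> str:
    """Replace every character whose score exceeds max_q with the cap character."""
    cap_char = chr(max_q + offset)  # "J" for max_q=41, offset=33
    return "".join(
        cap_char if ord(ch) - offset > max_q else ch
        for ch in quality
    )


# The "=" (score 28) is left alone; each "q" (score 80) becomes "J".
print(cap_quality("qqqqqqqq=q"))  # JJJJJJJJ=J
```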

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Yes, that would be fine. The super-high quality values of PacBio reads are not accurate anyway. Q41 means under 1/10000 chance of error, and anything past that is unimportant for the purposes of quality filtering (particularly if it's inaccurate).
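The error figure follows from the standard Phred definition, P = 10^(-Q/10). A small sketch, with a function name chosen only for illustration:

```python
def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


# Q40 is 1 in 10,000; Q41 is slightly below that (about 7.9e-5);
# a nominal Q80 would claim roughly 1 error in 100 million bases.
for q in (40, 41, 80):
    print(q, phred_to_error_prob(q))
```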

        From the BBTools package, this command will do the trick:

        reformat.sh in=reads.fq out=fixed.fq qin=33 fixquality

        • lankage
          Member
          • Oct 2014
          • 20

          #5
          Great thanks!

          • sunz
            Junior Member
            • Feb 2011
            • 2

            #6
            Originally posted by Brian Bushnell:
            Yes, that would be fine. The super-high quality values of PacBio reads are not accurate anyway. Q41 means under 1/10000 chance of error, and anything past that is unimportant for the purposes of quality filtering (particularly if it's inaccurate).

            From the BBTools package, this command will do the trick:

            reformat.sh in=reads.fq out=fixed.fq qin=33 fixquality

            Hi Brian,

            I installed BBMap 35.07 and tried to run the above reformat command, but got the following error:

            "java -ea -Xmx200m -cp /home/sunz/bbmap/current/ jgi.ReformatReads in=test.fastq out=test_Qfixed.fq qin=33 fixquality
            Executing jgi.ReformatReads [in=test.fastq, out=test_Qfixed.fq, qin=33, fixquality]

            Unknown parameter fixquality
            Exception in thread "main" java.lang.AssertionError: Unknown parameter fixquality
            at jgi.ReformatReads.<init>(ReformatReads.java:168)
            at jgi.ReformatReads.main(ReformatReads.java:45)"

            Any suggestion? Thanks!

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Oh, I guess I took out that flag. It's now done automatically. You can specify a cutoff with the flag "maxcalledquality", which defaults to 41. So, the command would be:

              reformat.sh in=reads.fq out=fixed.fq qin=33 maxcalledquality=41

              ...but you can leave out the "maxcalledquality=41" if you want.

              • sunz
                Junior Member
                • Feb 2011
                • 2

                #8
                great, thx!
