
  • Pac Bio fastq file quality score encoding

    I am currently working with FASTQ files that originated from a PacBio instrument and were converted from the native output format to FASTQ by some process. This was done at a different site, and I haven't been able to find out what that process was yet.

    Specifically, I am interested in how the quality scores get converted to Phred-like scores within the quality string of the FASTQ files. There are tools we would like to use that expect quality scores with standard Illumina format offsets. For example, in a standard Illumina FASTQ output file like the length-truncated example below, the quality character "G" for the first base resolves to a score of 38 with an ASCII offset of 33.

    @M01472:163:000000000-AA5WV:1:1101:10006:14422 1:N:0:217
    ACTCGGCCCA
    +
    GFFHGDFEE2

    ord("G") - 33 = 38
    This falls within the expected quality score range of 0 to 40.


    With the PacBio FASTQ data, I'm not seeing scores consistently within this range. Truncated example below:
    @m140929_224119_42136_c100670242550000001823127812201400_s1_p0/32918/ccs 1 28
    ATCTCAGTCC
    +
    qqqqqqqq=q

    Offset 33: ord("q") - 33 = 80 ???
    Offset 64: ord("q") - 64 = 49 ???

    Are these scores a combination of the bases of multiple reads, or is there something else about the formatting I am missing?

  • #2
    PacBio uses ASCII-33 encoding, but for their reads of insert / CCS reads, which are a consensus built from reading the same insert multiple times, they routinely assign quality values far above the standard limit of 41. This breaks a lot of tools that try to auto-detect the quality encoding, so it's important to either specify the quality encoding manually or preprocess the PacBio reads to cap their quality at 41.

    For BBTools, for example, you can use the "qin=33" flag. Also, Reformat accepts the "fixquality" flag, which caps all incoming qualities at 41 and writes the corrected output file.
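
To make the offset arithmetic in this thread concrete, here is a minimal Python sketch that decodes the two example quality strings from the first post under both candidate offsets. The function name `decode_qualities` is made up for illustration and is not part of BBTools or any other tool mentioned here.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


# Quality strings copied from the two truncated records in the first post.
illumina = decode_qualities("GFFHGDFEE2")
pacbio_ccs = decode_qualities("qqqqqqqq=q")
pacbio_as_64 = decode_qualities("qqqqqqqq=q", offset=64)

print(illumina)      # [38, 37, 37, 39, 38, 35, 37, 36, 36, 17]
print(pacbio_ccs)    # [80, 80, 80, 80, 80, 80, 80, 80, 28, 80]
print(pacbio_as_64)  # [49, 49, 49, 49, 49, 49, 49, 49, -3, 49]
```

The negative score under offset 64 (from the "=" character) shows that this read cannot be Phred+64, even though the "q" characters on their own look plausible under that offset.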
    Last edited by Brian Bushnell; 11-05-2014, 09:30 AM.



    • #3
      pac bio quality scores

      So a PacBio quality score of 80 ("q") is, for all intents and purposes, equivalent to a score of 41 as far as read quality filtering is concerned?

      The tool I want to use attempts to auto-detect an ASCII offset of 33 or 64, picks 64, and then throws out half the reads. Would it be appropriate to preprocess the FASTQ files and replace any quality character whose score is > 41 with a "J"?
      ord("J") - 33 = 41
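
The replacement described above could be sketched in Python roughly as follows. This is only an illustration, not what any existing tool does internally: the helper names `cap_quality_string` and `cap_fastq` are hypothetical, and `cap_fastq` assumes strict four-line FASTQ records (no blank lines, no wrapped sequences) with Phred+33 encoding.

```python
def cap_quality_string(quality_string, max_quality=41, offset=33):
    """Replace every quality character above max_quality with the cap character."""
    cap_char = chr(max_quality + offset)  # "J" for Q41 with offset 33
    # Characters compare by code point, so min() keeps the lower-quality one.
    return "".join(min(ch, cap_char) for ch in quality_string)


def cap_fastq(in_path, out_path, max_quality=41, offset=33):
    """Copy a FASTQ file, capping the quality line of each record.

    Assumption: every record is exactly four lines, so the quality string
    is always the fourth line of each group.
    """
    with open(in_path) as src, open(out_path, "w") as dst:
        for line_number, line in enumerate(src):
            if line_number % 4 == 3:
                capped = cap_quality_string(line.rstrip("\n"), max_quality, offset)
                line = capped + "\n"
            dst.write(line)


print(cap_quality_string("qqqqqqqq=q"))  # JJJJJJJJ=J
print(cap_quality_string("GFFHGDFEE2"))  # GFFHGDFEE2 (already within range)
```

Scores at or below the cap pass through unchanged, so the "=" (Q28) in the PacBio example is preserved.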



      • #4
        Yes, that would be fine. The super-high quality values of PacBio reads are not accurate anyway. Q41 corresponds to an error probability just under 1 in 10,000, and anything past that is unimportant for the purposes of quality filtering (particularly if the values are inaccurate).
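
For reference, the standard Phred definition relates a score Q to an error probability P by P = 10^(-Q/10). A small Python sketch of that conversion follows; the function name `phred_to_error_probability` is just an illustrative choice.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


for q in (40, 41, 80):
    print(f"Q{q}: {phred_to_error_probability(q):.2e}")
# Q40: 1.00e-04
# Q41: 7.94e-05
# Q80: 1.00e-08
```

Taken at face value, a reported Q80 would claim one error in 100 million bases, which is why capping at Q41 loses nothing useful for filtering.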

        From the BBTools package, this command will do the trick:

        reformat.sh in=reads.fq out=fixed.fq qin=33 fixquality



        • #5
          Great thanks!



          • #6
            Originally posted by Brian Bushnell
            Yes, that would be fine. The super-high quality values of PacBio reads are not accurate anyway. Q41 means under 1/10000 chance of error, and anything past that is unimportant for the purposes of quality filtering (particularly if its inaccurate).

            From the BBTools package, this command will do the trick:

            reformat.sh in=reads.fq out=fixed.fq qin=33 fixquality
            Hi Brian,

            I installed BBMap 35.07 and tried to run the reformat command above, but got the following error:

            "java -ea -Xmx200m -cp /home/sunz/bbmap/current/ jgi.ReformatReads in=test.fastq out=test_Qfixed.fq qin=33 fixquality
            Executing jgi.ReformatReads [in=test.fastq, out=test_Qfixed.fq, qin=33, fixquality]

            Unknown parameter fixquality
            Exception in thread "main" java.lang.AssertionError: Unknown parameter fixquality
            at jgi.ReformatReads.<init>(ReformatReads.java:168)
            at jgi.ReformatReads.main(ReformatReads.java:45)"

            Any suggestions? Thanks!



            • #7
              Oh, I guess I took out that flag. It's now done automatically. You can specify a cutoff with the flag "maxcalledquality", which defaults to 41. So, the command would be:

              reformat.sh in=reads.fq out=fixed.fq qin=33 maxcalledquality=41

              ...but you can leave out the "maxcalledquality=41" if you want.



              • #8
                great, thx!
