SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Pacific Biosciences



Similar Threads
Thread Thread Starter Forum Replies Last Post
average quality score for fastq file jgibbons1 Bioinformatics 12 07-16-2013 01:13 PM
Illumina quality score encoding for galaxy grooming Mshegrss General 2 03-14-2012 06:53 AM
Pac Bio Developer Network just went up... ECO Pacific Biosciences 0 07-09-2010 07:43 AM
Fastq quliaty score and MAQ output quality score baohua100 Bioinformatics 1 02-19-2009 09:21 AM
Pac Bio raises $100mill for sequencer commercialization ECO The Pipeline 0 07-15-2008 07:12 AM

Reply
 
Thread Tools
Old 11-05-2014, 08:24 AM   #1
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default Pac Bio fastq file quality score encoding

I am currently working with fastq files that originated from a pac bio instrument and were converted from their native output format to fastq by some process. This was done at a different site and i haven't been able to find out what this process was yet.

Specifically, i am interested in how the quality scores get converted to phred like scores within the quality string of the fastq files. There are tools we would like to use that expect quality scores with standard illumina format offsets. For example
in a standard illumina fastq output file like the length truncated example below, the quality score encoding of "G" for the first base resolves to 38 with an ASCII offset of 33.

@M01472:163:000000000-AA5WV:1:1101:10006:14422 1:N:0:217

ACTCGGCCCA

+

GFFHGDFEE2

ord("G") - 33 = 38
This falls within the expected quality score range of 0 - 40


With the pac bio fastq data, Im not seeing scores consistently within this range. Truncated example below:
@m140929_224119_42136_c100670242550000001823127812201400_s1_p0/32918/ccs 1 28

ATCTCAGTCC

+

qqqqqqqq=q

offset 33 ord("q") - 33 = 80 ???

offset 64 ord("q") - 64 = 49 ???

Are these scores a combination of the bases of multiple reads or is there something else about the formatting i am missing?
lankage is offline   Reply With Quote
Old 11-05-2014, 08:27 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

PacBio uses ASCII-33 encoding, but for their reads of insert / CCS, which are consensus of the same read multiple times, they routinely assign quality values way above the standard limit of 41. This breaks a lot of tools that try to auto-detect quality encoding, so it's important to manually specify the quality encoding, or else preprocess the PacBio reads to cap their quality at 41.

For BBTools, for example, you can use the "qin=33" flag. Also, Reformat can accept the "fixquality" flag which will cap all incoming qualities at 41 and write the corrected output file.

Last edited by Brian Bushnell; 11-05-2014 at 08:30 AM.
Brian Bushnell is offline   Reply With Quote
Old 11-05-2014, 08:36 AM   #3
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default pac bio quality scores

So a pac bio quality string score of 80 --> "q", is for all intents and purposes equivalent to a score of 41 as far as read quality filtering is concerned?

The tool i want to use attempts to auto detect ASCII -33 or -64 offset, picks 64 offset, then throws out half the reads. Would it be appropriate to preprocess the fastq files and replace any quality characters with score > 41 with a "J".
ord("J") - 33 = 41
lankage is offline   Reply With Quote
Old 11-05-2014, 09:10 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yes, that would be fine. The super-high quality values of PacBio reads are not accurate anyway. Q41 means under 1/10000 chance of error, and anything past that is unimportant for the purposes of quality filtering (particularly if its inaccurate).

From the BBTools package, this command will do the trick:

reformat.sh in=reads.fq out=fixed.fq qin=33 fixquality
Brian Bushnell is offline   Reply With Quote
Old 11-05-2014, 09:12 AM   #5
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default

Great thanks!
lankage is offline   Reply With Quote
Old 06-18-2015, 01:05 PM   #6
sunz
Junior Member
 
Location: Massachusetts

Join Date: Feb 2011
Posts: 2
Default

Quote:
Originally Posted by Brian Bushnell View Post
Yes, that would be fine. The super-high quality values of PacBio reads are not accurate anyway. Q41 means under 1/10000 chance of error, and anything past that is unimportant for the purposes of quality filtering (particularly if its inaccurate).

From the BBTools package, this command will do the trick:

reformat.sh in=reads.fq out=fixed.fq qin=33 fixquality
Hi Brian,

I installed the BBMap 35.07 and tried to run the above reformat command but got the following error:

"java -ea -Xmx200m -cp /home/sunz/bbmap/current/ jgi.ReformatReads in=test.fastq out=test_Qfixed.fq qin=33 fixquality
Executing jgi.ReformatReads [in=test.fastq, out=test_Qfixed.fq, qin=33, fixquality]

Unknown parameter fixquality
Exception in thread "main" java.lang.AssertionError: Unknown parameter fixquality
at jgi.ReformatReads.<init>(ReformatReads.java:168)
at jgi.ReformatReads.main(ReformatReads.java:45)"

Any suggestion? Thanks!
sunz is offline   Reply With Quote
Old 06-18-2015, 03:02 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh, I guess I took out that flag. It's now done automatically. You can specify a cutoff with the flag "maxcalledquality", which defaults to 41. So, the command would be:

reformat.sh in=reads.fq out=fixed.fq qin=33 maxcalledquality=41

...but you can leave out the "maxcalledquality=41" if you want.
Brian Bushnell is offline   Reply With Quote
Old 06-19-2015, 06:26 AM   #8
sunz
Junior Member
 
Location: Massachusetts

Join Date: Feb 2011
Posts: 2
Default

great, thx!
sunz is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:59 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO