SEQanswers

01-09-2009, 05:36 AM   #1
foolishbrat
Member
 
Location: South East Asia

Join Date: Nov 2008
Posts: 44
Interpreting Quality Score (Solexa)

Dear all,

We usually see this kind of quality score for a Solexa tag:

Code:
-33   31  -40  -34      -40  -40  -40   40       27  -27  -40  -40
Each group of four numbers corresponds to one base. Hence, the
quality values above refer to a tag of length 3 (e.g. "tca").

My questions are as follows:
  1. What is a reasonable way to find a single number to represent each base? (e.g. should we average the 4 figures or pick the highest score out of the 4?)
  2. How should we interpret the figures? e.g. Is a base with a positive quality score better than one with a negative quality score?
  3. In general, how do people use this type of quality score information?
01-09-2009, 08:45 AM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

Averaging would be bad. Each number in the set of 4 represents the score for A, C, G, or T respectively. So the sequence for your little bit there is CTA: in the first set of four the second number is the highest, in the second set the fourth number is the highest, and in the third set the first number is the highest.
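
To make the "pick the highest of each four" rule concrete, here is a minimal Python sketch. It assumes the A, C, G, T column order described above and whitespace-separated integers; the function name `call_bases` is made up for illustration and is not part of any Solexa tool.

```python
BASES = "ACGT"  # column order as described in the post above

def call_bases(prb_line):
    """Return a list of (base, score) pairs, one per group of four scores."""
    values = [int(tok) for tok in prb_line.split()]
    if len(values) % 4 != 0:
        raise ValueError("expected a multiple of four scores")
    calls = []
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        # Index of the highest score in this group of four
        best = max(range(4), key=lambda k: group[k])
        calls.append((BASES[best], group[best]))
    return calls

line = "-33   31  -40  -34      -40  -40  -40   40       27  -27  -40  -40"
calls = call_bases(line)
print("".join(base for base, _ in calls))  # CTA
print([score for _, score in calls])       # [31, 40, 27]
```

The highest score in each group also gives a natural single number per base, which is what ends up in the fastq file rather than an average.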

The scores are Solexa quality scores, which are not exactly the same as Sanger (Phred) quality scores, though when the score is > 15 the two are virtually identical. There is an equation for converting Solexa scores to Sanger scores, and another that tells you what error rate a given Sanger quality score is supposed to correspond to.
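
For reference, a small Python sketch of those two equations. It assumes the usual definitions: Solexa score = -10 log10(p / (1 - p)) and Sanger/Phred score = -10 log10(p), where p is the error probability. The function names are just illustrative.

```python
import math

def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the Sanger/Phred scale."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)

def phred_to_error_prob(q_phred):
    """Error probability implied by a Sanger/Phred score."""
    return 10.0 ** (-q_phred / 10.0)

# Print Solexa score, equivalent Phred score, and implied error probability
for q in (-5, 0, 5, 10, 15, 20, 40):
    q_phred = solexa_to_phred(q)
    print(f"{q:>3}  {q_phred:6.2f}  {phred_to_error_prob(q_phred):.5f}")
```

At low scores the two scales differ noticeably (a Solexa score of 0 is a Phred score of about 3), while by 15 the gap is down to roughly 0.1.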

A lot of alignment programs don't use the quality scores at all during alignment, though they will output the quality scores of mismatches, which helps you determine how likely it is that the mismatch is a real polymorphism and not an error. But read depth probably tells you more than quality scores when it comes to SNPs.
01-09-2009, 12:21 PM   #3
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50

Sorry, you probably know most of this already but...

In general, people use the fastq files generated by the Gerald step of the GAPipeline. These files contain the base calls and an associated quality score (an estimate of how good the software thinks its guess is). Most short-read aligners take fastq files as their input, and many (for example Maq) use this information to help find the correct alignment position. Fastq files look like this:

@complete:333:89
CGCCTTCGTATGTTTATCCTGCTTATCACATACTA
+complete:333:89
132057787<:9133*9,.65177;54;8)3)37/

The line following the @ line contains the sequence, and the line following the + line contains ASCII-encoded characters, one per base, each representing a quality score. There's a table here: http://www.genographia.org/portal/to...sheet.pdf/view to convert this to a "probability of error".
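
If you would rather decode the quality line yourself than look it up in a table, something like the Python sketch below works. The ASCII offset is an assumption: 33 is the Sanger-style offset, while Solexa/Illumina pipeline files of this era often use 64. Several characters in the example above have ASCII codes below 59, which an offset of 64 would map to scores under -5, so 33 seems the more plausible choice for this record. The function name `decode_quality` is made up.

```python
def decode_quality(qual_string, offset=33):
    """Turn a fastq quality string into a list of integer scores.

    offset=33 is an assumption (Sanger-style encoding); use 64 for
    files written with the older Solexa/Illumina encoding.
    """
    return [ord(ch) - offset for ch in qual_string]

record = """@complete:333:89
CGCCTTCGTATGTTTATCCTGCTTATCACATACTA
+complete:333:89
132057787<:9133*9,.65177;54;8)3)37/"""

name, seq, plus, qual = record.splitlines()
scores = decode_quality(qual)

# One score per base
for base, score in zip(seq, scores):
    print(base, score)
```

Whether the decoded numbers are on the Solexa or the Sanger scale depends on the pipeline that wrote the file, so check that before converting them to error probabilities.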

Quality scores are also useful in SNP calling: you need more bases of low quality than of high quality to call a SNP with confidence. You can also filter reads based on quality score in order to discard junk reads. All in all they are quite handy, but you should make sure they are correctly calibrated (and therefore accurately assigned).
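
As a rough illustration of filtering, here is a Python sketch that drops reads whose mean quality falls below a cutoff. The cutoff of 15, the offset of 33, the read names, and the function names are all assumptions chosen for the example, not recommended settings.

```python
def mean_quality(qual_string, offset=33):
    """Mean decoded score of a fastq quality string (0.0 if empty)."""
    scores = [ord(ch) - offset for ch in qual_string]
    return sum(scores) / len(scores) if scores else 0.0

def keep_read(qual_string, min_mean=15.0, offset=33):
    """True if the read's mean quality reaches the (hypothetical) cutoff."""
    return mean_quality(qual_string, offset) >= min_mean

# Two made-up reads: 'I' decodes to 40 and '#' to 2 with offset 33
reads = [("read_good", "IIIIIIIIII"), ("read_poor", "II########")]
kept = [name for name, qual in reads if keep_read(qual)]
print(kept)  # ['read_good']
```

A mean is only one possible criterion; trimming low-quality ends or counting bases below a threshold are common alternatives.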

The prb file you've shown contains 4 quality scores for each base. So rather than just getting the probability that the called base is right, you also get probabilities for each of the other bases. So, for example, you would be able to say "it was probably an A or a C, but it's very unlikely it was a G or a T". That might be useful information, and some aligners are starting to take advantage of it, but it has not been fully exploited. However, don't get too attached to these prb files, as I believe they are set to disappear from the latest version of the GAPipeline.
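
To see what those four scores mean as probabilities, here is a Python sketch. It assumes each prb value is on the Solexa scale, i.e. score = 10 log10(p / (1 - p)) where p is the probability that the base is the one in that column, and that the columns are in A, C, G, T order. The function name is illustrative only.

```python
def solexa_score_to_prob(score):
    """Probability implied by a Solexa-scaled score (assumed definition)."""
    return 1.0 / (1.0 + 10.0 ** (-score / 10.0))

# Third group of four from the example in this thread
group = [27, -27, -40, -40]
probs = {base: solexa_score_to_prob(s) for base, s in zip("ACGT", group)}

for base, p in probs.items():
    print(f"{base}: {p:.4f}")
```

For this group A comes out at about 0.998, C at about 0.002, and G and T at about 0.0001 each, so the four values sum to roughly 1 as you would hope.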