  • foolishbrat
    Member
    • Nov 2008
    • 45

    Interpreting Quality Score (Solexa)

    Dear all,

    We usually see quality scores like this for a Solexa tag:

    Code:
    -33   31  -40  -34      -40  -40  -40   40       27  -27  -40  -40
    Each group of four numbers corresponds to one base. Hence, the above
    quality scores refer to a tag of length 3 (e.g. "tca").

    My questions are as follows:
    1. What is a reasonable way to get a single number to represent each base? (e.g. should we average the 4 figures or pick the highest score out of the 4?)
    2. How can we interpret the figures? e.g. Is a base with a positive quality score better than one with a negative quality score?
    3. In general, how do people use this type of quality score information?
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    Averaging would be bad. Each number in the set of 4 is the score for A, C, G, or T, respectively. So the sequence for your little example is CTA: in the first group of four the second number (C) is the highest, in the second group the fourth number (T) is the highest, and in the third group the first number (A) is the highest.
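
    A minimal sketch of that "pick the highest of each four" rule, assuming the columns are in A, C, G, T order as described above and that the line is whitespace-separated. The function name call_bases is made up for this example and is not part of any pipeline.

```python
BASES = "ACGT"  # assumed column order: A, C, G, T


def call_bases(prb_line):
    """Return (sequence, best_scores) for one whitespace-separated line of scores."""
    scores = [int(tok) for tok in prb_line.split()]
    if len(scores) % 4 != 0:
        raise ValueError("expected a multiple of four scores")
    seq, best = [], []
    for i in range(0, len(scores), 4):
        quad = scores[i:i + 4]
        # Index of the highest score within this group of four
        top = max(range(4), key=lambda j: quad[j])
        seq.append(BASES[top])
        best.append(quad[top])
    return "".join(seq), best


line = "-33   31  -40  -34      -40  -40  -40   40       27  -27  -40  -40"
print(call_bases(line))  # ('CTA', [31, 40, 27])
```

    The second value returned is the single per-base score you asked about in question 1: the highest of the four, not their average.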

    The scores are Solexa quality scores, which are not exactly the same as Sanger quality scores, though when the score is > 15 the two are virtually identical. There is an equation for converting Solexa scores to Sanger scores, and another that tells you what the error rate for a given Sanger quality score is supposed to be.
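
    For reference, here is a sketch of those two equations as they are commonly written: a Solexa score is 10*log10(p/(1-p)) of the call being right, a Sanger (Phred) score is -10*log10(error probability), and so Q_sanger = 10*log10(10^(Q_solexa/10) + 1). The function names are invented for the example.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the Sanger/Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_error(q_phred):
    """Expected error probability for a Sanger/Phred score."""
    return 10 ** (-q_phred / 10)


for q in (-5, 0, 10, 15, 20, 40):
    phred = solexa_to_phred(q)
    print(q, round(phred, 2), round(phred_to_error(phred), 5))
```

    Running it shows the two scales diverging for low and negative scores and agreeing closely once you get to 15 or so.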

    A lot of alignment programs don't use the quality scores at all during alignment, though they will output the quality scores of mismatches, which helps you determine how likely it is that the mismatch is a real polymorphism and not an error. But read depth probably tells you more than quality scores when it comes to SNPs.


    • new300
      Member
      • Mar 2008
      • 50

      #3
      Sorry, you probably know most of this already but...

      In general people would use the fastq files which are generated by the Gerald step of the GAPipeline. These files contain the base calls and an associated quality score (which is an estimate of how good the software thinks its guess is). Most short read aligners use fastq files as their input, and many (for example Maq) use this information to help find the correct alignment position. Fastq files look like this:

      @complete:333:89
      CGCCTTCGTATGTTTATCCTGCTTATCACATACTA
      +complete:333:89
      132057787<:9133*9,.65177;54;8)3)37/

      The line following the @ line contains the sequence, and the line following the + line contains the quality scores, one ASCII character per base, each encoding a number. There's a table here: http://www.genographia.org/portal/to...sheet.pdf/view to convert this to a "probability of error".
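
      If you'd rather decode the string yourself than use the table, a rough sketch follows. It assumes Sanger-style encoding (score = ASCII code minus 33) and Phred-scaled scores. That offset is my assumption: it fits the example above because characters such as ')' and '*' sit below the offset of 64 used by other encodings, but check what your own files use.

```python
def decode_qualities(qual_string, offset=33):
    """Turn an ASCII-encoded quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual_string]


def error_probability(q):
    """Probability that a base call is wrong, assuming a Phred-scaled score."""
    return 10 ** (-q / 10)


seq = "CGCCTTCGTATGTTTATCCTGCTTATCACATACTA"
qual = "132057787<:9133*9,.65177;54;8)3)37/"
scores = decode_qualities(qual)

# Show the first few bases with their score and implied error probability
for base, q in list(zip(seq, scores))[:5]:
    print(base, q, round(error_probability(q), 4))
```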

      Quality scores are also useful in SNP calling: you need more bases of low quality than of high quality to call a SNP with confidence. You can also filter reads based on quality score in order to discard junk reads. All in all they are quite handy, but you should make sure they are correctly calibrated (and therefore accurately assigned).
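
      As an illustration of the filtering idea, here is one very simple way to do it: drop reads whose mean score falls below a cutoff. The cutoff of 15, the offset of 33, and the function names are all placeholders for the example, not recommended settings, and a plain mean is a crude summary.

```python
def mean_quality(qual_string, offset=33):
    """Mean of the decoded scores in an ASCII-encoded quality string."""
    scores = [ord(ch) - offset for ch in qual_string]
    return sum(scores) / len(scores)


def keep_read(qual_string, min_mean=15.0, offset=33):
    """True if the read has any bases and its mean score reaches the cutoff."""
    return bool(qual_string) and mean_quality(qual_string, offset) >= min_mean


reads = {
    "good": "IIIIIIIIII",  # 'I' decodes to 40 with offset 33
    "junk": "!!!!!!!!!!",  # '!' decodes to 0 with offset 33
}
kept = [name for name, qual in reads.items() if keep_read(qual)]
print(kept)  # ['good']
```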

      The prb file you've shown contains 4 quality scores for each base. So rather than just getting the probability that the called base is right, you also get probabilities for each of the other bases. For example, you would be able to say "it was probably an A or a C, but it's very unlikely it was a G or a T". That might be useful information, and some aligners are starting to take advantage of it, but it has not been fully exploited. However, don't get too attached to these prb files, as I believe they are set to disappear from the latest version of the GAPipeline.
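
      To make the "probability for each base" idea concrete, here is a sketch that turns one group of four prb values into probabilities. It assumes each value is a Solexa-scaled log-odds score, Q = 10*log10(p/(1-p)), and that the columns are in A, C, G, T order. The function names are invented for the example.

```python
def solexa_to_prob(q):
    """Probability implied by a Solexa-scaled log-odds score."""
    odds = 10 ** (q / 10)
    return odds / (1 + odds)


def quad_probabilities(quad, bases="ACGT"):
    """Map each base to its probability for one group of four scores."""
    return {base: solexa_to_prob(q) for base, q in zip(bases, quad)}


# Third base from the example in this thread
probs = quad_probabilities([27, -27, -40, -40])
for base, p in probs.items():
    print(base, round(p, 4))
```

      Because the file stores rounded integers, the four probabilities will not necessarily sum to exactly 1.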
