  • Interpreting Quality Score (Solexa)

    Dear all,

    We usually see quality output like this for a Solexa tag:

    Code:
    -33   31  -40  -34      -40  -40  -40   40       27  -27  -40  -40
    Each group of four numbers corresponds to one base. Hence, the above
    qualities refer to a tag of length 3 (e.g. "tca").

    My questions are as follows:
    1. What is a reasonable way to derive a single number to represent each base? (e.g. should we average the four figures or pick the highest score of the four?)
    2. How should we interpret the figures? e.g. is a base with a positive quality score better than one with a negative quality score?
    3. In general, how do people use this type of quality score information?

  • #2
    Averaging would be bad. Each number in the set of 4 represents the score for A, C, G, or T respectively. So the sequence for your little bit there is CTA: in the first group of four the second number is the highest, in the second group the fourth number is the highest, and in the third group the first number is the highest.
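
    The rule described above (take the highest of the four scores, in A, C, G, T order) can be sketched in a few lines of Python. This is only an illustration: the function name `call_bases` is made up here, and the column order is taken from this reply rather than from any pipeline documentation.

```python
BASES = "ACGT"  # column order as described in the reply: A, C, G, T


def call_bases(prb_line):
    """Return the called sequence: the highest-scoring base in each group of four."""
    scores = [int(tok) for tok in prb_line.split()]
    if len(scores) % 4 != 0:
        raise ValueError("expected a multiple of four scores")
    calls = []
    for i in range(0, len(scores), 4):
        group = scores[i:i + 4]
        # index of the maximum score selects the base letter
        calls.append(BASES[group.index(max(group))])
    return "".join(calls)


line = "-33   31  -40  -34      -40  -40  -40   40       27  -27  -40  -40"
print(call_bases(line))  # CTA
```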

    The scores are Solexa quality scores, which are not exactly the same as Sanger quality scores, though when the score is > 15 the two are virtually identical. There is an equation for converting Solexa scores to Sanger scores, and another that gives the expected error rate for a given Sanger quality score.
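
    The post does not quote the equations. The forms usually given are Q_sanger = 10 * log10(10^(Q_solexa / 10) + 1) for the conversion and p = 10^(-Q_sanger / 10) for the error rate, where Sanger scores are Phred-scaled. A minimal sketch under that assumption (the function names are illustrative, not from any pipeline):

```python
import math


def solexa_to_sanger(q_solexa):
    """Convert a Solexa-scaled score to a Sanger (Phred-scaled) score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def sanger_error_probability(q_sanger):
    """Expected probability that the base call is wrong for a Sanger score."""
    return 10 ** (-q_sanger / 10)


# The two scales differ at low scores and converge as the score grows.
for q in (-5, 0, 5, 10, 15, 20, 40):
    q_sanger = solexa_to_sanger(q)
    print(q, round(q_sanger, 2), sanger_error_probability(q_sanger))
```

    At a Solexa score of 0 the Sanger equivalent is about 3, while at 15 the two differ by only about 0.14, which is in line with the "virtually identical above 15" remark.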

    A lot of alignment programs don't use the quality scores at all during alignment, though they will output the quality scores of mismatches, which helps you determine how likely it is that the mismatch is a real polymorphism and not an error. But read depth probably tells you more than quality scores when it comes to SNPs.

    • #3
      Originally posted by foolishbrat
      Dear all,

      We usually see quality output like this for a Solexa tag:

      Code:
      -33   31  -40  -34      -40  -40  -40   40       27  -27  -40  -40
      Each group of four numbers corresponds to one base. Hence, the above
      qualities refer to a tag of length 3 (e.g. "tca").

      My questions are as follows:
      1. What is a reasonable way to derive a single number to represent each base? (e.g. should we average the four figures or pick the highest score of the four?)
      2. How should we interpret the figures? e.g. is a base with a positive quality score better than one with a negative quality score?
      3. In general, how do people use this type of quality score information?
      Sorry, you probably know most of this already but...

      In general, people use the fastq files generated by the Gerald step of the GAPipeline. These files contain the base calls and an associated quality score (an estimate of how good the software thinks its guess is). Most short read aligners use fastq files as their input, and many (for example Maq) use this information to help find the correct alignment position. Fastq files look like this:

      @complete:333:89
      CGCCTTCGTATGTTTATCCTGCTTATCACATACTA
      +complete:333:89
      132057787<:9133*9,.65177;54;8)3)37/

      The line following the @ line contains the sequence, and the line following the + line contains the quality scores, one ASCII-encoded character per base. There's a table here: http://www.genographia.org/portal/to...sheet.pdf/view to convert this to a "probability of error".
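
      As a rough sketch of what that table lookup does, the quality line can be decoded by subtracting an offset from each character code and then converting the score to an error probability. The offset of 33 and the Phred-style formula are assumptions here, since the post does not say which encoding the file uses; with an offset of 64 the characters in this example would decode to large negative values, which is why 33 is used.

```python
def decode_quality(qual, offset=33):
    """Turn an ASCII-encoded quality line into a list of integer scores."""
    return [ord(ch) - offset for ch in qual]


def error_probability(score):
    """Probability of error for a Phred-scaled score (assumed scale)."""
    return 10 ** (-score / 10)


seq = "CGCCTTCGTATGTTTATCCTGCTTATCACATACTA"
qual = "132057787<:9133*9,.65177;54;8)3)37/"

scores = decode_quality(qual)
# Print the first five bases with their score and error probability.
for base, score in list(zip(seq, scores))[:5]:
    print(base, score, round(error_probability(score), 4))
```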

      Quality scores are also useful in SNP calling: you need more bases of low quality than of high quality to call a SNP with confidence. You can also filter reads based on quality score in order to discard junk reads. All in all they are quite handy, but you should make sure they are correctly calibrated (and therefore accurately assigned).
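
      One simple way to filter junk reads is to drop any read whose mean quality falls below a cutoff. The sketch below assumes an ASCII offset of 33 and an arbitrary cutoff of 15; both are illustrative choices, not values from the post, and the second record is made up to show a read being rejected.

```python
def mean_quality(qual, offset=33):
    """Average decoded quality score of one read."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_reads(records, min_mean_q=15.0, offset=33):
    """Keep (name, sequence, quality) records whose mean quality meets the cutoff."""
    return [rec for rec in records if mean_quality(rec[2], offset) >= min_mean_q]


records = [
    ("complete:333:89",
     "CGCCTTCGTATGTTTATCCTGCTTATCACATACTA",
     "132057787<:9133*9,.65177;54;8)3)37/"),
    ("made_up_junk_read", "ACGT", "!!##"),  # hypothetical low-quality read
]

kept = filter_reads(records)
print([name for name, _, _ in kept])
```

      Real pipelines often use stricter or different criteria (for example trimming low-quality tails), so treat the mean-quality rule as a starting point only.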

      The prb file you've shown contains 4 quality scores for each base. So rather than just getting the probability that the called base is right, you also get probabilities for each of the other bases. So, for example, you would be able to say "it was probably an A or a C, but it's very unlikely it was a G or a T". That might be useful information, and some aligners are starting to take advantage of it, but it has not been fully exploited. However, don't get too attached to these prb files, as I believe they are set to disappear from the latest version of the GAPipeline.
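
      To make the four scores concrete, each one can be turned into a probability for its base. This sketch assumes the prb values are Solexa-scaled, i.e. Q = 10 * log10(p / (1 - p)), and that the columns are in A, C, G, T order; neither is stated in this post, so treat both as assumptions.

```python
BASES = "ACGT"  # assumed column order


def solexa_to_probability(q):
    """Invert Q = 10 * log10(p / (1 - p)) to get the probability p."""
    odds = 10 ** (q / 10)
    return odds / (1 + odds)


def quartet_probabilities(quartet):
    """Map one group of four prb scores to per-base probabilities."""
    return {base: solexa_to_probability(q) for base, q in zip(BASES, quartet)}


# First group of four from the original post.
probs = quartet_probabilities([-33, 31, -40, -34])
for base in BASES:
    print(base, round(probs[base], 5))
```

      Because the scores are stored as rounded integers, the four probabilities will not necessarily sum to exactly 1.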
