  • quality scores vs prb files

    I think that the prb files give a probability score for each base. How does this differ from the quality score?
    Thanks

  • #2
    I'll give it a try...

    The Illumina pipeline produces Q scores from the prb data. The scores are encoded as a single ASCII character per base in the *_sequence.txt files. The formula they use is Q = 10*log10(P/(1-P)) + 64, where P is the probability that the base was called correctly. Given that the base caller is just picking the likeliest base, P corresponds to the largest of the four scores in the prb file.

    Unfortunately, the official fastq format defines a different encoding, Q = -10*log10(E) + 33, where E is the probability that the base was called *incorrectly*. That's 1-P, or the sum of the probabilities of the other three bases.

    The two log terms are nearly equal once Q is above 15 or so. But the different scaling factors (really additive offsets, 64 vs 33) used to convert to ASCII obviously matter.
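To make the comparison concrete, here is a minimal Python sketch of the two log terms described above, without the ASCII offsets. The function names (`solexa_q`, `phred_q`, `solexa_to_phred`) are made up for illustration and are not part of any pipeline; the conversion formula follows from substituting E = 1-P into the two definitions.

```python
import math

def solexa_q(p_correct):
    """Log-odds score from the post: 10*log10(P / (1 - P))."""
    return 10 * math.log10(p_correct / (1 - p_correct))

def phred_q(p_correct):
    """fastq-style score from the post: -10*log10(E), with E = 1 - P."""
    return -10 * math.log10(1 - p_correct)

def solexa_to_phred(q_solexa):
    """Convert a log-odds score to the -10*log10(E) scale.

    With x = P/(1-P) = 10**(Q/10), E = 1/(1+x), so Q_phred = 10*log10(1+x).
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

# Print both scores side by side for a few probabilities of a correct call.
for p in (0.5, 0.9, 0.97, 0.999):
    qs, qp = solexa_q(p), phred_q(p)
    print(f"P={p:<6} log-odds={qs:6.2f}  -10log10(E)={qp:6.2f}  diff={qp - qs:5.2f}")
```

The gap between the two is -10*log10(P), which is about 3 at P = 0.5 and shrinks toward zero as P approaches 1.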


    • #3
      Thank you! That helps. But what is the basis for the scaling factors?


      • #4
        There's no magic to the scaling factors. They're just there to map the range of Q scores (0-60 or so) into the printable range of ASCII characters, bypassing characters like carriage return and space, which would mess things up. Two different folks got out their ASCII charts and picked two different starting points.
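As a rough illustration of the point about starting points, the sketch below encodes the same integer score with each offset and shows what happens when a character is decoded with the wrong one. The `encode` and `decode` helpers are hypothetical names used only for this example.

```python
def encode(q, offset):
    """Map an integer quality score to a single ASCII character."""
    return chr(q + offset)

def decode(ch, offset):
    """Recover the integer score from its ASCII character."""
    return ord(ch) - offset

q = 30
as_33 = encode(q, 33)   # character 63, '?'
as_64 = encode(q, 64)   # character 94, '^'
print(as_33, as_64)

# Decoding a +64 character as if it were +33 inflates the score by 31.
print(decode(as_64, 33))  # 61
```

Both offsets land in the printable range (character 33 is '!' and character 64 is '@'), so neither file looks obviously wrong by eye, which is why mixing them up is easy.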


        • #5
          So, with an aligner like maq, wouldn't it be beneficial to use the .prb files instead of fastq files? From a prb file you would know what the next most likely base is after the called base. For example (just making up numbers here):

          Just say a position had the scores
          A C G T
          30 20 0 0

          So in the fastq file it would be called as an A, with some lowish quality, but you lose the information that C is also quite likely, and maq would be just as willing to align it against a G or T?

          Am I correct?
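The information at stake can be shown with a small Python sketch that ranks the four scores for one position, using the made-up numbers from this post. The `rank_bases` helper and the A, C, G, T column order are assumptions for illustration, not a parser for real prb files.

```python
BASES = "ACGT"  # assumed column order, matching the example in the post

def rank_bases(scores):
    """Return (base, score) pairs sorted from most to least likely."""
    return sorted(zip(BASES, scores), key=lambda pair: pair[1], reverse=True)

row = [30, 20, 0, 0]  # made-up scores from the post
ranked = rank_bases(row)
called, runner_up = ranked[0], ranked[1]

print("called base:", called)        # ('A', 30)
print("next most likely:", runner_up)  # ('C', 20)
```

A fastq record keeps only the first pair; the runner-up and the two remaining scores are discarded.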


          • #6
            Originally posted by frozenlyse
            So, with an aligner like maq, wouldn't it be beneficial to use the .prb files instead of fastq files? From a prb file you would know what the next most likely base is after the called base. For example (just making up numbers here):

            Just say a position had the scores
            A C G T
            30 20 0 0

            So in the fastq file it would be called as an A, with some lowish quality, but you lose the information that C is also quite likely, and maq would be just as willing to align it against a G or T?

            Am I correct?
            Yep, you are correct. It's something aligners don't currently exploit, and my understanding is that Illumina are looking to get rid of the 4 quality scores, which is a shame. I'm looking into creating a kind of "fast4" sequence format which stores the 4 (or more) quality scores, and at some point I will be generating 4 scores with Swift (the primary data analysis tool I've been working on). If anyone has any interest in this, drop me a line.


            • #7
              I've been performing experiments with Gap5's consensus algorithm using 1 vs 4 confidence values and, as expected, the results show that using all 4 is a significant improvement: about a 20% reduction in incorrect calls, and better discrimination via consensus confidence too.

              I even saw a case of a 2-deep region with reads called T and G where the consensus was (correctly) called C, as it was 2nd highest in both the T and G calls, neither of which had significant G or T in their secondary intensities. For SNP calling I would expect the improvement to be much larger still.
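A toy Python sketch can show how a case like that arises. This is not Gap5's consensus algorithm; it is a naive model that treats each score as a log-odds value, converts it to a probability, normalises per read, and multiplies across reads as if they were independent. The scores and the A, C, G, T column order are invented for illustration.

```python
import math

BASES = "ACGT"  # assumed column order

def to_prob(q):
    """Invert Q = 10*log10(p / (1 - p)) to get p."""
    x = 10 ** (q / 10)
    return x / (1 + x)

def consensus(reads):
    """Pick the base with the highest summed log-probability over all reads."""
    totals = [0.0] * len(BASES)
    for scores in reads:
        probs = [to_prob(q) for q in scores]
        norm = sum(probs)  # normalise so each read's four values sum to 1
        for i, p in enumerate(probs):
            totals[i] += math.log(p / norm)
    best = max(range(len(BASES)), key=lambda i: totals[i])
    return BASES[best]

# Invented scores: one read calls T, the other calls G, C is a close second in both.
reads = [
    [-40, 8, -40, 10],
    [-40, 8, 10, -40],
]
print(consensus(reads))
```

With only the called base and a single quality per read, T and G would simply disagree; with all four values, C is the only base that both reads support.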

              Indeed, the Staden group was pushing for 4 quality values many years ago, to the extent that the SCF standard published in 1992 made provision for storing 4 values per base in the chromatogram files. So I was definitely pleased to *finally* see an instrument manufacturer starting to use them. The idea of log odds is great too. They just need to improve the calibration so all 4 are calibrated rather than just 1.

              James


              • #8
                hear hear !
