  • Illumina Error rate

    Hello everyone,

    This may be a silly question, but there is something I don't understand: the error rate of Illumina sequencing (calculated using the phiX genome) is around 1%, yet most of the reads have a mean quality >= Q30, which corresponds to an error rate of 0.1%.

    So what does the error rate actually mean?

  • #2
    You can't estimate error rates with a linear average of log-transformed values. If you map to phiX with BBMap, like this:

    bbmap.sh in=reads.fq ref=phix.fa mhist=mhist.txt qhist=qhist.txt qahist=qahist.txt

    ...then graph qhist with something like Excel or R, you will see the difference between the linear and logarithmic averages, as well as the actual error rate. The actual error rate will much more closely resemble the lines of the logarithmic averages.

    qahist.txt (quality accuracy histogram), on the other hand, will show the actual measured quality values for each quality score, which will tell you how accurate their quality scores are. Normally they're not too far off, but it highly depends on the specific platform and software version.
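To illustrate why a linear average of quality scores is misleading, here is a minimal Python sketch. It assumes the standard Phred relationship Q = -10 * log10(p); the read and its quality values are hypothetical and do not come from the poster's data.

```python
import math

def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

# Hypothetical 10-base read: nine good bases and one very poor one.
quals = [38] * 9 + [2]

# Linear average of the (log-scaled) quality scores.
linear_mean_q = sum(quals) / len(quals)

# Average the error probabilities first, then convert back to a Q score.
mean_error = sum(phred_to_prob(q) for q in quals) / len(quals)
effective_q = prob_to_phred(mean_error)

print(f"linear mean of Q scores: Q{linear_mean_q:.1f}")
print(f"mean error probability:  {mean_error:.4f}")
print(f"equivalent quality:      Q{effective_q:.1f}")
```

For this made-up read the linear mean is Q34.4, but the expected error rate is about 6.3%, which is equivalent to roughly Q12. A single low-quality base dominates the true error rate while barely moving the linear average.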

    • #3
      Sorry, but I don't understand.

      What's the point of the qhist histogram?

      And about qahist, am I supposed to see that most of my reads have a quality greater than or equal to 35?
      Attached Files

      • #4
        The qhist shows you, per position in the read, what the expected error rate is (that's read1_log) and what the actual average error rate is (that's read1_measured). As you can see, those completely disagree, so the quality scores are very inaccurate (almost meaningless) for this library, and the error rates are high: the average measured quality starts at Q20 (1% error) and drops to Q11 (about 8% error) by the end. The X axis is read position.

        The qahist is plotted incorrectly - you need to plot the "Quality" column as the X-axis, and the "TrueQuality" column as the Y-axis; discard the other columns.
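As a rough sketch of how the two columns could be pulled out for plotting, the Python below reads a tab-delimited table and keeps only "Quality" and "TrueQuality". Only those two column names come from this thread; the rest of the layout (a header line starting with "#", other columns, leading comment lines) and all sample values are assumptions made for illustration, so check them against your own qahist.txt.

```python
import csv
import io

def read_quality_pairs(handle):
    """Return (claimed, measured) quality pairs from a qahist-style table."""
    header = None
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if header is None:
            # Look for the header line that names both columns of interest.
            names = [cell.lstrip("#").strip() for cell in row]
            if "Quality" in names and "TrueQuality" in names:
                header = names
            continue
        if row[0].startswith("#"):
            continue  # skip any later comment lines
        claimed = int(row[header.index("Quality")])
        measured = float(row[header.index("TrueQuality")])
        pairs.append((claimed, measured))
    return pairs

# Made-up sample standing in for a real qahist.txt (assumed layout).
sample = (
    "#Note\tthis line is ignored\n"
    "#Quality\tMatch\tSub\tTrueQuality\n"
    "2\t900\t100\t10.00\n"
    "22\t9500\t500\t13.01\n"
)

pairs = read_quality_pairs(io.StringIO(sample))
for claimed, measured in pairs:
    print(f"claimed Q{claimed} -> measured Q{measured:.2f}")
```

The first element of each pair goes on the X axis and the second on the Y axis. With accurate quality scores the points would fall close to the diagonal y = x.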

        • #5
          Originally posted by Brian Bushnell
          The qhist shows you, per position in the read, what the expected error rate is (that's read1_log) and what the actual average error rate is (that's read1_measured). As you can see, those completely disagree so the quality scores are very inaccurate (almost meaningless) for this library, and that the error rates are high -
          What do you mean by 'expected'?


          I'm not sure I understand the scale.
          Originally posted by Brian Bushnell
          the average error rate starts at Q20 (1%) and drops to Q11 (8%) by the end.
          So I guess the Y axis is the log quality value, 20,000 is Q20, and you are referring to the orange curve, right?



          Originally posted by Brian Bushnell
          The qahist is plotted incorrectly - you need to plot the "Quality" column as the X-axis, and the "TrueQuality" column as the Y-axis; discard the other columns.
          I have redrawn the qahist. Does it also show that the quality ranges from Q8 to Q20?
          Attached Files

          • #6
            Originally posted by ClemBuntu
            What do you mean 'expected' ?
            The quality scores produced by the sequencer are expected (i.e., predicted or calculated) scores; see http://www.illumina.com/documents/pr...ity_scores.pdf for an explanation.

            Originally posted by ClemBuntu
            I'm not sure to understand the scale.

            So I guess the Y axe is the log quality value, 20,000 is Q20 and you say it regarding to the orange curve right ?
            Yes, 20,000 is Q20. The relevant curves are the orange one (expected scores) vs. the gray one (actual scores based on your phiX data). These two curves should overlap if the expected scores accurately reflect the true error rate. They do not. As Brian indicated, the true quality begins at Q20 (1% error) and declines to Q11 (about 8% error) at the end of the read. These quality scores are very low.

            • #7
              Originally posted by ClemBuntu
              I have drawn the qahist again, does it show the quality is from Q8 to Q20 as well ?
              Each point on the qahist indicates the claimed quality (from the quality scores) versus the measured quality (based on the alignment match/mismatch rate). So, for example, you have a point at X=22, Y=13. That means that if you take all bases with a stated quality score of Q22 (roughly 99.3% accuracy), on average, they have an error rate indicating Q13 (roughly 95% accuracy).
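The arithmetic behind that example point can be checked with a short Python sketch. It assumes the standard Phred definition and treats measured quality as -10 * log10(mismatches / aligned bases); the match and mismatch counts are invented for illustration and are not taken from the attached plot.

```python
import math

def phred_to_accuracy(q):
    """Accuracy implied by a Phred score: 1 minus the error probability."""
    return 1 - 10 ** (-q / 10)

def measured_quality(matches, mismatches):
    """Phred-scaled quality implied by observed alignment outcomes."""
    error_rate = mismatches / (matches + mismatches)
    return -10 * math.log10(error_rate)

claimed_accuracy = phred_to_accuracy(22)   # what Q22 promises
observed_accuracy = phred_to_accuracy(13)  # what Q13 corresponds to

# Hypothetical counts: 500 mismatches among 10,000 bases labelled Q22.
observed_q = measured_quality(matches=9500, mismatches=500)

print(f"Q22 claims    {claimed_accuracy:.2%} accuracy")
print(f"Q13 indicates {observed_accuracy:.2%} accuracy")
print(f"5% mismatches correspond to Q{observed_q:.1f}")
```

In other words, a 5% mismatch rate among bases labelled Q22 places the point at roughly Y=13, well below the diagonal.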

              • #8
                Hey Brian, what's the point at (0,60)? It looks like all the Q0s are 99.9999% accurate!

                [ClemBuntu, ignore this message. It's a joke. The lowest quality produced by Illumina is Q2]

                • #9
                  That's actually a valid point! The quality scores vary by platform and software version (I'm guessing this is NextSeq V1 chemistry, or HiSeq 4000). Normally, for non-binned quality scores (like HiSeq 2000), the possible values are 0 (for N), 2, then 5-41. Quite often Q2 bases are more accurate than Q5 bases, as 2 has a special meaning. But sometimes called bases (A, C, G, T) are produced with Q0 assigned. It's normally very few, under 100. I suspect it's a bug in Casava. Because there are so few, it's not uncommon for them to all match the reference. To keep the axes finite, I cap all quality values at 60, but technically, in this case, the Q0 bases are 100% accurate (Q infinity). I wouldn't count on it in general, though.
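The capping described above can be sketched in Python as follows. This is not BBMap's actual code; the function name and the counts are hypothetical, and only the cap of 60 comes from the post.

```python
import math

MAX_Q = 60  # cap mentioned in the post, used to keep the axes finite

def capped_quality(matches, mismatches, cap=MAX_Q):
    """Measured Phred quality, capped so that zero errors stays finite."""
    total = matches + mismatches
    if total == 0:
        return None  # no bases observed at this quality score
    if mismatches == 0:
        return float(cap)  # would otherwise be log10(0), i.e. infinite quality
    return min(float(cap), -10 * math.log10(mismatches / total))

# Hypothetical counts: a handful of Q0 bases that all match the reference.
q0_quality = capped_quality(matches=87, mismatches=0)
# Hypothetical counts for a heavily used quality score with 5% mismatches.
q22_quality = capped_quality(matches=9500, mismatches=500)

print(q0_quality, round(q22_quality, 2))
```

With very few observations, a measured quality of 60 mostly reflects the small sample size rather than genuinely accurate bases, which is why the point at (0,60) should not be over-interpreted.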

                  • #10
                    Thanks, Brian. As always, you're a fount of knowledge.
