  • Illumina Error rate

    Hello everyone,

    This is probably a silly question, but there is something I don't understand: the error rate of Illumina sequencing (calculated using the phiX genome) is around 1%, yet most of the reads have a mean quality >= Q30, which corresponds to an error rate of 0.1%.

    So what does the error rate actually mean?

  • #2
    You can't estimate error rates with a linear average of log-transformed values. If you map to phiX with BBMap, like this:

    bbmap.sh in=reads.fq ref=phix.fa mhist=mhist.txt qhist=qhist.txt qahist=qahist.txt

    ...then graph qhist with something like Excel or R, you will see the difference between the linear and logarithmic averages, as well as the actual error rate. The actual error rate will much more closely resemble the lines of the logarithmic averages.

    qahist.txt (quality accuracy histogram), on the other hand, shows the actual measured quality for each claimed quality score, which tells you how accurate the quality scores are. Normally they're not too far off, but it depends heavily on the specific platform and software version.
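
To make the point about averaging concrete, here is a minimal Python sketch. It is not part of BBMap; the function names and the toy quality values are made up for illustration. It compares the plain arithmetic mean of Phred scores with the score obtained by averaging the error probabilities first and converting back, which is presumably what the logarithmic average in qhist corresponds to.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

def linear_mean_quality(quals):
    """Arithmetic mean of the Phred scores themselves (misleading)."""
    return sum(quals) / len(quals)

def probability_mean_quality(quals):
    """Average the error probabilities, then convert the mean back to Phred."""
    mean_p = sum(phred_to_error(q) for q in quals) / len(quals)
    return error_to_phred(mean_p)

# Toy read (hypothetical values): nine Q40 bases and one Q2 base.
quals = [40] * 9 + [2]
linear = linear_mean_quality(quals)
prob_based = probability_mean_quality(quals)
print(f"linear mean: Q{linear:.1f}")
print(f"probability-based mean: Q{prob_based:.1f}")
```

With these toy values the arithmetic mean is Q36.2, while the averaged error probability is about 6.3%, i.e. roughly Q12. A single poor base dominates the expected error rate even though it barely moves the linear average.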

    • #3
      Sorry, but I don't understand.

      What is the purpose of the qhist histogram?

      And regarding qahist, am I supposed to see that most of my reads have a quality of 35 or higher?
      Attached Files

      • #4
        The qhist shows you, per position in the read, what the expected error rate is (that's read1_log) and what the actual average error rate is (that's read1_measured). As you can see, those completely disagree, so the quality scores are very inaccurate (almost meaningless) for this library, and the error rates are high - the average measured quality starts at Q20 (1% error) and drops to Q11 (8% error) by the end. The X axis is read position.

        The qahist is plotted incorrectly - you need to plot the "Quality" column as the X-axis, and the "TrueQuality" column as the Y-axis; discard the other columns.
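
As a rough illustration of pulling out just those two columns, here is a small Python sketch. It assumes qahist.txt is tab-delimited with a header line starting with `#` that names the `Quality` and `TrueQuality` columns; the other column names and all numbers in `SAMPLE` are invented for the example, so check them against your own file.

```python
def read_quality_pairs(text):
    """Return (claimed, measured) quality pairs from qahist-style text.

    Assumes a tab-delimited layout whose header line starts with '#'
    and contains 'Quality' and 'TrueQuality' fields.
    """
    q_col = t_col = None
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split("\t")
            if "Quality" in fields and "TrueQuality" in fields:
                q_col = fields.index("Quality")
                t_col = fields.index("TrueQuality")
            continue  # skip header and any other comment lines
        if q_col is None:
            raise ValueError("no header with Quality and TrueQuality found")
        fields = line.split("\t")
        pairs.append((float(fields[q_col]), float(fields[t_col])))
    return pairs

# Made-up sample data, not real BBMap output.
SAMPLE = (
    "#Quality\tMatch\tSub\tTrueQuality\n"
    "22\t9500\t500\t13.01\n"
    "30\t9990\t10\t30.00\n"
)
pairs = read_quality_pairs(SAMPLE)
for claimed, measured in pairs:
    print(claimed, measured)
```

The first element of each pair goes on the X axis and the second on the Y axis in whatever plotting tool you prefer; points near the diagonal mean the claimed scores are accurate.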

        • #5
          Originally posted by Brian Bushnell
          The qhist shows you, per position in the read, what the expected error rate is (that's read1_log) and what the actual average error rate is (that's read1_measured). As you can see, those completely disagree, so the quality scores are very inaccurate (almost meaningless) for this library, and the error rates are high -
          What do you mean by 'expected'?


          I'm not sure I understand the scale.
          Originally posted by Brian Bushnell
          the average measured quality starts at Q20 (1% error) and drops to Q11 (8% error) by the end.
          So I guess the Y axis is the log quality value, 20,000 is Q20, and you are referring to the orange curve, right?



          Originally posted by Brian Bushnell

          The qahist is plotted incorrectly - you need to plot the "Quality" column as the X-axis, and the "TrueQuality" column as the Y-axis; discard the other columns.
          I have redrawn the qahist. Does it also show that the quality ranges from Q8 to Q20?
          Attached Files

          • #6
            Originally posted by ClemBuntu
            What do you mean by 'expected'?
            The quality scores produced by the sequencer are expected (i.e., predicted or calculated) scores; see http://www.illumina.com/documents/pr...ity_scores.pdf for an explanation.

            Originally posted by ClemBuntu
            I'm not sure I understand the scale.

            So I guess the Y axis is the log quality value, 20,000 is Q20, and you are referring to the orange curve, right?
            Yes, 20,000 is Q20. The relevant curves are the orange one (expected scores) vs. the gray one (actual scores based on your phiX data). These two curves should overlap if the expected scores accurately reflect the true error rate. They do not. As Brian indicated, the measured quality begins at Q20 and declines to Q11 at the end of the read. These quality scores are very low.

            • #7
              Originally posted by ClemBuntu
              I have redrawn the qahist. Does it also show that the quality ranges from Q8 to Q20?
              Each point on the qahist indicates the claimed quality (from the quality scores) versus the measured quality (based on the alignment match/mismatch rate). So, for example, you have a point at X=22, Y=13. That means that if you take all bases with a stated quality score of Q22 (roughly 99.3% accuracy), on average, they have an error rate indicating Q13 (roughly 95% accuracy).
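
The arithmetic behind such a point can be sketched in Python as follows. This is only an illustration of the Phred formula, not BBMap's actual code; the function names and the match/mismatch counts are hypothetical.

```python
import math

def measured_quality(matches, mismatches):
    """Phred-scaled quality implied by observed match/mismatch counts."""
    total = matches + mismatches
    if total == 0:
        raise ValueError("no aligned bases to measure")
    if mismatches == 0:
        return math.inf  # no observed errors: quality is unbounded
    return -10 * math.log10(mismatches / total)

def accuracy_percent(q):
    """Percent accuracy implied by a Phred quality score."""
    return 100 * (1 - 10 ** (-q / 10))

# Hypothetical counts for all bases that were labelled Q22.
claimed = 22
observed = measured_quality(matches=9500, mismatches=500)
print(f"claimed Q{claimed}: {accuracy_percent(claimed):.2f}% accurate")
print(f"measured Q{observed:.1f}: {accuracy_percent(observed):.2f}% accurate")
```

Here a 5% mismatch rate among bases labelled Q22 works out to a measured quality of about Q13, which is the kind of gap the point at X=22, Y=13 represents.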

              • #8
                Hey Brian, what's the point at (0,60)? It looks like all the Q0s are 99.9999% accurate!

                [ClemBuntu, ignore this message. It's a joke. The lowest quality produced by Illumina is Q2.]

                • #9
                  That's actually a valid point! The quality scores vary by platform and software version (I'm guessing this is NextSeq V1 chemistry, or HiSeq 4000). Normally, for non-binned quality scores (like HiSeq 2000), the values are 0 (for N), 2, and then 5-41. Quite often Q2 bases are more accurate than Q5 bases, as 2 has a special meaning. But sometimes, called bases (A, C, G, T) are produced with Q0 assigned. It's normally very few, under 100. I suspect it's a bug in Casava. Because there are so few, it's not uncommon for them to all match the reference. To keep the axes finite, I cap all quality values at 60, but technically, in this case, the Q0 bases are 100% accurate (Q infinity). I wouldn't count on it in general, though.

                  • #10
                    Thanks, Brian. As always, you're a fount of knowledge.
