  • Illumina quality scores: reliability of the first 2-3 bp?

    Hi All-
    I apologize if this has been asked before. I searched and could not find any answer that addresses this question.

    I was speaking to someone who is quite familiar with running HiSeq machines and asked about the origin of the lower quality scores for the first few bp of every read (things seem to settle down after 3-4 bp). At least for my data set, FastQC clearly shows relatively lower quality scores for the first few bp: 34 for bases 1-4 and 38 for the rest of the read.

    I got an answer I did not expect and cannot find any discussion of.

    This is a rough summary, as I did not take notes. I was told that the quality scores are calculated during the run by an algorithm that uses information about the read quality of a few preceding bases. I am unsure what exactly it looks at, but I think it is not simply the Q score; it is probably something about the ratio of the signal intensities of the various colors relative to each other. The first few bases of a read (of course) cannot fulfill this requirement, so they are artificially assigned lower scores, but they should not be considered to be of lower quality.


    This is really not a serious concern, as the FastQC report shows Q scores of 34 for the first 3-4 bp and then jumps to 38 for basically the rest of the read, so I don't think I am going to lose anything by quality trimming, since 34 is quite good. But I was curious whether the general reasoning behind the "low" quality score for the first few bp is correct, and whether I should ever think about this again with respect to quality trimming.
    Last edited by rufessor; 11-10-2014, 10:41 AM.
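
For scale, the two scores in the question can be converted to error probabilities with the standard Phred relation P = 10^(-Q/10). The sketch below is only illustrative; the function name is made up for this example.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Compare the two scores reported by FastQC in the post above.
for q in (34, 38):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.2e}, about 1 error in {round(1 / p):,} calls")
```

Q34 corresponds to roughly 1 miscall in 2,500 bases and Q38 to roughly 1 in 6,300, so both are far above any commonly used trimming threshold.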

  • #2
    Quality trimming to such high levels is generally not a good idea. For most purposes - mapping, assembly, merging, etc - you will get better results using a lower threshold, below 20. I tend to use 6-10 in general and 15 at the most.
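
To make the threshold concrete, here is a minimal sketch of 3'-end quality trimming. It assumes Phred+33 encoded quality strings and uses a deliberately naive rule (drop trailing bases while their score is below the threshold); it is not the algorithm of any particular trimming tool, and the function name is hypothetical.

```python
def trim_3prime(seq: str, qual: str, threshold: int = 10, offset: int = 33):
    """Drop bases from the 3' end while their Phred score is below `threshold`.

    Assumes `qual` is a Phred+33 encoded string of the same length as `seq`.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# 'C' encodes Q34, 'G' encodes Q38, '#' encodes Q2 in Phred+33.
seq, qual = trim_3prime("ACGTACGTAC", "CCCCGGGG##", threshold=10)
print(seq, qual)
```

With a threshold of 10, only the two Q2 bases at the 3' end are removed; the Q34 bases at the start of the read are nowhere near the cutoff.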

    The quality scores of the first few bases are not accurate. As you said, I believe they are artificially lowered to compensate for the fact that the base caller has not been trained yet, or the cluster locations have not been nailed down precisely. The true accuracy is substantially higher, in general - though that might not be the case when sequencing low-diversity libraries where you selectively amplified some specific gene sequence.

    If you have a reference (or any assembly), you can run BBMap with the flag "mhist=mhist.txt" to produce a histogram of the match/substitution/deletion/insertion rates at every base location in the reads - this is the most accurate way that I know of to determine whether trimming is needed.
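
The idea behind such a histogram can be sketched in a few lines. The example below is not BBMap's implementation and does not read its output format; it assumes a hypothetical input of already aligned, ungapped (read, reference) string pairs and counts substitutions only, ignoring insertions and deletions.

```python
from collections import Counter


def mismatch_rate_by_position(pairs):
    """Return the substitution rate at each read position.

    `pairs` is an iterable of (read, reference) strings that are already
    aligned without gaps, so index i in the read matches index i in the
    reference. Indels are not handled in this sketch.
    """
    totals, mismatches = Counter(), Counter()
    for read, ref in pairs:
        for i, (read_base, ref_base) in enumerate(zip(read, ref)):
            totals[i] += 1
            if read_base != ref_base:
                mismatches[i] += 1
    return [mismatches[i] / totals[i] for i in range(len(totals))]


# Toy data: one of two reads has a substitution at the first position.
rates = mismatch_rate_by_position([("ACGT", "ACGT"), ("TCGT", "ACGT")])
print(rates)
```

If the observed error rate at the first few positions is no higher than in the rest of the read, the lower reported Q scores there are not a reason to trim.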
