  • illumina quality scores first 2-3 bp reliability?

    Hi All-
    I apologize if this has been asked before. I searched and could not find an answer that addresses this question.

    I was speaking to someone who is quite familiar with actually running HiSeq machines and asked about the origin of the lower quality scores for the first few bp of every read (things seem to settle down after 3-4 bp). At least for my data set, FastQC clearly shows relatively lower quality scores for the first few bp: 34 for bases 1-4 and 38 for the rest of the read.

    I got an answer I did not expect and cannot find any discussion about.

    Basically (and this is a rough summary, as I did not take notes), I was told that the quality scores are calculated during the run by an algorithm that uses information about the quality of a few preceding bases. I am unsure what it is looking at, but I think it is not simply the Q score; it is probably something about the signal intensities of the various colors relative to each other. The first few bases of a read (of course) have no preceding bases to satisfy this requirement, so they are artificially assigned lower scores, but they should not be considered to be of lower quality.


    This is really not a serious concern, as the FastQC report shows Q scores of 34 for the first 3-4 bp and then jumps to 38 for basically the rest of the read, so I don't think I am going to lose anything by quality trimming, as 34 is quite good. But I was curious whether the general reasoning behind the "low" quality score for the first few bp is correct, and whether I should ever think about this again with respect to quality trimming reads.
    Last edited by rufessor; 11-10-2014, 10:41 AM.
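
For scale, a Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q34 is roughly 1 error in 2,500 base calls and Q38 roughly 1 in 6,300. A minimal sketch of that conversion follows; the function name is made up for illustration and is not part of FastQC or any Illumina tool.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


# Compare the scores mentioned in the post with the common Q20 reference point.
for q in (20, 34, 38):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.5f}, about 1 error in {round(1 / p):,} bases")
```

On this scale both values in the FastQC report are far above typical trimming thresholds, so the dip at the start of the read would not trigger trimming on its own.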

  • #2
    Quality trimming to such high levels is generally not a good idea. For most purposes - mapping, assembly, merging, etc - you will get better results using a lower threshold, below 20. I tend to use 6-10 in general and 15 at the most.
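
To illustrate why the threshold matters, here is a minimal sketch of naive end trimming. It assumes Phred+33 encoded quality strings and simply strips bases below the threshold from both ends of the read; it is not the algorithm used by any particular trimmer, and the function name is hypothetical.

```python
def trim_low_quality_ends(seq, qual, threshold=10, offset=33):
    """Strip bases whose quality is below `threshold` from both ends of a read.

    Assumes `qual` is a Phred+33 encoded string of the same length as `seq`.
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    # Advance past low-quality bases at the 5' end.
    while start < end and scores[start] < threshold:
        start += 1
    # Back up past low-quality bases at the 3' end.
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]


# Toy read: two Q34 bases ('C'), four Q40 bases ('I'), two Q2 bases ('#').
seq, qual = "ACGTACGT", "CCIIII##"
print(trim_low_quality_ends(seq, qual, threshold=10))
print(trim_low_quality_ends(seq, qual, threshold=35))
```

With a threshold of 10 only the two Q2 bases at the 3' end are removed, while a threshold of 35 also discards the two perfectly usable Q34 bases at the start of the read.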

    The quality scores of the first few bases are not accurate. As you said, I believe they are artificially lowered to compensate for the fact that the base caller has not been trained yet, or the cluster locations have not been nailed down precisely. The true accuracy is substantially higher, in general - though that might not be the case when sequencing low-diversity libraries where you selectively amplified some specific gene sequence.

    If you have a reference (or any assembly), you can run BBMap with the flag "mhist=mhist.txt" to produce a histogram of the match/substitution/deletion/insertion rates at every base location in the reads - this is the most accurate way that I know of to determine whether trimming is needed.
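
The idea behind a per-position error histogram can be sketched as follows. This is only an illustration of the concept, assuming ungapped read/reference pairs of equal length; it does not reproduce BBMap's implementation or the format of its mhist output, and it ignores insertions and deletions.

```python
def substitution_rate_by_position(pairs):
    """Return the fraction of mismatching bases at each read position.

    `pairs` is an iterable of (read, reference_segment) strings that are
    already aligned without gaps and have equal length within each pair.
    """
    mismatches, totals = [], []
    for read, ref in pairs:
        for i, (read_base, ref_base) in enumerate(zip(read, ref)):
            if i == len(totals):
                # First time this position is seen: start its counters.
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if read_base != ref_base:
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]


# Toy data: the second read has a substitution at its first base.
pairs = [("ACGT", "ACGT"), ("TCGT", "ACGT")]
print(substitution_rate_by_position(pairs))
```

If the measured error rate at the first few positions is no higher than in the rest of the read, the lower reported quality scores there are not a reason to trim.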
