  • Periodic Illumina read length distribution after trimming of low-quality bases

    In my NGS data analysis, before mapping, I trimmed low-quality bases (<Q20) from the 3' ends until a high-quality (≥Q20) base appeared. I then plotted the distribution of read lengths and obtained a strange periodic pattern. Please see the attached figure.

    In the graph, length distributions from different lanes or tiles are drawn in different colors. Read frequencies oscillate with a period of 5 bp.
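For readers who want to reproduce the plot, the trimming rule described above can be sketched roughly as follows. This is only a minimal illustration, not the poster's actual script: the function names (`trim_3prime`, `decode_phred`, `length_distribution`) are hypothetical, and the Phred+33 offset is an assumption (older Illumina pipelines wrote Phred+64, so the offset may need changing).

```python
from collections import Counter

def decode_phred(qual_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (offset is assumed)."""
    return [ord(c) - offset for c in qual_string]

def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until a base with quality >= min_q is reached."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def length_distribution(records, min_q=20, offset=33):
    """Count trimmed read lengths over (sequence, quality_string) pairs."""
    counts = Counter()
    for seq, qual_string in records:
        trimmed_seq, _ = trim_3prime(seq, decode_phred(qual_string, offset), min_q)
        counts[len(trimmed_seq)] += 1
    return counts

# Toy reads: 'I' is Q40 and '#' is Q2 under the assumed Phred+33 encoding.
reads = [
    ("ACGTACGTAC", "IIIIIIII##"),  # last two bases trimmed
    ("ACGTACGTAC", "IIIIIIIIII"),  # nothing trimmed
    ("ACGTACGTAC", "III#IIII#I"),  # 3' base is high quality, so nothing is trimmed
    ("ACGTACGTAC", "##########"),  # everything trimmed
]
dist = length_distribution(reads)
for length in sorted(dist):
    print(length, dist[length])
```

Note that this rule only inspects the 3' end, so low-quality bases inside the read are kept, as in the third toy read.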

    I also saw this kind of length distribution in our other RNA-seq and genome sequencing datasets, and in RNA-seq data from SRA as well.

    Does anyone know why such a periodic length distribution appears after trimming?

    Thanks in advance.

  • #2
    I can't remember any details but I do recall hearing once that there is something about the Illumina quality scoring algorithms which creates these 5bp cycles.

    • #3
      Thank you, kmcarr.
      Do you mean that this strange distribution is caused by the base-calling algorithm in the Illumina pipeline?

      Can we simply ignore the periodicity in the length distribution after trimming low-quality bases, or should we worry about it?

      • #4
        Funny that you mention this, I have done something quite similar recently.

        I wanted to find out whether the increase in sequencing errors towards later sequencing cycles (which is equivalent to a drop in Phred quality) can be described by some kind of mathematical formula. I used a couple of sequence files to determine the position at which poor quality starts in each read. A read counted as poor quality when its total number of low-quality basecalls reached a set limit (for the attached figure, there had to be at least 8 quality values below 30). I tried various thresholds (qualities 10, 15, 20, 30), but the graph does not change much.
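One possible reading of the procedure above, sketched in Python. The post does not say exactly how the "starting position" was chosen, so this sketch assumes it is the first cycle whose quality falls below the threshold, in reads that have at least `min_low_calls` such basecalls. The function names are hypothetical.

```python
from collections import Counter

def poor_quality_start(quals, q_threshold=30, min_low_calls=8):
    """Return the 1-based cycle of the first low-quality call, or None.

    Assumption: a read qualifies only if it has at least min_low_calls
    quality values below q_threshold in total.
    """
    low_positions = [i for i, q in enumerate(quals) if q < q_threshold]
    if len(low_positions) < min_low_calls:
        return None
    return low_positions[0] + 1

def start_position_histogram(quality_lists, q_threshold=30, min_low_calls=8):
    """Histogram of poor-quality start positions over many reads."""
    counts = Counter()
    for quals in quality_lists:
        start = poor_quality_start(quals, q_threshold, min_low_calls)
        if start is not None:
            counts[start] += 1
    return counts

# Toy data: the first read has 8 low calls starting at cycle 11,
# the second has only 3 low calls and is therefore skipped.
toy_reads = [[40] * 10 + [20] * 8, [40] * 15 + [20] * 3]
hist = start_position_histogram(toy_reads)
print(dict(hist))
```

Changing `q_threshold` to 10, 15 or 20 corresponds to the alternative thresholds mentioned in the post.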

        Interestingly, the pattern I got did not increase steadily towards later cycles (as I had expected), and I also saw a periodicity of, you might have guessed, 5 bp for the poor-quality starting positions. This indeed seems to be a feature of the Illumina pipeline algorithms used. Even though it looks artefactual and I found it slightly worrying, I don't think one can do much about it, as it is present in all samples irrespective of their origin.

        This led me to the conclusion that the increased error rate one sees towards the end of longer reads is not chemistry or run-time related but seems to be largely the cumulative effect of these spikes of low quality basecalls which are introduced into the reads with a periodicity of 5 bp. Quite odd, isn't it?
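A simple way to check whether a per-position count series really has a 5 bp period is to look at its autocorrelation at small lags. The sketch below is only an illustration on synthetic data, not an analysis of the attached figures. The helper names are hypothetical, and the series is assumed to be a plain list of counts indexed by position.

```python
def autocorrelation(values, lag):
    """Autocorrelation of a series at the given lag (mean-centered, normalized)."""
    n = len(values)
    if lag >= n:
        return 0.0
    mean = sum(values) / n
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0
    return sum(centered[i] * centered[i + lag] for i in range(n - lag)) / denom

def dominant_lag(values, max_lag=10):
    """Lag in 1..max_lag with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(values, lag))

# Synthetic series with a spike at every fifth position.
series = [10 if i % 5 == 0 else 2 for i in range(50)]
print(dominant_lag(series))  # 5 for this synthetic series
```

On real data the spikes sit on top of a trend towards later cycles, so it may help to subtract a smoothed version of the series before computing the autocorrelation.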
        • #5
          Thank you fkrueger.
          I agree, it is strange. I hope this hidden bias will be fixed in the near future.
