  • inbarpl
    Junior Member
    • Jul 2011
    • 4

    duplicate reads in Illumina short, single end reads of RNAseq data

    Dear all,
    I am running QC on FASTQ files of Illumina 76 bp single-end RNA-seq reads. I keep getting indications that there are many PCR duplicates: the FastQC report indicates a 74.1% sequence duplication level, but lists no overrepresented sequences. When I ran SAMtools duplicate removal (rmdup), 74.9% of the sequences were found to be duplicates. Comparing the mapping results before and after duplicate removal, I see that the highly expressed genes have the highest fraction of duplicates (which were removed). This is not the first Illumina single-end short-read RNA-seq experiment in which I have seen this phenomenon.
    I keep wondering whether this is an experimental artifact (in which case, should we repeat the experiment?) or a possibly valid result (in which case I must then believe that different inserts, generated from different copies of the same RNA transcript, were cleaved at the same base, giving inserts with identical 5’ ends that were then sequenced).
    I would be glad to know your opinion on this matter.
    Many thanks
    Inbar
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    With 76-mer single-end reads, even for a perfectly diverse library, the theoretical depth limit at any point is 152 if you use rmdup. So any gene that has more coverage than that ceiling is going to be whacked down to 152x. So you won't be able to quantify expression of those highly expressed genes.

    That library sounds awfully non-diverse, but if your sample is dominated by a couple of genes at super high levels, maybe it's accurate. I guess you could examine the highly represented reads. Do they cover whole genes as if the sample had a huge amount of that RNA? Or is there just one position that has 100K reads, and adjacent positions have much less?
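One rough way to answer the "whole gene or one position" question numerically is to tally reads by start site and strand before de-duplication. The sketch below is only an illustration: `start_site_profile` is a hypothetical helper, and the two read lists are toy data, not values from this thread. A real run would take the (start, strand) pairs from the alignment file.

```python
from collections import Counter


def start_site_profile(reads):
    """Summarise how reads are spread over start sites.

    reads: iterable of (start, strand) tuples for one gene (hypothetical input).
    """
    counts = Counter(reads)
    total = sum(counts.values())
    top_site, top_count = counts.most_common(1)[0]
    return {
        "total_reads": total,
        "distinct_start_sites": len(counts),
        "top_site": top_site,
        # Share of all reads that begin at the single most common site.
        "top_site_fraction": top_count / total,
    }


# Toy example 1: reads spread evenly over 200 start sites (high expression).
broad = [(pos, "+") for pos in range(1000, 1200) for _ in range(5)]

# Toy example 2: one start site holds almost all reads (PCR-like spike).
spike = [(1100, "+")] * 950 + [(pos, "+") for pos in range(1000, 1050)]

broad_profile = start_site_profile(broad)
spike_profile = start_site_profile(spike)
print(broad_profile)
print(spike_profile)
```

A low top-site fraction with many distinct start sites points towards genuinely high coverage; a single site holding most of the reads looks more like amplification of one fragment.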

    • arvid
      Senior Member
      • Jul 2011
      • 156

      #3
      Exactly, I'd have a look at the shape of the read alignments before de-duplication to see whether it looks like PCR or simply very high coverage. 74 % isn't exceptionally high, I usually see 60-80 % for libraries which look OK.
      In any case, de-duplicating reads before downstream quantification is a delicate matter, as it is difficult to tell PCR copies from valid high-coverage reads, as swbarnes2 pointed out.

      • inbarpl
        Junior Member
        • Jul 2011
        • 4

        #4
        swbarnes2, Thanks a lot for your answer,
        I guess this is exactly the case in my data set: the samples are from Arabidopsis, so I guess that the Rubisco gene dominates the library. I will check what you've recommended using IGV. Sorry for my ignorance, but could you please explain the definition of "theoretical depth limit" and the calculation you did to derive it for my parameters?
        many thanks
        Inbar

        • swbarnes2
          Senior Member
          • May 2008
          • 910

          #5
          If you filter single-end data for uniqueness, you will be left with at most two reads beginning at any position: one in the forward direction, one in the reverse.

          So with 76-mers, the base at position 100 will be covered by at most 152 reads: 76 in the forward direction, starting at bases 25-100, and 76 in the reverse direction, starting at bases 100-175. You can't have three reads all running forward, starting at position 75, because your rmdup will get rid of two of them.
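The arithmetic behind that ceiling can be written out as a small sketch. The function names are made up for illustration, and the only assumption is the one stated above: after de-duplication, at most one read survives per start position and strand.

```python
def single_end_dedup_ceiling(read_length):
    """Maximum depth at one base when only one read per (start, strand) survives."""
    # read_length possible forward starts plus read_length possible reverse starts.
    return 2 * read_length


def forward_start_range(position, read_length):
    """Leftmost and rightmost start of a forward read that still covers `position`."""
    return position - read_length + 1, position


def reverse_start_range(position, read_length):
    """Range of 5' ends (rightmost coordinates) of reverse reads covering `position`."""
    return position, position + read_length - 1


ceiling = single_end_dedup_ceiling(76)
fwd = forward_start_range(100, 76)
rev = reverse_start_range(100, 76)
print(ceiling, fwd, rev)
```

For 76-mers and position 100, each strand has 76 possible start sites, which is where the 2 x 76 = 152 figure comes from.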

          With paired-end data, you can have three reads that run in the forward direction starting at base 75, provided their mates all start at different sites, because if their mates are at different sites, they must have come from different fragments. So there's a ceiling there too, depending on how variable your insert sizes are, but it's far higher than the ceiling for single-end runs.
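A very rough sketch of how the paired-end ceiling scales with insert-size spread follows. It assumes fragments sharing a start and strand are told apart only by insert size, and it counts only one read of each pair covering the base, so treat the result as an order-of-magnitude estimate. The function names and the 200-300 bp insert range are hypothetical, not from this thread.

```python
def paired_end_fragments_per_start(min_insert, max_insert):
    """Distinct fragments allowed per (start, strand): one per possible insert size."""
    return max_insert - min_insert + 1


def paired_end_ceiling_estimate(read_length, min_insert, max_insert):
    """Rough depth ceiling: the single-end ceiling times the number of insert sizes."""
    per_start = paired_end_fragments_per_start(min_insert, max_insert)
    return 2 * read_length * per_start


# Hypothetical library: 76-mers, insert sizes anywhere from 200 to 300 bp.
per_start = paired_end_fragments_per_start(200, 300)
estimate = paired_end_ceiling_estimate(76, 200, 300)
print(per_start, estimate)
```

With a single fixed insert size the estimate collapses back to the single-end value of 152; the wider the insert-size distribution, the higher the ceiling.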
