  • gary
    Member
    • Dec 2009
    • 16

    Duplication level of RNA-seq data

    Hello everyone!

    I got my 100 bp paired-end RNA-seq data today, and FastQC reports a duplication rate above 60%. I searched around and found that a high duplication level is common with RNA-seq. Is that normal?

    Should I remove the duplicates? I have read in this forum that if duplicates are removed, highly expressed genes can no longer be compared, since the maximum depth of coverage at any one position is then 200 with 100 bp reads.

    thanks!
  • Chipper
    Senior Member
    • Mar 2008
    • 323

    #2
    200 is the maximum coverage per base only if you have unstranded single-end reads. With paired ends you can have many fragments starting at the same point, as long as the other end differs.

    FastQC only looks at the read sequences. What you should do is calculate the library complexity after alignment (e.g. with Picard). Looking at the alignments at lowly expressed genes also helps to determine whether the library is over-sequenced.
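
To make the 200x ceiling concrete, here is a minimal sketch that counts how many distinct reads can cover one base once duplicates are collapsed. The function names and the 150-350 bp fragment-length range are illustrative assumptions, not values from this thread, and the paired-end count only considers fragments whose first read covers the base, so it is a lower bound.

```python
def max_dedup_coverage_single_end(read_length, strands=2):
    """Count distinct (strand, start) combinations whose read covers one base."""
    pos = 10_000  # arbitrary base, far from the transcript ends
    covering = {
        (strand, start)
        for strand in range(strands)
        for start in range(pos - read_length + 1, pos + 1)
    }
    return len(covering)


def max_dedup_coverage_paired_end(read_length, min_fragment, max_fragment, strands=2):
    """Same count when a duplicate needs BOTH ends of the fragment to match.

    Only fragments whose first read covers the base are counted,
    so this is a lower bound on the real ceiling.
    """
    pos = 10_000
    covering = {
        (strand, start, start + fragment_length)
        for strand in range(strands)
        for start in range(pos - read_length + 1, pos + 1)
        for fragment_length in range(min_fragment, max_fragment + 1)
    }
    return len(covering)


single = max_dedup_coverage_single_end(100)
paired = max_dedup_coverage_paired_end(100, 150, 350)  # hypothetical size range
print(single, paired)  # prints: 200 40200
```

With 100 bp single-end reads there are only 100 possible start positions per strand over any base, hence 200. Once the mate position is part of the duplicate key, the ceiling grows with the spread of fragment lengths.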

    • gary
      Member
      • Dec 2009
      • 16

      #3
      Thank you Chipper! I will try your suggestions.

      In your experience, do you think the >=60% duplication level reported by FastQC is too high? Or will I need to look at the alignment results to tell?

      • NextGenSeq
        Senior Member
        • Apr 2009
        • 482

        #4
        What was the yield? We usually aim for less than 100 ng/ul but some RNA samples amplify better than others.

        • arkal
          advancing one byte at a time!
          • Jun 2011
          • 56

          #5
          In my experience (from what I've seen and what I've read), you can expect to see 60-90% duplication when you run FastQC on RNA-seq data. I don't really know how that correlates with the real picture, though.

          • TonyBrooks
            Senior Member
            • Jun 2009
            • 303

            #6
            FastQC bases its duplication estimates on the first 50 bp of sequence only. It also makes no allowance for paired-end data. High FastQC duplication rates for RNA-Seq are normal.
            To get a better idea you need to look at the mapping coordinates.
            As Chipper says, use the Picard library complexity estimator. For example, I just ran an RNA-Seq sample with 64.59% (Read 1) and 58.13% (Read 2) FastQC duplication, but only 0.32% duplication using the Picard library complexity estimator.
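
A toy illustration of why a per-read estimate can be so much higher than a pair-aware one. This is not FastQC's or Picard's actual algorithm; the fragments are hypothetical coordinates chosen so that every fragment is unique while many share a start position.

```python
def duplication_rate(keys):
    """Fraction of items whose key has already been seen."""
    keys = list(keys)
    return 1 - len(set(keys)) / len(keys)


# Hypothetical fragments from one highly expressed transcript:
# 10 start positions, each paired with 5 different end positions.
fragments = [(start, start + 150 + 10 * k) for start in range(10) for k in range(5)]

# Looking at read 1 alone, every fragment sharing a start looks like a duplicate.
read1_only = duplication_rate(start for start, _ in fragments)

# Using both ends as the key, all 50 fragments are distinct.
pair_aware = duplication_rate(fragments)

print(f"read 1 only: {read1_only:.0%}, pair-aware: {pair_aware:.0%}")
# prints: read 1 only: 80%, pair-aware: 0%
```

The same 50 fragments give 80% or 0% duplication depending only on whether the mate is taken into account, which is the direction of the gap described above.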

            • Baoqing
              Member
              • Jan 2013
              • 24

              #7
              Interpretation of the PICARD results

              Hi guys,
              I am also trying to estimate library complexity with Picard for my paired-end data, using the TopHat-aligned reads as input. The Picard documentation describes INPUT as "One or more files to combine and estimate library complexity from." What exactly does that mean? Do multiple inputs mean the BAM files from each biological replicate?

              If so, how do I add them? Should I just list the extra files in the INPUT argument?

              java -Xmx2g -jar ~/Desktop/apps/picard-tools-1.92/EstimateLibraryComplexity.jar INPUT=accepted_hits025.bam <more bam files here?> OUTPUT=picard_file MIN_IDENTICAL_BASES=6 MAX_DIFF_RATE=0.02

              I also got a table from running the command. Any clue how to interpret it? I have pasted part of the table below (it is too long to attach in full).

              Thank you in advance!
              ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
              LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
              Unknown 0 6501528 0 0 2054652 1644248 0.316026 27100851

              ## HISTOGRAM java.lang.Integer
              duplication_group_count Unknown
              1 3808007
              2 376471
              3 78976
              4 57845
              5 19068
              6 26550
              7 8180
              8 17601
              9 4826
              10 10475
              11 3229
              12 5810
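
As a reading aid for the metrics row above, the sketch below pairs each column name with its value and recomputes PERCENT_DUPLICATION from the raw counts. The formula (each duplicate pair counted as two reads) is my understanding of how Picard derives this column; treat it as an assumption and check it against the Picard documentation for your version.

```python
header = (
    "LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS "
    "UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES "
    "READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE"
)
values = "Unknown 0 6501528 0 0 2054652 1644248 0.316026 27100851"

# Pair each column name with its value for easier reading.
metrics = dict(zip(header.split(), values.split()))
for name, value in metrics.items():
    print(f"{name:30s} {value}")

# Assumed formula: duplicates / examined, with each pair counted as two reads.
duplicate_reads = (
    int(metrics["UNPAIRED_READ_DUPLICATES"]) + 2 * int(metrics["READ_PAIR_DUPLICATES"])
)
examined_reads = (
    int(metrics["UNPAIRED_READS_EXAMINED"]) + 2 * int(metrics["READ_PAIRS_EXAMINED"])
)
recomputed = duplicate_reads / examined_reads
print(f"recomputed duplication fraction: {recomputed:.6f}")
```

Despite its name, PERCENT_DUPLICATION is reported as a fraction, so 0.316026 corresponds to roughly 31.6% of read pairs being flagged as duplicates.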

              • Chipper
                Senior Member
                • Mar 2008
                • 323

                #8
                Picard estimates that your library contains 27,100,851 unique molecules (the ESTIMATED_LIBRARY_SIZE column).
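
For anyone wondering where that number comes from, here is a minimal sketch of a Lander-Waterman style estimate: find the library size x for which sampling n read pairs would be expected to yield c distinct ones, i.e. solve c/x = 1 - exp(-n/x). Excluding optical duplicates from n is an assumption about how Picard sets up the calculation, and the function name is my own.

```python
import math


def estimate_library_size(read_pairs, unique_read_pairs):
    """Solve c/x = 1 - exp(-n/x) for x by bisection."""
    n, c = read_pairs, unique_read_pairs
    if not 0 < c < n:
        raise ValueError("need 0 < unique_read_pairs < read_pairs")

    def f(x):
        return c / x - 1 + math.exp(-n / x)

    lo, hi = 1.0, 100.0  # search bracket, in multiples of c
    while f(hi * c) >= 0:  # widen until the root is bracketed
        hi *= 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid * c) > 0:
            lo = mid
        else:
            hi = mid
    return c * (lo + hi) / 2


# Counts taken from the metrics table posted above.
read_pairs_examined = 6501528
read_pair_duplicates = 2054652
optical_duplicates = 1644248

n = read_pairs_examined - optical_duplicates    # assumed: optical duplicates excluded
c = read_pairs_examined - read_pair_duplicates  # distinct read pairs observed
estimate = estimate_library_size(n, c)
print(f"{estimate:,.0f}")
```

Under these assumptions the result should land close to the ESTIMATED_LIBRARY_SIZE in the table. The estimate refers to distinct fragments in the sequencing library rather than mRNAs or exons, and it can exceed the number of read pairs sequenced because it extrapolates from how often fragments were seen more than once.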

                • Baoqing
                  Member
                  • Jan 2013
                  • 24

                  #9
                  Thanks. It is still not quite clear to me. Does that mean 27,100,851 mRNAs, exons, or something else? It seems unlikely to be mRNAs, since the total number of read pairs examined is 6,501,528. Also, duplication_group_count looks like some kind of ID; how can I go back and check those duplicated reads? I would really appreciate your help in clarifying this.
