  • bhuv74
    Junior Member
    • Jun 2010
    • 6

    Duplicate reads issue

    Hi,

    We are consistently seeing 65% to 85% duplicate reads in our RNA-seq experiments. We use the TruSeq and NuGEN protocols for library prep. We have run many RNA-seq experiments (both single-read and paired-end) with both protocols and consistently see a high duplicate percentage. I know there have been many discussions about the duplicate reads issue. I would like to know whether this is a common issue that many people face with RNA-seq experiments. I would appreciate any suggestions for modifications to library prep that would improve our results.

    Thanks in advance.
  • pmiguel
    Senior Member
    • Aug 2008
    • 2328

    #2
    Originally posted by bhuv74 View Post
    Hi,

    We are consistently seeing 65% to 85% duplicate reads in our RNA-seq experiments. We use the TruSeq and NuGEN protocols for library prep. We have run many RNA-seq experiments (both single-read and paired-end) with both protocols and consistently see a high duplicate percentage. I know there have been many discussions about the duplicate reads issue. I would like to know whether this is a common issue that many people face with RNA-seq experiments. I would appreciate any suggestions for modifications to library prep that would improve our results.

    Thanks in advance.
    "Duplicate" in what sense? Sharing the same start point for one read, both reads, something else? What is your assay for identifying a "duplicate" read?

    --
    Phillip

    Comment

    • bhuv74
      Junior Member
      • Jun 2010
      • 6

      #3
      Thanks, Phillip.

      We use the FastQC and FASTX tools to look at duplicate reads, and our bioinformatics group also uses an internal program to check before and after mapping. Our assumption is that duplicate reads share the same starting point and match exactly over the whole read.

      Bhuv

      Comment

      • blancha
        Senior Member
        • May 2013
        • 367

        #4
        Just out of curiosity.

        Are these mammalian transcriptomes?
        Do you have any reason to believe the expression levels would be skewed towards a few genes?
        Are you using poly-A enrichment or ribosomal depletion?
        What is the starting amount of RNA?
        How many PCR cycles?

        Comment

        • pmiguel
          Senior Member
          • Aug 2008
          • 2328

          #5
          Originally posted by bhuv74 View Post
          Thanks, Phillip.

          We use the FastQC and FASTX tools to look at duplicate reads, and our bioinformatics group also uses an internal program to check before and after mapping. Our assumption is that duplicate reads share the same starting point and match exactly over the whole read.

          Bhuv
          "Your assumption"?

          FastQC does not do the assay you describe above. See here. FastQC checks the first 200,000 sequences in a fastq file for duplication, and sequences longer than 75 bases are truncated to their first 50 bases before they are compared. Note -- there is no check at all of the corresponding paired read.
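As a rough illustration of the check described above (not FastQC's actual code), the sketch below counts repeats among the first 200,000 reads after truncating reads longer than 75 bases to their first 50. The function name and the in-memory list of read sequences are assumptions made for the example; the cutoffs are the ones quoted in this post.

```python
from collections import Counter


def prefix_duplicate_fraction(reads, max_reads=200_000, long_read=75, keep=50):
    """Fraction of the examined reads that repeat an earlier sequence.

    Only the first `max_reads` reads are examined, and reads longer than
    `long_read` bases are truncated to their first `keep` bases.
    The mate of a paired read is never consulted.
    """
    counts = Counter()
    examined = 0
    for seq in reads:
        if examined >= max_reads:
            break
        key = seq[:keep] if len(seq) > long_read else seq
        counts[key] += 1
        examined += 1
    if examined == 0:
        return 0.0
    duplicates = examined - len(counts)  # every copy beyond the first
    return duplicates / examined


# Two 100-base reads that differ only after base 50 count as duplicates here.
demo = ["A" * 100, "A" * 50 + "C" * 50, "G" * 40]
print(f"{prefix_duplicate_fraction(demo):.3f}")  # 0.333
```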

          This means that if you have a 2000 nt. transcript that is fragmented completely randomly into minimum 100 nt. fragments, there are only 1900 possible starting places. 3800, if you include the reverse complement strand. So if your data set collected 7200 read pairs for a given message, roughly half of them (at least 3400, about 47%) would have to duplicate a previous read's starting position. That is the minimum, the best case.
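The lower bound in this argument is a pigeonhole count: once there are more reads than start sites, every extra read must reuse a site. A minimal sketch, using the 2000 nt transcript and 100 nt minimum fragment from the post; the function names are hypothetical, and the "expected" figure additionally assumes every read picks a start site uniformly at random, which real libraries do not do. With 3800 start sites, 7200 reads force at least 3400 repeats (about 47%), and 7600 reads force exactly half.

```python
def possible_starts(transcript_len, min_fragment, both_strands=True):
    """Number of distinct read start positions, counted as in the post."""
    starts = transcript_len - min_fragment
    return 2 * starts if both_strands else starts


def min_duplicate_fraction(n_reads, n_starts):
    """Pigeonhole lower bound: reads beyond the number of start sites must repeat one."""
    if n_reads <= 0:
        return 0.0
    return max(0, n_reads - n_starts) / n_reads


def expected_duplicate_fraction(n_reads, n_starts):
    """Expected repeat fraction if each read picks a start site uniformly at random."""
    if n_reads <= 0:
        return 0.0
    expected_distinct = n_starts * (1.0 - (1.0 - 1.0 / n_starts) ** n_reads)
    return 1.0 - expected_distinct / n_reads


starts = possible_starts(2000, 100)  # 3800 start sites on both strands
for n in (3800, 7200, 7600):
    low = min_duplicate_fraction(n, starts)
    avg = expected_duplicate_fraction(n, starts)
    print(f"{n} reads: at least {low:.1%} duplicates, about {avg:.1%} under random sampling")
```

The random-sampling figure is always at least as large as the pigeonhole bound, which is why the post calls the bound the best case.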

          --
          Phillip

          Comment

          • bhuv74
            Junior Member
            • Jun 2010
            • 6

            #6
            We always use a large amount of starting material and try to keep the number of PCR cycles as low as possible. For the TruSeq protocol, we use 1 ug of starting material and 8 PCR cycles for amplification. I am not sure this is a low-complexity issue, because we see this high % in all sample types.

            Let me know if you need any additional information to suggest.

            Comment

            • bhuv74
              Junior Member
              • Jun 2010
              • 6

              #7
              Originally posted by blancha View Post
              Just out of curiosity.

              Are these mammalian transcriptomes?
              Do you have any reason to believe the expression levels would be skewed towards a few genes?
              Are you using poly-A enrichment or ribosomal depletion?
              What is the starting amount of RNA?
              How many PCR cycles?
              1. Yes. Most experiments are mammalian transcriptomes.
              2. In some experiments we found that several genes accounted for most of the duplicate reads. That could be possible, but we are seeing a consistently high duplicate %; that's our worry.
              3. We use ribosomal depletion for all our experiments.
              4. 1 ug for truseq protocol and 20 to 30ng for Ovation protocol.
              5. We use 8 PCR cycles for amplification.

              Bhuv

              Comment

              • bilyl
                Member
                • Aug 2013
                • 52

                #8
                Originally posted by bhuv74 View Post
                1. Yes. Most experiments are mammalian transcriptomes.
                2. In some experiments we found that several genes accounted for most of the duplicate reads. That could be possible, but we are seeing a consistently high duplicate %; that's our worry.
                3. We use ribosomal depletion for all our experiments.
                4. 1 ug for truseq protocol and 20 to 30ng for Ovation protocol.
                5. We use 8 PCR cycles for amplification.

                Bhuv
                The most important question is how many reads are you using for your analysis?

                Comment

                • snetmcom
                  Senior Member
                  • Oct 2008
                  • 159

                  #9
                  Did you perform any library QC? Are you certain the ribo-depletion was a success? A failed depletion could account for a surplus of duplicate reads.

                  Comment

                  • thomasblomquist
                    Member
                    • Jul 2012
                    • 68

                    #10


                    See above. Careful what you assume is a "duplicate."

                    There are fragmentation and ligation hotspots that may mimic duplicate reads of an amplified target. Need internal controls to control for efficiency of capture for accurate quantification. http://journals.plos.org/plosone/art...l.pone.0079120

                    Comment

                    • pmiguel
                      Senior Member
                      • Aug 2008
                      • 2328

                      #11
                      If you are really concerned about PCR duplicates, then you might want to do qPCR prior to PCR amplification. Then you have some feel for the total pool of pre-amp amplifiable library molecules present in the library.

                      But this all probably stems from FastQC's big red X phenomenon. A trip to FastQC's documentation should calm your fears:

                      In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence highly expressed transcripts, and this will potentially create a large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites for example, or an unfragmented small RNA library) then the constrained start sites will generate huge duplication levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates.
                      But I think there must be better tools for assessing the amount of PCR duplication in a library.
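One way to picture the random-barcoding idea from the FastQC documentation quoted above: if each library molecule carries a random barcode added before amplification, reads sharing both a mapping position and a barcode are likely PCR copies, while reads sharing only the position are likely separate molecules. The sketch below is a toy under that assumption. The record layout (chromosome, start, strand, barcode) and the function name are hypothetical, and it ignores sequencing errors in the barcode, which real tools have to handle.

```python
from collections import defaultdict


def classify_duplicates(records):
    """Count reads, distinct mapping positions, and distinct (position, barcode) molecules.

    `records` is an iterable of (chrom, start, strand, barcode) tuples.
    Returns (total_reads, distinct_positions, distinct_molecules).
    """
    barcodes_at = defaultdict(set)
    total = 0
    for chrom, start, strand, barcode in records:
        barcodes_at[(chrom, start, strand)].add(barcode)
        total += 1
    distinct_positions = len(barcodes_at)
    distinct_molecules = sum(len(codes) for codes in barcodes_at.values())
    return total, distinct_positions, distinct_molecules


reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # same position and barcode: likely a PCR copy
    ("chr1", 100, "+", "TTAG"),  # same position, new barcode: a separate molecule
    ("chr1", 250, "-", "ACGT"),
]
total, positions, molecules = classify_duplicates(reads)
print(f"position-based duplicate rate: {1 - positions / total:.2f}")  # 0.50
print(f"barcode-aware duplicate rate:  {1 - molecules / total:.2f}")  # 0.25
```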

                      Finally, let me just point a finger at Illumina -- guys this is entirely your fault. You should have a low concentration library denaturation/neutralization/clustering protocol that works on double stranded libraries.

                      Think about the assay we are doing here. Once you ligate the adapters on, why do a PCR amplification? What does it gain you? A bunch of headaches really. But it gets you up to a >2nM library concentration that can be used in a standard Illumina denaturation/neutralization. You start at 2nM, then dilute down to 0.02nM (or lower) to cluster. That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. Near as I can tell the only reason to do this is so that a buffer can then be used to neutralize, rather than an acid.

                      Okay, also, once amplified you can easily use an Agilent chip to assay your library without worrying about being below the sensitivity threshold.

                      --
                      Phillip

                      Comment

                      • bhuv74
                        Junior Member
                        • Jun 2010
                        • 6

                        #12
                        Originally posted by bilyl View Post
                        The most important question is how many reads are you using for your analysis?
                        We use ~ 55 to 60 M reads for the analysis per sample

                        Comment

                        • bhuv74
                          Junior Member
                          • Jun 2010
                          • 6

                          #13
                          Originally posted by snetmcom View Post
                          Did you perform any library QC? Are you certain the ribo-depletion was a success? A failed depletion could account for a surplus of duplicate reads.
                          We check the library size distribution using a Bioanalyzer. The average peak size of the libraries is close to 260 bp. We quantify the libraries using Kapa qPCR.

                          We never QC'd the samples post ribo-depletion, since the TruSeq protocol doesn't recommend checking the ribo-depleted samples. For the libraries I am currently preparing, I quantified and QC'd the post-ribo-depletion samples. The traces look good.

                          Bhuv

                          Comment

                          • bilyl
                            Member
                            • Aug 2013
                            • 52

                            #14
                            Originally posted by pmiguel View Post
                            If you are really concerned about PCR duplicates, then you might want to do qPCR prior to PCR amplification. Then you have some feel for the total pool of pre-amp amplifiable library molecules present in the library.

                            But this all probably stems from FastQC's big red X phenomenon. A trip to FastQC's documentation should calm your fears:



                            But I think there must be better tools for assessing the amount of PCR duplication in a library.

                            Finally, let me just point a finger at Illumina -- guys this is entirely your fault. You should have a low concentration library denaturation/neutralization/clustering protocol that works on double stranded libraries.

                            Think about the assay we are doing here. Once you ligate the adapters on, why do a PCR amplification? What does it gain you? A bunch of headaches really. But it gets you up to a >2nM library concentration that can be used in a standard Illumina denaturation/neutralization. You start at 2nM, then dilute down to 0.02nM (or lower) to cluster. That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. Near as I can tell the only reason to do this is so that a buffer can then be used to neutralize, rather than an acid.

                            Okay, also, once amplified you can easily use an Agilent chip to assay your library without worrying about being below the sensitivity threshold.

                            --
                            Phillip
                            I don't think your comment about the PCR library is quite right.

                            Denaturing by NaOH and neutralizing with buffer doesn't necessitate pumping up the DNA concentration. Let's use the MiSeq as an example. You start with 5ul of a 2nM library, which is 0.01 picomoles of DNA. After denaturation with NaOH and dilution, you take 600ul (0.006 pmol) for sequencing (assuming negligible PhiX). So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.
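The MiSeq arithmetic above can be checked with a few lines. This is only a unit-conversion sketch: the helper name is made up for the example, and the 10 pM loading concentration is an assumption implied by 0.006 pmol in 600 ul. On these numbers the starting amount is a little under 2x what gets loaded.

```python
def picomoles(conc_nM, volume_ul):
    """Amount in pmol: nM x ul = 1e-9 mol/L x 1e-6 L = 1e-15 mol = 1e-3 pmol."""
    return conc_nM * volume_ul * 1e-3


start_pmol = picomoles(2.0, 5.0)       # 5 ul of 2 nM library
loaded_pmol = picomoles(0.010, 600.0)  # 600 ul at an assumed 10 pM (0.010 nM)

print(f"start:  {start_pmol:.3f} pmol")               # 0.010 pmol
print(f"loaded: {loaded_pmol:.3f} pmol")              # 0.006 pmol
print(f"excess: {start_pmol / loaded_pmol:.2f}x")     # 1.67x
```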

                            With regards to the actual denaturation protocol, I'm certain something like denaturing a library in hot formamide and immediately diluting to 1ml with cold buffer will work too. You'd have negligible amounts of formamide left over, but I'm not sure what the advantage of this would be compared to NaOH.

                            Comment

                            • pmiguel
                              Senior Member
                              • Aug 2008
                              • 2328

                              #15
                              Originally posted by bilyl View Post
                              I don't think your comment about the PCR library is quite right.

                              Denaturing by NaOH and neutralizing with buffer doesn't necessitate pumping up the DNA concentration. Let's use the MiSeq as an example. You start with 5ul of a 2nM library, which is 0.01 picomoles of DNA. After denaturation with NaOH and dilution, you take 600ul (0.006 pmol) for sequencing (assuming negligible PhiX). So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.

                              With regards to the actual denaturation protocol, I'm certain something like denaturing a library in hot formamide and immediately diluting to 1ml with cold buffer will work too. You'd have negligible amounts of formamide left over, but I'm not sure what the advantage of this would be compared to NaOH.
                              Yeah, I meant for the cbot/HiSeq.

                              The MiSeq has a much poorer yield of clusters per molecule of library added to the cassette than the HiSeq. 2nM is roughly 1.2 billion amplicons/ul. So if we start with 5 ul we are at 6 billion library amplicons. How many clusters do you get per v2 MiSeq run? 15 million? That corresponds to about 0.25% of the library amplicons yielding clusters. If you only count the 0.6 ml of, let's say, 10pM library that gets loaded into the cassette, then 3.6 billion amplicons are loaded. So that gives you a 0.4% yield.

                              The cBot actually does a better job yield-wise. 120 ul of 15pM library (just over 1 billion library molecules) gives us around 40 million clusters in the lane. 4% yield.

                              Anyway, what I want is to be able to start with 20 ul of my library (unamplified) at 50pM or so and be able to cluster that to a reasonable density. That is 600 million library molecules -- like 20x more than the number of clusters I am going to get. 5% yield, so near what a cBot can deliver. To do that I could add an equal volume of NaOH, for 40 ul total. Then whatever is required to neutralize that has to be in 80 ul or less. 120 ul is what I load into a lane for a cBot run. Is that too much to ask?
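The molecule counts and yields in this post follow from concentration, volume, and Avogadro's number. A small sketch that reproduces them; the cluster counts (15 million for a v2 MiSeq run, 40 million for a cBot lane) are the rough figures from the post, not measured values, and the function names are only illustrative.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules(conc_pM, volume_ul):
    """Number of molecules in `volume_ul` microlitres at `conc_pM` picomolar."""
    moles = conc_pM * 1e-12 * volume_ul * 1e-6
    return moles * AVOGADRO


def cluster_yield(clusters, conc_pM, volume_ul):
    """Fraction of library molecules that end up as clusters."""
    return clusters / molecules(conc_pM, volume_ul)


cases = [
    ("MiSeq, 5 ul of 2 nM stock", 15e6, 2000.0, 5.0),
    ("MiSeq, 600 ul loaded at 10 pM", 15e6, 10.0, 600.0),
    ("cBot, 120 ul at 15 pM", 40e6, 15.0, 120.0),
]
for label, clusters, conc_pM, vol_ul in cases:
    n = molecules(conc_pM, vol_ul)
    y = cluster_yield(clusters, conc_pM, vol_ul)
    print(f"{label}: {n:.2e} molecules, yield {y:.2%}")

# The requested protocol: 20 ul of unamplified library at 50 pM.
wanted = molecules(50.0, 20.0)
print(f"requested input: {wanted:.2e} molecules, {0.05 * wanted:.2e} clusters at 5% yield")
```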

                              --
                              Phillip
                              Last edited by pmiguel; 06-05-2014, 04:53 AM.

                              Comment
