  • Duplicate reads issue

    Hi,

    We are consistently seeing 65% to 85% duplicate reads in our RNA-seq experiments. We use the TruSeq and NuGEN protocols for library prep. We have done many RNA-seq experiments (both single-read and paired-end) with both protocols and see a high duplicate percentage consistently. I know there have been many discussions about the duplicate reads issue. I would like to know whether this is a common issue many people face with RNA-seq experiments. I would appreciate any suggestions for library-prep modifications that would improve our results.

    Thanks in advance.

  • #2
    Originally posted by bhuv74 View Post
    Hi,

    We are consistently seeing 65% to 85% duplicate reads in our RNA-seq experiments. We use the TruSeq and NuGEN protocols for library prep. We have done many RNA-seq experiments (both single-read and paired-end) with both protocols and see a high duplicate percentage consistently. I know there have been many discussions about the duplicate reads issue. I would like to know whether this is a common issue many people face with RNA-seq experiments. I would appreciate any suggestions for library-prep modifications that would improve our results.

    Thanks in advance.
    "Duplicate" in what sense? Sharing the same start point for one read, both reads, something else? What is your assay for identifying a "duplicate" read?

    --
    Phillip

    Comment


    • #3
      Thanks, Phillip.

      We use FastQC and the FASTX-Toolkit to look at duplicate reads, and our bioinformatics group also uses an internal program to check before and after mapping. Our assumption is that duplicate reads share the same starting point and match exactly over their full length.

      Bhuv
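
      [Editor's note: as a minimal sketch of the criterion Bhuv describes, duplicates can be counted from a coordinate-mapped BAM by keying reads on reference, strand, and 5' position (an exact-sequence check could be added to the key). The use of pysam is an assumption for illustration; the posters' internal program is not public, and 'sample.bam' is a placeholder.]

      import pysam
      from collections import Counter

      def duplicate_fraction(bam_path):
          """Fraction of mapped reads sharing (reference, strand, 5' start)."""
          keys = Counter()
          total = 0
          with pysam.AlignmentFile(bam_path, "rb") as bam:
              for read in bam:
                  if read.is_unmapped or read.is_secondary or read.is_supplementary:
                      continue
                  # Use the 5' end: reference_end for reverse-strand reads.
                  pos = read.reference_end if read.is_reverse else read.reference_start
                  keys[(read.reference_id, read.is_reverse, pos)] += 1
                  total += 1
          duplicates = sum(n - 1 for n in keys.values())
          return duplicates / total if total else 0.0

      print(f"{duplicate_fraction('sample.bam'):.1%} duplicates")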

      Comment


      • #4
        Just out of curiosity.

        Are these mammalian transcriptomes?
        Do you have any reason to believe the expression levels would be skewed towards a few genes?
        Are you using poly-A enrichment or ribosomal depletion?
        What is the starting amount of RNA?
        How many PCR cycles?

        Comment


        • #5
          Originally posted by bhuv74 View Post
          Thanks, Phillip.

          We use FastQC and the FASTX-Toolkit to look at duplicate reads, and our bioinformatics group also uses an internal program to check before and after mapping. Our assumption is that duplicate reads share the same starting point and match exactly over their full length.

          Bhuv
          "Your assumption"?

          FastQC does not do the assay you describe above. See here. FastQC checks the first 200,000 sequences in a fastq file for duplication of their first 50 bases (sequences longer than 75 bases are truncated to 50). Note -- there is no check at all of the corresponding paired read.
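
          [Editor's note: for concreteness, a rough emulation of the procedure just described, assuming a plain or gzipped FASTQ file. This mimics the behaviour described above, not FastQC's actual implementation, and looks only at one file -- mirroring the point that the paired read is never checked.]

          import gzip
          from collections import Counter

          def fastqc_style_duplication(fastq_path, track_limit=200_000):
              """Duplication estimate over the first track_limit reads,
              truncating sequences longer than 75 bases to their first 50."""
              counts = Counter()
              opener = gzip.open if fastq_path.endswith(".gz") else open
              with opener(fastq_path, "rt") as fh:
                  for i, line in enumerate(fh):
                      if i // 4 >= track_limit:
                          break
                      if i % 4 == 1:  # sequence line of each 4-line record
                          seq = line.strip()
                          if len(seq) > 75:
                              seq = seq[:50]
                          counts[seq] += 1
              total = sum(counts.values())
              return (total - len(counts)) / total if total else 0.0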

          This means that if you have a 2000 nt transcript that is fragmented completely randomly into minimum 100 nt fragments, there are only about 1900 possible starting places -- 3800 if you include the reverse-complement strand. So if your data set collected 7200 read pairs for a given message, roughly 50% of them would have to duplicate a previous read's starting position. That is the minimum, best case.
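
          [Editor's note: the pigeonhole arithmetic above is easy to check with a one-liner.]

          def min_duplicate_fraction(transcript_len, min_frag, reads, both_strands=True):
              """Lower bound on the fraction of reads that must re-use a start position."""
              starts = transcript_len - min_frag + 1
              if both_strands:
                  starts *= 2
              return max(0, reads - starts) / reads

          # 2000 nt transcript, >=100 nt fragments, 7200 read pairs:
          print(min_duplicate_fraction(2000, 100, 7200))  # ~0.47, i.e. about 50%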

          --
          Phillip

          Comment


          • #6
            We always use a large amount of starting material and try to reduce the number of PCR cycles as much as possible. For the TruSeq protocol, we use 1 µg of starting material and 8 PCR cycles of amplification. I am not sure this is a low-complexity issue, because we see this high percentage in all sample types.

            Let me know if any additional information would help.

            Comment


            • #7
              Originally posted by blancha View Post
              Just out of curiosity.

              Are these mammalian transcriptomes?
              Do you have any reason to believe the expression levels would be skewed towards a few genes?
              Are you using poly-A enrichment or ribosomal depletion?
              What is the starting amount of RNA?
              How many PCR cycles?
              1. Yes. Most experiments are mammalian transcriptomes.
              2. In some experiments we found that a few genes accounted for most of the duplicate reads. That could be the explanation, but we see a consistently high duplicate percentage, and that is our worry.
              3. We use ribosomal depletion for all our experiments.
              4. 1 µg for the TruSeq protocol and 20 to 30 ng for the Ovation protocol.
              5. We use 8 PCR cycles for amplification.

              Bhuv

              Comment


              • #8
                Originally posted by bhuv74 View Post
                1. Yes. Most experiments are mammalian transcriptomes.
                2. In some experiments we found that a few genes accounted for most of the duplicate reads. That could be the explanation, but we see a consistently high duplicate percentage, and that is our worry.
                3. We use ribosomal depletion for all our experiments.
                4. 1 µg for the TruSeq protocol and 20 to 30 ng for the Ovation protocol.
                5. We use 8 PCR cycles for amplification.

                Bhuv
                The most important question is how many reads are you using for your analysis?

                Comment


                • #9
                  Did you perform any library QC? Are you certain the ribo-depletion was a success? That could account for a surplus of duplicate reads.

                  Comment


                  • #10
                    See above. Careful what you assume is a "duplicate."

                    There are fragmentation and ligation hotspots that can mimic duplicate reads of an amplified target. You need internal controls for capture efficiency to get accurate quantification. http://journals.plos.org/plosone/art...l.pone.0079120

                    Comment


                    • #11
                      If you are really concerned about PCR duplicates, then you might want to do qPCR prior to PCR amplification. Then you have some feel for the total pool of amplifiable molecules present in the pre-amplification library.

                      But this all probably stems from FastQC's big red X phenomenon. A trip to FastQC's documentation should calm your fears:

                      In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence highly expressed transcripts, and this will potentially create a large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries, although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites, for example, or an unfragmented small RNA library) then the constrained start sites will generate huge duplication levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates.
                      But I think there must be better tools for assessing the amount of PCR duplication in a library.
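
                      [Editor's note: as a sketch of the "random barcoding" idea in the quoted passage -- if each fragment carries a unique molecular identifier (UMI), reads are PCR duplicates only when position AND UMI both match. This assumes the UMI has been moved into the read name (e.g. "READID_ACGTTGCA"), a common but not universal convention, and uses pysam purely for illustration.]

                      import pysam

                      def mark_umi_duplicates(in_bam, out_bam):
                          seen = set()
                          with pysam.AlignmentFile(in_bam, "rb") as src, \
                               pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
                              for read in src:
                                  # Unmapped/secondary/supplementary reads dropped in this sketch.
                                  if read.is_unmapped or read.is_secondary or read.is_supplementary:
                                      continue
                                  umi = read.query_name.rsplit("_", 1)[-1]  # UMI assumed in name
                                  key = (read.reference_id, read.is_reverse,
                                         read.reference_start, umi)
                                  if key in seen:
                                      read.is_duplicate = True  # set the SAM duplicate flag
                                  else:
                                      seen.add(key)
                                  dst.write(read)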

                      Finally, let me just point a finger at Illumina -- guys, this is entirely your fault. You should have a low-concentration library denaturation/neutralization/clustering protocol that works on double-stranded libraries.

                      Think about the assay we are doing here. Once you ligate the adapters on, why do a PCR amplification? What does it gain you? A bunch of headaches, really. But it gets you up to a >2 nM library concentration that can be used in a standard Illumina denaturation/neutralization. You start at 2 nM, then dilute down to 0.02 nM (or lower) to cluster. That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. Near as I can tell, the only reason to do this is so that a buffer can then be used to neutralize, rather than an acid.

                      Okay, also, once amplified you can easily use an Agilent chip to assay your library without worrying about being below the sensitivity threshold.

                      --
                      Phillip

                      Comment


                      • #12
                        Originally posted by bilyl View Post
                        The most important question is how many reads are you using for your analysis?
                        We use ~55 to 60 M reads per sample for the analysis.

                        Comment


                        • #13
                          Originally posted by snetmcom View Post
                          Did you perform any library QC? Are you certain the ribo-depletion was a success? That could account for a surplus of duplicate reads.
                          We check the library size distribution using the Bioanalyzer. The average peak size of the libraries is close to 260 bp. We quantify the libraries using KAPA qPCR.

                          We have never QC'd the samples post-ribo-depletion, since the TruSeq protocol doesn't recommend checking the ribo-depleted samples. I did quantify and QC the post-ribo-depletion samples for the current libraries I am preparing, and the traces look good.

                          Bhuv

                          Comment


                          • #14
                            Originally posted by pmiguel View Post
                            If you are really concerned about PCR duplicates, then you might want to do qPCR prior to PCR amplification. Then you have some feel for the total pool of amplifiable molecules present in the pre-amplification library.

                            But this all probably stems from FastQC's big red X phenomenon. A trip to FastQC's documentation should calm your fears:



                            But I think there must be better tools for assessing the amount of PCR duplication in a library.

                            Finally, let me just point a finger at Illumina -- guys, this is entirely your fault. You should have a low-concentration library denaturation/neutralization/clustering protocol that works on double-stranded libraries.

                            Think about the assay we are doing here. Once you ligate the adapters on, why do a PCR amplification? What does it gain you? A bunch of headaches, really. But it gets you up to a >2 nM library concentration that can be used in a standard Illumina denaturation/neutralization. You start at 2 nM, then dilute down to 0.02 nM (or lower) to cluster. That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. Near as I can tell, the only reason to do this is so that a buffer can then be used to neutralize, rather than an acid.

                            Okay, also, once amplified you can easily use an Agilent chip to assay your library without worrying about being below the sensitivity threshold.

                            --
                            Phillip
                            I don't think your comment about the PCR library is quite right.

                            Denaturing by NaOH and neutralizing with buffer doesn't necessitate pumping up the DNA concentration. Let's use the MiSeq as an example. You start with 5 µl of a 2 nM library, which is 0.01 picomoles of DNA. After denaturation with NaOH and dilution, you take 600 µl (0.006 pmol) for sequencing (assuming negligible PhiX). So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.
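
                            [Editor's note: running those numbers -- the 10 pM loading concentration below is implied by 0.006 pmol in 600 µl, not stated in the post. The ratio comes out at about 1.7x, within the ~2-3x claimed.]

                            start_pmol = 2e-9 * 5e-6 * 1e12       # 2 nM x 5 ul = 0.010 pmol
                            loaded_pmol = 10e-12 * 600e-6 * 1e12  # 10 pM x 600 ul = 0.006 pmol
                            print(start_pmol, loaded_pmol, start_pmol / loaded_pmol)  # ~1.7x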

                            With regard to the actual denaturation protocol, I'm certain something like denaturing a library in hot formamide and immediately diluting to 1 ml with cold buffer would work too. You'd have negligible amounts of formamide left over, but I'm not sure what the advantage would be compared to NaOH.

                            Comment


                            • #15
                              Originally posted by bilyl View Post
                              I don't think your comment about the PCR library is quite right.

                              Denaturing by NaOH and neutralizing with buffer doesn't necessitate pumping up the DNA concentration. Let's use the MiSeq as an example. You start with 5 µl of a 2 nM library, which is 0.01 picomoles of DNA. After denaturation with NaOH and dilution, you take 600 µl (0.006 pmol) for sequencing (assuming negligible PhiX). So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.

                              With regard to the actual denaturation protocol, I'm certain something like denaturing a library in hot formamide and immediately diluting to 1 ml with cold buffer would work too. You'd have negligible amounts of formamide left over, but I'm not sure what the advantage would be compared to NaOH.
                              Yeah, I meant for the cBot/HiSeq.

                              The MiSeq has a much poorer yield of clusters per molecule of library added to the cassette than the HiSeq. 2 nM is roughly 1.2 billion amplicons/µl, so if we start with 5 µl we are at 6 billion library amplicons. How many clusters do you get per v2 MiSeq run? 15 million? That corresponds to about 0.25% of the library amplicons yielding clusters. If you only count the 0.6 ml of, let's say, 10 pM library that gets loaded into the cassette, then 3.6 billion amplicons are loaded, which gives you a 0.4% yield.

                              The cBot actually does a better job yield-wise: 120 µl of a 15 pM library (just over 1 billion library molecules) gives us around 40 million clusters in the lane. 4% yield.

                              Anyway, what I want is to be able to start with 20 µl of my library (unamplified) at 50 pM or so and be able to cluster that to a reasonable density. That is 600 million library molecules -- about 20x more than the number of clusters I am going to get, so a 5% yield, near what a cBot can deliver. To do that I could add an equal volume of NaOH, bringing it to 40 µl. Then whatever is required to neutralize that has to fit in 80 µl or less, since 120 µl is what I load into a lane for a cBot run. Is that too much to ask?
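
                              [Editor's note: a quick check of the clusters-per-molecule yields quoted in this exchange. The ~30 M cluster figure in the last line is inferred from "20x more than the number of clusters", not stated directly.]

                              AVOGADRO = 6.022e23  # molecules per mole

                              def molecules(molar, litres):
                                  return molar * litres * AVOGADRO

                              miseq_in   = molecules(2e-9, 5e-6)      # ~6.0e9 into denaturation
                              miseq_load = molecules(10e-12, 600e-6)  # ~3.6e9 loaded in cassette
                              cbot_load  = molecules(15e-12, 120e-6)  # ~1.1e9 loaded per lane
                              wish_load  = molecules(50e-12, 20e-6)   # ~6.0e8 unamplified, as proposed

                              print(f"MiSeq v2: {15e6/miseq_in:.2%} of input, {15e6/miseq_load:.2%} of loaded")
                              print(f"cBot lane: {40e6/cbot_load:.2%} of loaded")
                              print(f"Proposed: {30e6/wish_load:.2%} assuming ~30 M clusters")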

                              --
                              Phillip
                              Last edited by pmiguel; 06-05-2014, 04:53 AM.

                              Comment
