**pmiguel** · 06-02-2014, 06:52 AM

Originally posted by bhuv74

Hi,

We are consistently seeing 65% to 85% duplicate reads in our RNA seq experiments. We use the Truseq and Nugen protocols for the library prep. We have done many RNA seq experiments (both single-read and paired-end) with both the Truseq and Nugen protocols and consistently see a high duplicate %. I know there have been many discussions about the duplicate reads issue. I would like to know whether this is a common issue that many are facing with RNA seq experiments. I would appreciate any suggestions on whether modifications to the library prep would improve our results.

Thanks in advance.

"Duplicate" in what sense? Sharing the same start point for one read, both reads, something else? What is your assay for identifying a "duplicate" read?

--
Phillip

**bhuv74** · 06-02-2014, 07:44 AM

Thanks Philip.

We use the FastQC and FastX tools to look at the duplicate reads, and our bioinformatics group uses an internal program to check before and after mapping. Our assumption is that duplicate reads share the same starting point and are exact matches of each other.

Bhuv

**blancha** · 06-02-2014, 08:18 AM

Just out of curiosity.

Are these mammalian transcriptomes?
Do you have any reason to believe the expression levels would be skewed towards a few genes?
Are you using poly-A enrichment or ribosomal depletion?
What is the starting amount of RNA?
How many PCR cycles?

**pmiguel** · 06-02-2014, 08:18 AM

Originally posted by bhuv74

Thanks Philip.

We use the FastQC and FastX tools to look at the duplicate reads, and our bioinformatics group uses an internal program to check before and after mapping. Our assumption is that duplicate reads share the same starting point and are exact matches of each other.

Bhuv

"Your assumption"?

FastQC does not do the assay you describe above. See here. FastQC checks the first 200,000 sequences in a fastq file for duplication of their first 50 bases (sequences longer than 75 bases are truncated to 50). Note -- there is no check at all of the corresponding paired read.

This means that if you have a 2000 nt transcript that is fragmented completely randomly into fragments of at least 100 nt, there are only 1900 possible starting places, or 3800 if you include the reverse-complement strand. So if your data set collected 7600 read pairs for a given message, at least 50% of them would have to duplicate a previous read's starting position. That is the minimum, best case.
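The counting argument above can be checked with a short sketch. It assumes, as the post does, that a 2000 nt transcript with 100 nt minimum fragments offers 1900 start positions per strand, and it counts a read as a duplicate whenever its start position and strand have already been seen. The function name and parameters are made up for illustration, and this is a best-case lower bound, not what FastQC computes.

```python
def min_duplicate_fraction(transcript_len, min_fragment_len, n_reads, both_strands=True):
    """Lower bound on the fraction of reads that must reuse a start position."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    # Start-site count as used in the post: 2000 - 100 = 1900 per strand.
    start_sites = transcript_len - min_fragment_len
    if both_strands:
        start_sites *= 2
    # Pigeonhole: every read beyond the number of distinct sites is a repeat.
    forced_duplicates = max(0, n_reads - start_sites)
    return forced_duplicates / n_reads


for n in (3800, 7200, 7600, 38000):
    print(n, round(min_duplicate_fraction(2000, 100, n), 3))
```

With 3800 start sites the bound is 0% at 3,800 reads, about 47% at 7,200, exactly 50% at 7,600, and 90% at 38,000. Real data will sit above this bound, because random sampling revisits some start sites long before all of them have been used.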

--
Phillip

**bhuv74** · 06-02-2014, 08:23 AM

We always use a high amount of starting material and try to keep the number of PCR cycles as low as possible. For the Truseq protocol, we use 1ug as starting material and 8 PCR cycles to amplify. I am not sure if this is related to a low-complexity issue, because we see this high % in all sample types.

Let me know if you need any additional information to suggest.

**bhuv74** · 06-02-2014, 09:16 AM

Originally posted by blancha

Just out of curiosity.

Are these mammalian transcriptomes?
Do you have any reason to believe the expression levels would be skewed towards a few genes?
Are you using poly-A enrichment or ribosomal depletion?
What is the starting amount of RNA?
How many PCR cycles?

1. Yes. Most experiments are mammalian transcriptomes.
2. In some experiments we found that several genes accounted for most of the duplicate reads. It could be possible, but we are seeing a consistently high duplicate %; that's our worry.
3. We use ribosomal depletion for all our experiments.
4. 1 ug for truseq protocol and 20 to 30ng for Ovation protocol.
5. We use 8 PCR cycles for amplification.

Bhuv

**bilyl** · 06-02-2014, 04:48 PM

Originally posted by bhuv74

1. Yes. Most experiments are mammalian transcriptomes.
2. In some experiments we found that several genes accounted for most of the duplicate reads. It could be possible, but we are seeing a consistently high duplicate %; that's our worry.
3. We use ribosomal depletion for all our experiments.
4. 1 ug for truseq protocol and 20 to 30ng for Ovation protocol.
5. We use 8 PCR cycles for amplification.

Bhuv

The most important question is how many reads are you using for your analysis?

**snetmcom** · 06-02-2014, 07:18 PM

Did you perform any library QC? Are you certain the ribo-depletion was a success? A failed depletion could account for a surplus of duplicate reads.

**thomasblomquist** · 06-02-2014, 09:11 PM

http://m.pnas.org/content/111/5/1891.full

See above. Careful what you assume is a "duplicate."

There are fragmentation and ligation hotspots that may mimic duplicate reads of an amplified target. Need internal controls to control for efficiency of capture for accurate quantification. http://journals.plos.org/plosone/art...l.pone.0079120

**pmiguel** · 06-03-2014, 04:30 AM

If you are really concerned about PCR duplicates, then you might want to do qPCR prior to PCR amplification. Then you have some feel for the total pool of pre-amp amplifiable library molecules present in the library.

But this all probably stems from FastQC's big red X phenomenon. A trip to FastQC's documentation should calm your fears:

In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence high expressed transcripts, and this will potentially create large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites for example, or an unfragmented small RNA library) then the constrained start sites will generate huge duplication levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates.

But I think there must be better tools for assessing the amount of PCR duplication in a library.
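One way to picture the distinction the FastQC documentation draws is to count duplicates by mapped position, with and without a random barcode. The sketch below is only an illustration: the read tuples, the field layout (chromosome, start, strand, barcode) and the function name are hypothetical, and real tools work from aligned records rather than hand-written tuples.

```python
def duplicate_fraction(keys):
    """Fraction of items whose key repeats one that was already seen."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1 - len(set(keys)) / len(keys)


# Hypothetical aligned reads: (chromosome, start, strand, random barcode)
reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # same position and barcode: likely a PCR copy
    ("chr1", 100, "+", "TTAG"),  # same position, new barcode: a distinct molecule
    ("chr1", 250, "-", "GGCA"),
]

# Position-only keys cannot tell over-sequencing from PCR duplication.
by_position = duplicate_fraction(r[:3] for r in reads)
# Adding the barcode to the key separates the two cases.
by_position_and_barcode = duplicate_fraction(reads)
print(by_position, by_position_and_barcode)
```

In this toy set, half the reads look like duplicates by position alone, but only a quarter remain duplicates once the barcode is part of the key. Without barcodes, position-based counts on a highly expressed transcript will overstate PCR duplication.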

Finally, let me just point a finger at Illumina -- guys this is entirely your fault. You should have a low concentration library denaturation/neutralization/clustering protocol that works on double stranded libraries.

Think about the assay we are doing here. Once you ligate the adapters on, why do a PCR amplification? What does it gain you? A bunch of headaches, really. But it gets you up to a >2nM library concentration that can be used in a standard Illumina denaturation/neutralization. You start at 2nM, then dilute down to 0.02nM (or lower) to cluster. That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. Near as I can tell the only reason to do this is so that a buffer can then be used to neutralize, rather than an acid.

Okay, also, once amplified you can easily use an Agilent chip to assay your library without worrying about being below the sensitivity threshold.

--
Phillip

**bhuv74** · 06-03-2014, 07:12 AM

Originally posted by bilyl

The most important question is how many reads are you using for your analysis?

We use ~55 to 60 M reads per sample for the analysis.

**bhuv74** · 06-03-2014, 07:19 AM

Originally posted by snetmcom

Did you perform any library QC? Are you certain the ribo-depletion was a success? A failed depletion could account for a surplus of duplicate reads.

We check the library size distribution using a Bioanalyzer. The average peak size of the libraries is close to 260 bp. We quantify the libraries using Kapa qPCR.

We never QC'd the samples post ribo-depletion, since the Truseq protocol doesn't recommend checking the ribo-depleted samples. I quantified and QC'd the post-ribo-depletion samples for the libraries I am currently preparing. The traces look good.

Bhuv

**bilyl** · 06-04-2014, 11:05 PM

Originally posted by pmiguel

If you are really concerned about PCR duplicates, then you might want to do qPCR prior to PCR amplification. Then you have some feel for the total pool of pre-amp amplifiable library molecules present in the library.

But this all probably stems from FastQC's big red X phenomenon. A trip to FastQC's documentation should calm your fears:

But I think there must be better tools for assessing the amount of PCR duplication in a library.

Finally, let me just point a finger at Illumina -- guys this is entirely your fault. You should have a low concentration library denaturation/neutralization/clustering protocol that works on double stranded libraries.

Think about the assay we are doing here. Once you ligate the adapters on, why do a PCR amplification? What does it gain you? A bunch of headaches, really. But it gets you up to a >2nM library concentration that can be used in a standard Illumina denaturation/neutralization. You start at 2nM, then dilute down to 0.02nM (or lower) to cluster. That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. Near as I can tell the only reason to do this is so that a buffer can then be used to neutralize, rather than an acid.

Okay, also, once amplified you can easily use an Agilent chip to assay your library without worrying about being below the sensitivity threshold.

--
Phillip

I don't think your comment about the PCR library is quite right.

Denaturing by NaOH and neutralizing with buffer doesn't necessitate pumping up the DNA concentration. Let's use the MiSeq as an example. You start with 5ul of a 2nM library, which is 0.01 picomoles of DNA. After denaturation with NaOH and dilution, you take 600ul (0.006 pmol) for sequencing (assuming negligible PhiX). So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.

With regards to the actual denaturation protocol, I'm certain something like denaturing a library in hot formamide and immediately diluting to 1ml with cold buffer will work too. You'd have negligible amounts of formamide left over, but I'm not sure what the advantage of this would be compared to NaOH.

**pmiguel** · 06-05-2014, 04:49 AM

Originally posted by bilyl

I don't think your comment about the PCR library is quite right.

Denaturing by NaOH and neutralizing with buffer doesn't necessitate pumping up the DNA concentration. Let's use the MiSeq as an example. You start with 5ul of a 2nM library, which is 0.01 picomoles of DNA. After denaturation with NaOH and dilution, you take 600ul (0.006 pmol) for sequencing (assuming negligible PhiX). So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.

With regards to the actual denaturation protocol, I'm certain something like denaturing a library in hot formamide and immediately diluting to 1ml with cold buffer will work too. You'd have negligible amounts of formamide left over, but I'm not sure what the advantage of this would be compared to NaOH.

Yeah, I meant for the cBot/HiSeq.

The MiSeq has a much poorer yield of clusters per molecule of library added to the cassette than the HiSeq. 2nM is roughly 1.2 billion amplicons/ul. So if we start with 5 ul we are at 6 billion library amplicons. How many clusters do you get per v2 MiSeq run? 15 million? That corresponds to about 0.25% of the library amplicons yielding clusters. If you only count the 0.6 ml of, let's say, 10pM library that gets loaded into the cassette, then 3.6 billion amplicons are loaded. So that gives you a 0.4% yield.

The cBot actually does a better job yield-wise. 120 ul of 15pM library (just over 1 billion library molecules) gives us around 40 million clusters in the lane. 4% yield.

Anyway, what I want is to be able to start with 20 ul of my library (unamplified) at 50pM or so and be able to cluster that to a reasonable density. That is 600 million library molecules -- like 20x more than the number of clusters I am going to get. That is a 5% yield, so near what a cBot can deliver. To do that I could add an equal volume of NaOH, for 40 ul total. Then whatever is required to neutralize that has to fit in 80 ul or less, because 120 ul is what I load into a lane for a cBot run. Is that too much to ask?
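The molecule counts and yields in this post can be reproduced with a few lines of arithmetic. This is a rough sketch: the cluster counts are the ballpark figures quoted above (15 million for a v2 MiSeq run, 40 million for a cBot lane), the 30 million for the unamplified case is simply 600 million divided by 20, and the function names are made up for illustration.

```python
AVOGADRO = 6.022e23  # molecules per mole


def molecules(volume_ul, conc_pM):
    """Library molecules in volume_ul microlitres at conc_pM picomolar."""
    litres = volume_ul * 1e-6
    molar = conc_pM * 1e-12
    return litres * molar * AVOGADRO


def cluster_yield(clusters, volume_ul, conc_pM):
    """Fraction of input molecules that end up as clusters."""
    return clusters / molecules(volume_ul, conc_pM)


# (clusters, volume in ul, concentration in pM); cluster counts are rough guesses
cases = {
    "MiSeq, 5 ul of 2 nM stock": (15e6, 5, 2000),
    "MiSeq, 600 ul loaded at 10 pM": (15e6, 600, 10),
    "cBot lane, 120 ul at 15 pM": (40e6, 120, 15),
    "unamplified, 20 ul at 50 pM": (30e6, 20, 50),
}
for label, (clusters, vol, conc) in cases.items():
    n = molecules(vol, conc)
    print(f"{label}: {n:.2e} molecules, yield {cluster_yield(clusters, vol, conc):.2%}")
```

The results agree with the rounded figures in the post: roughly 6 billion and 3.6 billion molecules for the two MiSeq cases, just over 1 billion for the cBot lane, and about 600 million for the unamplified library.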

--
Phillip

Duplicate reads issue
