06-02-2014, 06:45 AM   #1
bhuv74 (Junior Member, Florida)

Duplicate reads issue

Hi,

We are consistently seeing 65% to 85% duplicate reads in our RNA-seq experiments. We use the TruSeq and NuGEN protocols for library prep and have run many experiments (both single-read and paired-end) with both, with consistently high duplicate percentages. I know there have been many discussions about the duplicate reads issue. Is this a common issue in RNA-seq experiments? I would appreciate any suggestions for library-prep modifications that would improve our results.

Thanks in advance.
06-02-2014, 06:52 AM   #2
pmiguel (Senior Member, Purdue University, West Lafayette, Indiana)

Quote: Originally Posted by bhuv74
We are consistently seeing 65% to 85% duplicate reads in our RNA-seq experiments. [...]
"Duplicate" in what sense? Sharing the same start point for one read, both reads, something else? What is your assay for identifying a "duplicate" read?

--
Phillip
06-02-2014, 07:44 AM   #3
bhuv74 (Junior Member, Florida)

Thanks, Phillip.

We use FastQC and the FASTX-Toolkit to look at duplicate reads, and our bioinformatics group uses an internal program to check before and after mapping. Our assumption is that duplicate reads share the same starting point and match exactly along their full length.

Bhuv
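
A minimal sketch of the criterion Bhuv describes -- counting a read pair as a duplicate when both mates share alignment start positions. pysam and the BAM filename are assumptions here; real pipelines should rely on dedicated tools such as Picard MarkDuplicates or samtools markdup, which handle orientation, clipping, and optical duplicates properly:

Code:
# Count read pairs sharing both mates' start coordinates (a rough
# approximation of the "same starting point" duplicate criterion).
# "sample.bam" is a hypothetical coordinate-sorted BAM file.
from collections import Counter

import pysam

pair_starts = Counter()
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam:
        # One mate per fragment; skip secondary/supplementary alignments.
        if (not read.is_proper_pair or not read.is_read1
                or read.is_secondary or read.is_supplementary):
            continue
        key = (read.reference_id, read.reference_start,
               read.next_reference_start, read.is_reverse)
        pair_starts[key] += 1

total = sum(pair_starts.values())
print(f"fragments: {total}, unique start combinations: {len(pair_starts)}, "
      f"duplicate fraction: {1 - len(pair_starts) / total:.1%}")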
06-02-2014, 08:18 AM   #4
blancha (Senior Member, Montreal)

Just out of curiosity.

Are these mammalian transcriptomes?
Do you have any reason to believe the expression levels would be skewed towards a few genes?
Are you using poly-A enrichment or ribosomal depletion?
What is the starting amount of RNA?
How many PCR cycles?
06-02-2014, 08:18 AM   #5
pmiguel (Senior Member, Purdue University, West Lafayette, Indiana)

Quote: Originally Posted by bhuv74
[...] Our assumption is that duplicate reads share the same starting point and match exactly along their full length.
"Your assumption"?

FastQC does not do the assay you describe above. See the FastQC documentation: it checks the first 200,000 sequences in a fastq file for duplication of their first 50 bases (sequences longer than 75 bases are truncated to 50). Note -- there is no check at all of the corresponding paired read.

This means that if you have a 2000 nt transcript that is fragmented completely randomly into fragments of at least 100 nt, there are only about 1900 possible starting places -- 3800 if you include the reverse-complement strand. So if your data set collected 7200 read pairs for a given message, at least ~3400 of them (roughly 50%) would have to duplicate a previous read's starting position. That is the minimum, best case.

--
Phillip
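
Phillip's arithmetic, written out so it can be reproduced (the numbers are the ones from his example; purely illustrative):

Code:
# Minimum duplicate fraction forced by a transcript's limited start sites.
transcript_len = 2000   # nt
min_fragment = 100      # nt, minimum fragment size
read_pairs = 7200       # read pairs collected for this transcript

starts = transcript_len - min_fragment + 1   # 1901 start sites, i.e. ~1900
starts_both_strands = 2 * starts             # ~3800 with reverse complement
forced = max(0, read_pairs - starts_both_strands)
print(f"{forced} of {read_pairs} reads ({forced / read_pairs:.0%}) "
      f"must reuse an earlier start site")   # ~47%, i.e. roughly half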
06-02-2014, 08:23 AM   #6
bhuv74 (Junior Member, Florida)

We always use a high amount of starting material and try to keep the number of PCR cycles as low as possible. For the TruSeq protocol, we use 1 µg of starting material and 8 PCR cycles of amplification. I am not sure this is a low-complexity issue, because we see this high percentage in all sample types.

Let me know if any additional information would help.
06-02-2014, 09:16 AM   #7
bhuv74 (Junior Member, Florida)

Quote: Originally Posted by blancha
Just out of curiosity.

Are these mammalian transcriptomes?
Do you have any reason to believe the expression levels would be skewed towards a few genes?
Are you using poly-A enrichment or ribosomal depletion?
What is the starting amount of RNA?
How many PCR cycles?
1. Yes, most experiments are mammalian transcriptomes.
2. In some experiments we found that a few genes consumed most of the duplicate reads, so that could be the explanation, but the duplicate percentages are consistently high, and that is our worry.
3. We use ribosomal depletion for all our experiments.
4. 1 µg for the TruSeq protocol and 20 to 30 ng for the Ovation protocol.
5. We use 8 PCR cycles for amplification.

Bhuv
06-02-2014, 04:48 PM   #8
bilyl (Member, USA)

Quote: Originally Posted by bhuv74
1. Yes, most experiments are mammalian transcriptomes. [...] 5. We use 8 PCR cycles for amplification.
The most important question is: how many reads are you using for your analysis?
06-02-2014, 07:18 PM   #9
snetmcom (Senior Member, USA)

Did you perform any library QC? Are you certain the ribosomal depletion was a success? That could account for a surplus of duplicate reads.
06-02-2014, 09:11 PM   #10
thomasblomquist (Member, Ohio)

http://m.pnas.org/content/111/5/1891.full

See above. Be careful what you assume is a "duplicate."

There are fragmentation and ligation hotspots that can mimic duplicate reads of an amplified target. You need internal controls for capture efficiency to get accurate quantification: http://journals.plos.org/plosone/art...l.pone.0079120
06-03-2014, 04:30 AM   #11
pmiguel (Senior Member, Purdue University, West Lafayette, Indiana)

If you are really concerned about PCR duplicates, you might want to do qPCR prior to PCR amplification. Then you have some feel for the total pool of amplifiable library molecules present before amplification.

But this all probably stems from FastQC's big red X phenomenon. A trip to FastQC's documentation should calm your fears:

Quote:
In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence highly expressed transcripts, and this will potentially create a large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries, although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites, for example, or an unfragmented small RNA library) then the constrained start sites will generate huge duplication levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates.
But I think there must be better tools for assessing the amount of PCR duplication in a library.
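
One such approach (an aside, not something the posters here used): fit the observed duplication to the Lander-Waterman-style model unique = C(1 - e^(-N/C)) and solve for the library size C, which is essentially what Picard's EstimateLibraryComplexity reports. A sketch with hypothetical read counts:

Code:
import math

def estimate_library_size(total_pairs: float, unique_pairs: float) -> float:
    """Solve unique = C * (1 - exp(-total / C)) for C by bisection."""
    lo, hi = unique_pairs, unique_pairs * 1e6
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * (1 - math.exp(-total_pairs / mid)) < unique_pairs:
            lo = mid   # too few uniques at this size: library must be bigger
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical figures in the range reported in this thread:
# 60 M read pairs with 75% marked duplicate leaves 15 M unique pairs.
print(f"estimated library size: "
      f"{estimate_library_size(60e6, 15e6):.3g} distinct molecules")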

Finally, let me just point a finger at Illumina -- guys, this is entirely your fault. You should have a low-concentration library denaturation/neutralization/clustering protocol that works on double-stranded libraries.

Think about the assay we are doing here. Once you ligate the adapters on, why do a PCR amplification? What does it gain you? A bunch of headaches, really. But it gets you up to a >2nM library concentration that can be used in a standard Illumina denaturation/neutralization: you start at 2nM, then dilute down to 0.02nM (or lower) to cluster. That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. As near as I can tell, the only reason to do this is so that a buffer, rather than an acid, can be used to neutralize.

Okay, also, once amplified, you can easily use an Agilent chip to assay your library without worrying about being below the sensitivity threshold.

--
Phillip
06-03-2014, 07:12 AM   #12
bhuv74 (Junior Member, Florida)

Quote: Originally Posted by bilyl
The most important question is: how many reads are you using for your analysis?
We use ~55 to 60 M reads per sample for the analysis.
06-03-2014, 07:19 AM   #13
bhuv74 (Junior Member, Florida)

Quote: Originally Posted by snetmcom
Did you perform any library QC? Are you certain the ribosomal depletion was a success? That could account for a surplus of duplicate reads.
We check the library size distribution on a Bioanalyzer; the average peak size of the libraries is close to 260 bp. We quantify the libraries using KAPA qPCR.

We have never QC'd the samples post-ribo-depletion, since the TruSeq protocol doesn't recommend checking ribo-depleted samples. For the libraries I am currently preparing, I did quantify and QC the ribo-depleted samples, and the traces look good.

Bhuv
06-04-2014, 11:05 PM   #14
bilyl (Member, USA)

Quote: Originally Posted by pmiguel
[...] That means we are using PCR to pump up the concentration of libraries at least 25x higher than needed for clustering. [...]
I don't think your comment about the PCR library is quite right.

Denaturing with NaOH and neutralizing with buffer doesn't necessitate pumping up the DNA concentration. Let's use the MiSeq as an example. You start with 5 µl of a 2nM library, which is 0.01 picomoles of DNA. After denaturation with NaOH and dilution, you take 600 µl (0.006 pmol) for sequencing (assuming negligible PhiX). So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.

As for the actual denaturation protocol, I'm fairly certain something like denaturing a library in hot formamide and immediately diluting to 1 ml with cold buffer would work too. You'd have negligible amounts of formamide left over, but I'm not sure what the advantage would be compared to NaOH.
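
Written out, bilyl's bookkeeping looks like this (volumes and concentrations are the ones in the post; illustrative only):

Code:
def picomoles(volume_ul: float, conc_nM: float) -> float:
    # uL * nM gives femtomoles; divide by 1000 for picomoles.
    return volume_ul * conc_nM / 1000.0

denatured = picomoles(5, 2)      # 5 uL of a 2 nM library  -> 0.010 pmol
loaded = picomoles(600, 0.010)   # 600 uL at 10 pM loaded  -> 0.006 pmol
print(f"input: {denatured:.3f} pmol, sequenced: {loaded:.3f} pmol, "
      f"overage: {denatured / loaded:.1f}x")  # ~1.7x, i.e. only ~2x needed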
06-05-2014, 04:49 AM   #15
pmiguel (Senior Member, Purdue University, West Lafayette, Indiana)

Quote: Originally Posted by bilyl
[...] So you really only need ~2-3x the amount of DNA that goes into the sequencer, not >25x.
Yeah, I meant for the cBot/HiSeq.

The MiSeq has a much poorer yield of clusters per molecule of library added to the cassette than the HiSeq. 2nM is roughly 1.2 billion amplicons/µl, so if we start with 5 µl we are at 6 billion library amplicons. How many clusters do you get per v2 MiSeq run? 15 million? That corresponds to about 0.25% of the library amplicons yielding clusters. If you only count the 0.6 ml of, let's say, 10pM library that gets loaded into the cassette, then 3.6 billion amplicons are loaded, which gives you a 0.4% yield.

The cBot actually does a better job yield-wise: 120 µl of 15pM library (just over 1 billion library molecules) gives us around 40 million clusters in the lane, a 4% yield.

Anyway, what I want is to be able to start with 20 µl of my unamplified library at 50pM or so and cluster that to a reasonable density. That is 600 million library molecules -- about 20x more than the number of clusters I am going to get, so a 5% yield, near what the cBot can deliver. To do that I could add an equal volume of NaOH, bringing it to 40 µl. Then whatever is required to neutralize that has to fit in 80 µl or less, since 120 µl is what I load into a lane for a cBot run. Is that too much to ask?

--
Phillip
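
The yield figures above, written out the same way (cluster counts are the round numbers from the post; illustrative only):

Code:
AVOGADRO = 6.022e23

def molecules(volume_ul: float, conc_pM: float) -> float:
    # uL * pM -> moles (1e-6 L * 1e-12 mol/L) -> molecules
    return volume_ul * 1e-6 * conc_pM * 1e-12 * AVOGADRO

print(f"MiSeq:    {15e6 / molecules(600, 10):.2%}")  # 600 uL at 10 pM, ~0.4%
print(f"cBot:     {40e6 / molecules(120, 15):.2%}")  # 120 uL at 15 pM, ~3.7%
print(f"proposed: {30e6 / molecules(20, 50):.2%}")   # 20 uL at 50 pM,  ~5%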
