Old 09-14-2010, 01:02 PM   #1
foxyg
Member
 
Location: US

Join Date: May 2010
Posts: 54
Removing duplicates: is it really necessary?

I know samtools and Picard can remove duplicates, but is it really necessary? A duplicate could be a PCR artifact or two genuinely independent reads of identical fragments; there is no way to tell which.

Also, how do you define a duplicate? Why do both samtools and Picard take BAM files as input? In theory, duplicates could be removed from the raw reads already. Is it because these tools only check the aligned location, not the actual read sequence?
foxyg is offline   Reply With Quote
Old 09-14-2010, 01:27 PM   #2
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 116

"A duplicate could be PCR effect or reading same fragment twice, there is no way to tell."

Yes, this is the heart of the matter... The degree to which you expect duplicate fragments is highly dependent on the depth of the library and the type of library being sequenced (whole genome, exome, transcriptome, ChIP-Seq, etc.). The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. Furthermore, you gain the computational benefit of reducing the number of reads to be processed in downstream steps. However, removing duplicates comes at the cost of setting a hard cap (and a rather low one at that) on the dynamic range of measurement you are going to produce. This has been discussed in various places on this forum (not a complete list):

Redundant reads are removed from ChIP-seq, what about RNA-seq

Duplicate reads removal

Threshold for duplicate removal

Heavy read stacking on the solexa platform

Should identical reads be reduced to a single count?

Removing duplicate reads from multigig .csfasta

Source of duplicate reads and possible ways of reducing
malachig is offline   Reply With Quote
Old 09-14-2010, 02:49 PM   #3
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 116

"Also how do you define a duplicate?"

As you suggest, this can be done in at least two ways. Both have caveats.

If you define duplicates as reads with identical sequence you may underestimate the presence of duplicates in your library. Read errors present in your library will result in reads with apparently different sequences that could have actually come from PCR amplifications of the same cDNA fragment. Accurate removal of duplicates is therefore dependent on a low error rate in your library. One advantage of this approach is that it reduces the pool of reads before mapping and thereby reduces alignment time.
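To make the first option concrete, sequence-level deduplication can be done before alignment with nothing more than a hash of the read sequences. This is only a minimal sketch of the idea, not any particular tool's method, and the file names are made up:

Code:
# Minimal sketch of sequence-based duplicate removal on a FASTQ file.
# Keeps the first read seen for each distinct sequence; note that a single
# sequencing error makes two PCR copies look "different" to this approach.
seen = set()
kept = dropped = 0
with open("reads.fastq") as fin, open("reads.dedup.fastq", "w") as fout:
    while True:
        header = fin.readline()
        if not header:
            break
        seq = fin.readline()
        plus = fin.readline()
        qual = fin.readline()
        if seq not in seen:
            seen.add(seq)
            fout.write(header + seq + plus + qual)
            kept += 1
        else:
            dropped += 1
print(kept, "kept,", dropped, "removed as duplicates")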

Using the mapping positions of the reads avoids (mostly) the influence of the library's sequencing error rate described above. However, two reads that have identical mapping positions do not necessarily represent the same underlying cDNA fragment, especially if you consider only the outer mapping coordinates (a common strategy). Reads derived from transcripts and then mapped to the genome might appear to have the same outer coordinates but differ internally (corresponding to alternative isoforms with different exon boundaries, skipped exons, etc.). Furthermore, two reads may have completely identical mapping coordinates but still correspond to distinct cDNA fragments. Imagine two identically mapping reads with a single-base difference that corresponds to a polymorphism: cDNA fragments may be derived from both the maternal and paternal alleles of a diploid genome. These are distinct fragments containing biologically pertinent information, and if you removed duplicates purely on the basis of mapping coordinates, you might lose some of it.
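For comparison, a bare-bones coordinate-based version, keyed on chromosome, 5' position and strand, might look like the sketch below. It only illustrates the idea and is not how samtools rmdup or Picard actually behave (they also use mate coordinates and base qualities); pysam and the file names are assumptions:

Code:
import pysam  # assumes pysam is installed

# Keep one read per (chromosome, 5' coordinate, strand); everything else
# is treated as a duplicate. Real tools also use the mate's coordinates
# and keep the copy with the best base qualities.
seen = set()
with pysam.AlignmentFile("aln.sorted.bam", "rb") as bam, \
     pysam.AlignmentFile("aln.dedup.bam", "wb", template=bam) as out:
    for read in bam:
        if read.is_unmapped:
            out.write(read)
            continue
        # 5' end: reference_start for forward reads, reference_end for reverse
        five_prime = read.reference_end if read.is_reverse else read.reference_start
        key = (read.reference_id, five_prime, read.is_reverse)
        if key not in seen:
            seen.add(key)
            out.write(read)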

As you can see there is no simple answer to this question. Identifying true duplicate fragments is not black-and-white. This perhaps argues for not trying to remove them in the first place. But on the other hand, the presence of duplicates that do represent PCR amplification bias can cause problems for downstream analysis...

In general, having longer reads, and paired-end rather than single-end reads, helps in correctly identifying duplicates. But there is still no way to tell for sure whether a particular read represents amplification bias or not. In library types where you expect approximately uniform coverage of the target space (e.g. a whole-genome library aligned to a reference genome) you can watch for areas of unusually heavy read stacking and correct for them. But even with this you have to be careful, because your genome is not the reference genome, and genuine copy-number variations can result in biologically meaningful deviations from uniform coverage that you may actually be interested in...

The type of experiment and analysis being conducted may be influential in deciding how to approach the issue. If you provide more details of your particular library type and analysis goals, someone may be able to provide specific advice.

In expression analysis, I do not remove duplicates but I do attempt to quantify library complexity for every library and watch out for libraries that seem to be outliers in this regard. You can view an example of this here:
ALEXA-seq (see the figures under: 'summary of library complexity - estimated by tag redundancy per million reads and compared to other libraries')

Ultimately, the best way to deal with duplicates that correspond to PCR amplification bias is to reduce their generation in the first place. Avoid using small amounts of input material (i.e. a molecular bottleneck) and keep amplification to an absolute minimum.
malachig is offline   Reply With Quote
Old 09-14-2010, 04:14 PM   #4
foxyg
Member
 
Location: US

Join Date: May 2010
Posts: 54

I have single-end mitochondrial reads from an Illumina sequencer. The samples are pooled and PCR-amplified using mitochondria-specific primers. However, since the mitochondrial genome is very short, the average coverage ends up over 2000X. I found a ton of duplicates, so I am wondering whether I should remove them or not.
foxyg is offline   Reply With Quote
Old 09-14-2010, 04:21 PM   #5
mrawlins
Member
 
Location: Retirement - Not working with bioinformatics anymore.

Join Date: Apr 2010
Posts: 63

In our RNA-Seq data we tend to get very uneven coverage of expressed genes. Removing duplicates would eliminate much of our signal from those genes, and all our read counts would be dramatically lower. For SNP calling that wouldn't be as much of a problem, but for differential expression analysis it complicates matters. The differential expression statistics can account for some of the variation introduced by PCR duplicates, and we've chosen to live with whatever variation is left over. If our coverage were mostly evenly distributed and/or we were doing SNP calling, we might have made a different decision.
mrawlins is offline   Reply With Quote
Old 09-16-2010, 01:17 AM   #6
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

thanks malachig for a great description of the problem.

I also want to mention something further about the dangers of marking duplicates. I think those who recommend it are being way too cavalier about it.

For example, there was a recent whole-exome paper where they did both single-end and paired-end sequencing. They marked duplicates in both and found ~28% and ~8% duplicates, respectively. In fact, when they treated the paired-end reads as single ends (removing the connection between mates), the 8% jumped back up to 28%. That is a massive amount of data loss. So those who have been throwing away duplicates, particularly from single-end data and based solely on coordinates, are losing a lot of data.

I would worry about the removal of duplicates just as much as the keeping of duplicates.

Removing duplicates does help save time during alignments... but...

When it comes to SNP/indel detection, you might not want to mark duplicates until *after* alignment. The reason is that programs that do local multiple-sequence realignment around indels (e.g. GATK) will not look at reads flagged as duplicates, and the lower read counts might then not trigger the realignment needed for the region. So in my opinion, it would be better to align, realign around indels, and then mark duplicates.
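In other words, roughly this ordering. The sketch below is only an illustration with placeholder file names; the exact GATK/Picard invocations depend on the versions you use, so treat it as a sketch of the step order rather than a tested pipeline:

Code:
import subprocess

# Align first, realign around indels, and only then mark duplicates,
# so the realigner still sees every read.
steps = [
    # 1) alignment (single-end example) plus sorting and indexing
    "bwa aln ref.fa reads.fq > reads.sai",
    "bwa samse ref.fa reads.sai reads.fq | samtools view -bS - > aln.bam",
    "samtools sort aln.bam aln.sorted && samtools index aln.sorted.bam",
    # 2) local realignment around indels (GATK)
    "java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa "
    "-I aln.sorted.bam -o indels.intervals",
    "java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa "
    "-I aln.sorted.bam -targetIntervals indels.intervals -o aln.realigned.bam",
    # 3) mark duplicates last, on the realigned BAM (Picard)
    "java -jar picard.jar MarkDuplicates I=aln.realigned.bam "
    "O=aln.realigned.dedup.bam M=dup_metrics.txt",
]
for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)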

Last edited by NGSfan; 09-16-2010 at 01:19 AM.
NGSfan is offline   Reply With Quote
Old 09-16-2010, 05:26 AM   #7
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,148

Quote:
Originally Posted by NGSfan View Post
For example, there was a recent whole-exome paper where they did both single-end and paired-end sequencing. They marked duplicates in both and found ~28% and ~8% duplicates, respectively. In fact, when they treated the paired-end reads as single ends (removing the connection between mates), the 8% jumped back up to 28%. That is a massive amount of data loss. So those who have been throwing away duplicates, particularly from single-end data and based solely on coordinates, are losing a lot of data.
Even that 8% may be an overestimate of the true PCR duplicates. I recently attended the Illumina Midwest Users Group meeting, and an Illumina scientist presented some data on duplicate identification (it may have been the data you referred to, since the percentages sound about the same). However, they went a step further in distinguishing fragment duplicates from PCR duplicates. They prepared their paired-end libraries using Illumina MID-tag-style adapters, but instead of a finite set of known MID sequences, the adapters were constructed with random bases where the barcode would be. Now for each cluster they had three data points to compare: the alignment positions of reads 1 and 2 on the reference genome, plus the random 6 bp sequence in the MID position. A read would need to match all three of these to be called a PCR duplicate. When they added these random tags, the number of identified duplicates dropped from 8% to ~1%.
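In code, the only thing that changes relative to ordinary coordinate-based dedup is the key used to group reads. A hypothetical sketch (it assumes the random tag has already been parsed out of the read name or a BAM tag; this is not a published tool):

Code:
# Coordinate-only key: collapses distinct molecules that happen to map
# to the same positions.
def coord_key(chrom, pos1, pos2):
    return (chrom, pos1, pos2)

# Coordinate + random-tag key: two fragments are only called PCR
# duplicates if the 6 bp random tag matches as well.
def coord_tag_key(chrom, pos1, pos2, random_tag):
    return (chrom, pos1, pos2, random_tag)

# e.g. ("chr1", 10468, 10712, "ACGTTG") and ("chr1", 10468, 10712, "TTGACA")
# are duplicates under coord_key but distinct molecules under coord_tag_key.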
kmcarr is offline   Reply With Quote
Old 09-16-2010, 05:45 AM   #8
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

wow... kmcarr, what a great trick. Thanks for sharing that. That really answers the question: how many real PCR duplicates am I getting? From that info you could play around with PCR cycle numbers and figure out how many cycles to drop.
NGSfan is offline   Reply With Quote
Old 09-16-2010, 06:54 AM   #9
foxyg
Member
 
Location: US

Join Date: May 2010
Posts: 54

Quote:
Originally Posted by NGSfan View Post

For example, there was a recent whole-exome paper where they did both single-end and paired-end sequencing. They marked duplicates in both and found ~28% and ~8% duplicates, respectively. In fact, when they treated the paired-end reads as single ends (removing the connection between mates), the 8% jumped back up to 28%. That is a massive amount of data loss. So those who have been throwing away duplicates, particularly from single-end data and based solely on coordinates, are losing a lot of data.
Can you point out which paper it is please?
foxyg is offline   Reply With Quote
Old 09-16-2010, 01:49 PM   #10
dfornika
Junior Member
 
Location: Vancouver, BC

Join Date: Aug 2009
Posts: 4

Hi foxyg,

I've done a very similar experiment, but hadn't considered the issue of removing duplicates before. I pooled PCR-amplified mtDNA from ~415 individuals and ran this DNA on two lanes, doing single-end 36 bp reads. I achieved about 6000x coverage with the combined dataset.

I'm interested in what you plan on getting out of your alignments. I want to estimate the allele frequencies of SNPs by counting the relative abundance of each allele at each position in the alignment. My opinion is that with such high coverage, it is inevitable that you will have many reads mapping to the exact same location, and you want to preserve all of these reads if you want to estimate allele frequencies in a pool.
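For what it's worth, per-position base counts (and hence pooled allele-frequency estimates) can be pulled straight from a sorted, indexed BAM. A minimal sketch using pysam; the file name, contig name and depth cutoff are assumptions:

Code:
import pysam
from collections import Counter

# Pooled allele frequencies from per-position base counts on the
# mitochondrial contig of a sorted, indexed BAM.
with pysam.AlignmentFile("mtdna.sorted.bam", "rb") as bam:
    for col in bam.pileup("chrM"):
        counts = Counter()
        for p in col.pileups:
            # skip deletions / reference skips, which carry no query base
            if p.is_del or p.is_refskip:
                continue
            counts[p.alignment.query_sequence[p.query_position]] += 1
        depth = sum(counts.values())
        if depth >= 100:  # arbitrary minimum depth for a pooled estimate
            freqs = {base: round(n / depth, 4) for base, n in counts.items()}
            print(col.reference_pos + 1, depth, freqs)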

I'm also interested in looking for evidence of small-scale indels and large deletions, but have been having difficulty making progress on this. I've done most of my analysis with maq, which doesn't do gapped alignments, and I have no paired-end data. I'm looking into indel detection with bwa, but having trouble migrating over to the newer file formats.

Let me know if you want to discuss more details of our projects, as they seem to be quite similar.
dfornika is offline   Reply With Quote
Old 09-16-2010, 03:31 PM   #11
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

In Illumina sequencing, there are typically two types of duplicates: PCR duplicates and optical duplicates. Optical duplicates are sequences that in fact come from a single cluster but were identified by the software as multiple adjacent clusters. They can be identified without alignment: you just check the sequence and the cluster coordinates on the image.

PCR duplicates are usually identified after alignment. Although there are alignment-free approaches, they are less powerful when an alignment is available. Dedup typically works by identifying read pairs with identical 5'-end coordinates (3'-end coordinates are not considered). The chance of two 5'-end coordinates being identical depends on the coverage, but is usually small. You can calculate the theoretical false dedup rate with my formula, 0.28*m/(s*L), where m is the number of read pairs, s is the standard deviation of the insert size distribution and L is the length of the genome; a sharp insert size distribution (small s) unfortunately increases the false dedup rate. If the empirical duplicate rate is much higher than this 0.28*m/(s*L), the flagged reads are mostly true PCR duplicates.
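Plugging some made-up but plausible numbers into that formula:

Code:
# Theoretical fraction of read pairs falsely flagged as duplicates:
# 0.28 * m / (s * L)
m = 100e6   # read pairs
s = 30.0    # standard deviation of the insert size distribution (bp)
L = 3.0e9   # genome length (bp)
print("expected false dedup rate: %.4f%%" % (100 * 0.28 * m / (s * L)))  # ~0.03%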

Dedup for single-end reads has a high false positive rate on deep data. This is fine for SNP calling, because something like 40X coverage is still enough to call SNPs. However, it is dangerous to dedup SE reads for RNA-seq or ChIP-seq, where the read count matters. As mrawlins has suggested, it would be better to account for duplicates in your read-counting model than to run a dedup program.

The expected rate of PCR duplicates is 0.5*m/N (where m is the number of sequenced reads and N is the number of distinct DNA molecules before amplification). It is not affected by the number of PCR cycles, at least not greatly. The key to reducing PCR duplicates is to get enough DNA (a large N). Also, the dup rate is proportional to m: the more you sequence, the higher the duplicate rate you get.
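And for the duplicate rate itself, again with made-up numbers just to show the scale:

Code:
# Expected PCR duplicate rate: 0.5 * m / N
m = 100e6       # sequenced reads
N = 2000e6      # distinct DNA molecules before amplification
print("expected duplicate rate: %.1f%%" % (100 * 0.5 * m / N))         # 2.5%
# A molecular bottleneck (10x less input DNA) pushes the same sequencing
# effort to a much higher duplicate rate:
print("with 10x less input:     %.1f%%" % (100 * 0.5 * m / (N / 10)))  # 25.0%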
lh3 is offline   Reply With Quote
Old 09-20-2010, 03:16 AM   #12
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by foxyg View Post
Can you point out which paper it is please?
The paper is the Bainbridge 2010 paper in Genome Biology:

http://genomebiology.com/2010/11/6/R62

Table 3 is the relevant one.

Page 3 describes the treatment of PE as SE in more detail:

PE sequencing, in contrast, allows us to use information about both the start and the end of the capture-DNA fragment in order to determine whether the data are derived from independent DNA molecules. Thus, the increased information content of PE data allows us to reduce the misidentification of duplicate reads.

To test these theories we also analyzed the PE data as if they were generated from a single-ended frag library. This caused the on-target alignment rate to drop slightly to 73% and the duplicate rate to nearly quadruple to 27.6%, virtually identical to the Illumina frag library duplicate rate.
NGSfan is offline   Reply With Quote
Old 10-15-2010, 09:08 AM   #13
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

It has just occurred to me that if PCR duplicates are unbiased (and in fact they are), we may not need to dedup, because the read count is then also unbiased. Duplicates affect the variance but not the mean. Dedup is more important for variant discovery: although the read count is unbiased, the larger variance will lead to more false calls.

Am I about right?
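A toy simulation of that intuition (purely illustrative, assuming every molecule gets an extra copy with the same probability regardless of locus; numpy is used for convenience):

Code:
import numpy as np

rng = np.random.default_rng(1)

# Toy model of one locus: the number of distinct molecules is Poisson(lam);
# unbiased PCR duplication then adds a second copy of each molecule with
# probability p, regardless of which locus it came from.
lam, p, trials = 50, 0.2, 20000
molecules = rng.poisson(lam, size=trials)
duplicates = rng.binomial(molecules, p)   # extra copies per trial
obs = molecules + duplicates

print("mean:", round(float(obs.mean()), 2), "(expected", lam * (1 + p), ")")
print("variance:", round(float(obs.var(ddof=1)), 2),
      "vs", lam * (1 + p), "for a Poisson of the same mean")
# The mean is just scaled by (1 + p) at every locus, so ratios between loci
# are preserved; the inflated variance is where duplicates hurt, e.g. in
# variant calling, where stacked errors can look like real alleles.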

Last edited by lh3; 10-15-2010 at 09:11 AM.
lh3 is offline   Reply With Quote
Old 10-15-2010, 09:36 AM   #14
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258

Quote:
Originally Posted by lh3 View Post
It has just occurred to me that if PCR duplicates are unbiased (and in fact they are), we may not need to dedup, because the read count is then also unbiased. Duplicates affect the variance but not the mean. Dedup is more important for variant discovery: although the read count is unbiased, the larger variance will lead to more false calls.

Am I about right?
I'm probably going to write something wrong, but for ChIP-seq (and RNA-seq) the sample is biased (due to enrichment), and I believe this bias is propagated through PCR (and into the read counts)... the rich get richer, and the most-represented sequences are more likely to be resequenced...
Ok, I'll stop here before I harm my credibility in this community :-)
dawe is offline   Reply With Quote
Old 10-15-2010, 10:07 AM   #15
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Thanks, dawe. I had not thought about that. But naively, if PCR is unbiased (i.e. every sequence is amplified for the same number of cycles), it should preserve the ratio of read counts between different loci. The problem is the enlarged variance, but perhaps the variance caused by duplicates is still small in comparison to the variance in gene expression.

I am not familiar with chip/rna-seq. Probably I am saying something wrong and my intuition is always unreliable. I just want to know what is the true answer...
lh3 is offline   Reply With Quote
Old 10-15-2010, 11:26 AM   #16
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258

Quote:
Originally Posted by lh3 View Post
The problem is the enlarged variance, but perhaps the variance caused by duplicates is still small in comparison to the variance in gene expression.
Hopefully I will be able to evaluate this (when I have some time)... probably there's a paper about this that I haven't read :-D
It's time to make a call for collaborative work on the effect of sequence duplicates on these kinds of experiments.

Quote:
Originally Posted by lh3 View Post
I just want to know what is the true answer...
I want that too :-)

d
dawe is offline   Reply With Quote
Old 11-19-2010, 02:07 PM   #17
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Quote:
Originally Posted by kmcarr View Post
Even that 8% may be an overestimate of the true PCR duplicates. I recently attended the Illumina Midwest Users Group meeting, and an Illumina scientist presented some data on duplicate identification (it may have been the data you referred to, since the percentages sound about the same). However, they went a step further in distinguishing fragment duplicates from PCR duplicates. They prepared their paired-end libraries using Illumina MID-tag-style adapters, but instead of a finite set of known MID sequences, the adapters were constructed with random bases where the barcode would be. Now for each cluster they had three data points to compare: the alignment positions of reads 1 and 2 on the reference genome, plus the random 6 bp sequence in the MID position. A read would need to match all three of these to be called a PCR duplicate. When they added these random tags, the number of identified duplicates dropped from 8% to ~1%.
Then is there still value in doing duplicate removal?
bioinfosm is offline   Reply With Quote
Old 07-01-2011, 12:37 PM   #18
husamia
Member
 
Location: cinci

Join Date: Apr 2010
Posts: 66

I don't do the bench work, but from discussing the steps that are followed I have come to believe that there are multiple sensitive enzymatic steps that are known to introduce bias by their nature, and we are assessing duplicates only after all of those steps have been completed: amplification, adapter ligation, re-amplification, etc. I believe it depends on the preparation and approach how to account for such known sources of bias. For SNP calling, in my experience, removing duplicates is unnecessary. I suggest inspecting the alignment as a secondary step to see whether there are duplicates that may prompt further investigation.
husamia is offline   Reply With Quote
Old 07-01-2011, 02:42 PM   #19
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Quote:
Originally Posted by husamia View Post
For SNP calling, in my experience, removing duplicates is unnecessary. I suggest inspecting the alignment as a secondary step to see whether there are duplicates that may prompt further investigation.
The first time I and a few others learned about PCR duplicates was around the middle of 2007, when Illumina (at that time Solexa) was collaborating with Sanger on SNP discovery in data from a flow-sorted X chromosome. Because Illumina/Solexa was short of DNA, they used the very small amount of remaining DNA to construct the library. When we looked at structural variation and SNP calls from that data set, we found many recurrent "sequencing" errors. Richard pointed out that these were likely to be caused by PCR. I then implemented the "rmdup" component in maq. When we applied that, we got much cleaner SNP/SV calls. In the end, the paper on this data set was published in Nature in 2008 as one of the first three whole-genome resequencing papers using Illumina.

PCR duplicates really do affect SNP calling. It is true that when the duplicate rate is low they probably have little impact on the final results, but for most resequencing projects removing duplicates does almost no harm. Why not do it?
lh3 is offline   Reply With Quote
Old 07-01-2011, 06:03 PM   #20
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213

Quote:
Originally Posted by husamia View Post
I don't do the bench work, but from discussing the steps that are followed I have come to believe that there are multiple sensitive enzymatic steps that are known to introduce bias by their nature, and we are assessing duplicates only after all of those steps have been completed: amplification, adapter ligation, re-amplification, etc. I believe it depends on the preparation and approach how to account for such known sources of bias. For SNP calling, in my experience, removing duplicates is unnecessary. I suggest inspecting the alignment as a secondary step to see whether there are duplicates that may prompt further investigation.
Removing duplicates has been demonstrated a number of times to improve SNP detection, particularly on samples sequenced at low read depths and in cancer samples. If at this point you aren't seeing a benefit to de-dupping, I think the community would be very interested in your evidence of that.

The polymerases we use for PCR are fairly biased and do not amplify linearly across all regions, so PCR duplicates likely have a dramatic effect on copy number detection, SV detection, ChIP-seq, and RNA-seq in particular. Because duplicate removal since those first papers has been such a common part of sequence analysis, I do not recall a study actually assessing how they impact CN/ChIP-seq/RNA-seq, but I do recall in our early whole genome sequencing of highly aberrant cancers at UCLA that implementing duplicate removal significantly improved SV detection. It certainly wouldn't be that difficult of an experiment to demonstrate the exact effect of de-dupping in the modern era if someone had a few spare cycles to find out (then again, it's probably been done and I just don't know the paper off the top of my head).

That said, think about what a waste PCR duplicates are. There is little that causes as much attrition as PCR in a sequencing experiment. We're talking 5-20% of sequence data generated worldwide being thrown out. We can get it pretty low with a great deal of care calibrating our protocols (and a little bit of luck), but it still accounts for a significant percent of any dataset.

Some potential solutions are revealing themselves: http://www.nature.com/nprot/journal/....2011.345.html
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote