SEQanswers


Old 09-17-2009, 06:43 PM   #1
vasvale
Member
 
Location: Seattle

Join Date: Mar 2008
Posts: 29
Default duplicate reads removal

Is there any software that removes duplicate single-end reads? (CASAVA does it for paired-end reads only.)
Old 09-18-2009, 05:28 AM   #2
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

According to the man page, SAMtools has a mode to do this:

http://samtools.sourceforge.net/samtools.shtml

rmdup: samtools rmdup <input.srt.bam> <out.bam>
Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with the highest mapping quality. This command ONLY works with FR orientation and requires ISIZE to be correctly set.

rmdupse: samtools rmdupse <input.srt.bam> <out.bam>
Remove potential duplicates for single-end reads. This command will treat all reads as single-end even if they are in fact paired.
Old 09-18-2009, 05:43 AM   #3
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

For duplicate removal, Picard is recommended. It does a better job than samtools-C.
Old 02-03-2010, 03:48 AM   #4
Nomijill
Member
 
Location: Southwest Florida

Join Date: Sep 2009
Posts: 24
Default Duplicate removal

Hi,

I am educating myself on duplicate removal. Why/How is Picard better than Samtools?

Thanks.
Old 02-03-2010, 06:13 AM   #5
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Picard removes duplicates across chromosomes, but samtools cannot.
Old 10-29-2010, 12:01 PM   #6
JohnK
Senior Member
 
Location: Los Angeles, China.

Join Date: Feb 2010
Posts: 106
Default

Quote:
Originally Posted by lh3 View Post
Picard removes duplicates across chromosomes, but samtools cannot.
Is that the only notable difference?
Old 11-09-2010, 11:06 PM   #7
corthay
Member
 
Location: japan

Join Date: Oct 2008
Posts: 25
Default Removing duplicates before mapping.

Hi,

Is there any software that removes duplicates of PE or MP reads
before mapping? I would like to remove duplicates before doing
de novo assembly.
Thanks.

Corthay
Old 11-10-2010, 05:11 AM   #8
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

There is no way to determine what is a PCR duplicate at that level; that is why it has to be done at the mapping level. Even then, not all of them are true PCR duplicates (see lh3's statistical calculation of the expected number of PCR duplicates to find in a sample).
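In the spirit of the calculation mentioned above (my own sketch, not lh3's actual derivation): if N reads are sampled uniformly at random from a library of C unique molecules, the expected number of distinct molecules observed is C*(1 - exp(-N/C)), and the remaining reads are duplicates by construction.

```python
import math

def expected_dup_fraction(n_reads, library_size):
    """Expected duplicate fraction when n_reads are sampled
    uniformly from library_size unique molecules."""
    distinct = library_size * (1 - math.exp(-n_reads / library_size))
    return 1 - distinct / n_reads

# Sequencing 10M reads from a library of 50M unique molecules:
print(round(expected_dup_fraction(10e6, 50e6), 3))  # 0.094
```

So even with a perfectly complex library, deeper sequencing alone drives the duplicate rate up; the observed rate reflects sampling depth relative to library complexity, not just PCR problems.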
__________________
-drd
Old 11-10-2010, 06:04 AM   #9
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

It is possible to dedup before mapping. You may hash the first 14bp of each end and discard a pair if its 14+14bp key coincides with that of another pair. This method is not as good as deduping after mapping, but should be good enough. On the other hand, I do not think deduping is really necessary for assembly.
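A minimal sketch of this idea (my own illustration with made-up reads, not lh3's code): keep the first pair seen for each (read-1 prefix, read-2 prefix) key and drop later pairs with the same key.

```python
def dedup_pairs(pairs, k=14):
    """Keep the first read pair seen for each (first k bp of read 1,
    first k bp of read 2) key; later pairs with the same key are
    treated as likely PCR duplicates."""
    seen = set()
    kept = []
    for r1, r2 in pairs:
        key = (r1[:k], r2[:k])
        if key not in seen:
            seen.add(key)
            kept.append((r1, r2))
    return kept

pairs = [
    ("ACGTACGTACGTACGTAA", "TTGCATTGCATTGCATGG"),
    ("ACGTACGTACGTACGTCC", "TTGCATTGCATTGCATAA"),  # same 14+14bp key: dropped
    ("GGGTACGTACGTACGTAA", "TTGCATTGCATTGCATGG"),  # different read-1 prefix: kept
]
print(len(dedup_pairs(pairs)))  # 2
```

Note the trade-off: sequencing errors inside the first 14bp make true duplicates look distinct, and identical prefixes from genuinely different fragments are lost, which is why deduping after mapping is more accurate.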
Old 11-11-2010, 03:25 AM   #10
corthay
Member
 
Location: japan

Join Date: Oct 2008
Posts: 25
Default

Quote:
Originally Posted by lh3 View Post
It is possible to dedup before mapping. You may hash the first 14bp of each end and discard a pair if its 14+14bp key coincides with that of another pair. This method is not as good as deduping after mapping, but should be good enough. On the other hand, I do not think deduping is really necessary for assembly.
Thanks for the idea. I just wanted to check whether deduping is necessary for assembly, as the Panda Genome paper did it for long-insert-size libraries.

Corthay
Old 11-11-2010, 05:14 PM   #11
JohnK
Senior Member
 
Location: Los Angeles, China.

Join Date: Feb 2010
Posts: 106
Default

Quote:
Originally Posted by corthay View Post
Thanks for the idea. I just wanted to check whether deduping is necessary for assembly, as the Panda Genome paper did it for long-insert-size libraries.

Corthay
Hi, Corthay.

I always remove duplicates, for both SE and PE data. For PE you should expect to remove between 5 and 15 percent, and for SE it will be significantly larger, anywhere from 30 to possibly even 60 percent of your reads. It depends on the quality of the PCR step of course, which I personally know little about. Also, whether removing duplicates matters really depends on what you're doing. If you're working with NGS data and want to accurately determine all the SNPs in your data, you probably don't have time to go through each called variant, so you want the most accurate calls and should remove the duplicates. However, if you have a single gene of interest, you can just as easily visually inspect whatever region or SNP, regardless of whether you removed the duplicates, and determine whether that 'call' is valid or not.
Old 03-03-2011, 02:21 AM   #12
ikrier
Member
 
Location: Lausanne

Join Date: Dec 2009
Posts: 19
Default

I have tried to use the rmdup command and have found something quite strange.

I have a SAM file from my alignment. I convert it to BAM, and then filter on mapping quality with:
/data/common/programs/samtools/samtools view -h $f.srt.bam | awk '{if($5 >= 10 || $1 == "@SQ" || $1 == "@PG") print $0}' | /data/common/programs/samtools/samtools view -bS - > $f.srt.unique-qual-ge10.bam

This gives me the file I want to work with. I need an output for quest, with duplicates removed, so I tried two approaches:
1. First get the fields in the format needed for quest, then use the UNIX sort command to keep the alignments with a unique chromosome, position and strand.
2. First use rmdup to get a new bam file, then get the fields in the format needed for quest.

The two results are different. I would have assumed that rmdup removes alignments with the same chromosome, strand and position, so that extracting sequences with sort -u on these fields would give the same number in the end.

Can anyone explain this?
Old 03-03-2011, 08:56 AM   #13
ikrier
Member
 
Location: Lausanne

Join Date: Dec 2009
Posts: 19
Default

We looked into it in the end, and it turns out that reads with an insertion/deletion in the alignment get their start position shifted in the output, but samtools rmdup takes this into account when removing PCR duplicates.

I have definitely learned something today.
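For illustration (a simplified sketch of the coordinate arithmetic, not samtools source code): for a reverse-strand read the coordinate that matters for duplicate detection is the rightmost reference position, which depends on the CIGAR, so two reads with different POS values can still be duplicates of each other when one carries an indel.

```python
import re

def ref_span(cigar):
    """Number of reference bases consumed by a CIGAR string
    (M, D, N, = and X consume reference; I, S, H and P do not)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")

def rightmost_pos(pos, cigar):
    """1-based rightmost reference coordinate of an alignment."""
    return pos + ref_span(cigar) - 1

# Different POS, but a 2bp deletion makes the 3' ends coincide:
print(rightmost_pos(102, "76M"))       # 177
print(rightmost_pos(100, "40M2D36M"))  # 177
```

This is why a naive sort -u on (chromosome, position, strand) of the output fields disagrees with rmdup: it compares raw output positions, not CIGAR-adjusted alignment ends.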
Old 03-09-2011, 09:17 AM   #14
ttnguyen
Member
 
Location: Ireland

Join Date: Mar 2010
Posts: 41
Default

What is an acceptable PCR duplicate percentage in a ChIP-seq dataset and in an RNA-seq dataset after mapping?

In my ChIP-seq dataset, I found 66% duplicates after mapping using Picard. I think this is too high, so I want to know what an acceptable duplicate level is.
Old 03-09-2011, 10:04 AM   #15
JohnK
Senior Member
 
Location: Los Angeles, China.

Join Date: Feb 2010
Posts: 106
Default

Quote:
Originally Posted by ttnguyen View Post
What is an acceptable PCR duplicate percentage in a ChIP-seq dataset and in an RNA-seq dataset after mapping?

In my ChIP-seq dataset, I found 66% duplicates after mapping using Picard. I think this is too high, so I want to know what an acceptable duplicate level is.

It's not necessarily too high. It really depends on the starting DNA quantity and how much you amplify it by PCR. I've seen anywhere from 30% up to over 80%, depending on the context of the protein we're after.
Old 07-01-2011, 01:56 PM   #16
husamia
Member
 
Location: cinci

Join Date: Apr 2010
Posts: 66
Default

I have seen a duplicate percentage of 33% in a targeted high-coverage experiment, and based on my assessment of false positives and false negatives I am happy not removing duplicates.
Old 07-01-2011, 07:07 PM   #17
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by husamia View Post
I have seen a duplicate percentage of 33% in a targeted high-coverage experiment, and based on my assessment of false positives and false negatives I am happy not removing duplicates.
It depends on what your experiment is. In some cases it has very little effect, in some it can be detrimental. In no case that I've heard of does leaving PCR duplicates in your data improve your results, however.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Old 01-07-2015, 03:15 AM   #18
panos_ed
Member
 
Location: Geneva, Switzerland

Join Date: May 2010
Posts: 11
Default

Quote:
Originally Posted by lh3 View Post
It is possible to dedup before mapping. You may hash the first 14bp of each end and discard a pair if its 14+14bp key coincides with that of another pair. This method is not as good as deduping after mapping, but should be good enough. On the other hand, I do not think deduping is really necessary for assembly.
Why 14bp? Why not something else?
Old 01-07-2015, 08:07 PM   #19
transforu
Junior Member
 
Location: Taiwan

Join Date: Sep 2014
Posts: 5
Default

Hi everybody,

I have a question about removing duplicates. This study, http://www.ncbi.nlm.nih.gov/Traces/sra/?study=ERP000603, has 2 experiments and 13 runs. At what level should duplicates be removed: per study, per experiment, or per run?

Thanks!
Old 01-08-2015, 01:59 AM   #20
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

Remove PCR duplicates by library (and this looks like one big "library" mixed from 4 PCR pools). Optical duplicates are removed per lane (but I guess Picard would detect the within-lane duplicates, given proper read group labelling).