SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
example for using Picard removing duplicate reads? fabrice Bioinformatics 9 10-18-2013 03:32 AM
Problem to use Bedtools after filtering uniquely mapped reads with samtools eilosei Bioinformatics 2 12-21-2011 04:51 PM
Removing similar sequence reads loba17 Bioinformatics 4 10-17-2011 08:31 AM
Removing duplicate reads for tophat? hong_sunwoo RNA Sequencing 2 10-09-2010 01:46 AM
Removing duplicate reads from multigig .csfasta Bueller_007 Bioinformatics 7 06-26-2010 04:07 PM

Reply
 
Thread Tools
Old 02-01-2011, 02:42 PM   #1
cbl
Junior Member
 
Location: California

Join Date: Feb 2011
Posts: 1
Default Problem removing duplicate reads? (samtools and picard)

Hello,

I am having trouble removing duplicate reads with samtools and picard. I have a sam/bam file that is of paired-end illumina reads mapped to the genome (minimal file below). I have coordinate-sorted the sam/bam file and tried both:

java -jar picard-tools-1.38/MarkDuplicates.jar I=problemSorted.bam O=out.sam M=log.txt REMOVE_DUPLICATES=true
-and-
samtools rmdup problemSorted.bam out.bam

and neither seem to remove the lines that I believe are duplicates. The two read pairs have the exact same sequence and mapping, so I believe I should only have one pair after removing duplicates. I have read that samtools may not work in this case because the two ends of the reads map to different chromosomes, but I am surprised that picard is not removing them.

The duplicate reads are removed by:
samtools rmdup problemSorted.bam out.bam -S
but I am worried that treating all my paired-end reads as single reads when removing duplicates may lead to other issues.

Thanks for your help! Minimal input file is below:

@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:28195914
@SQ SN:chr2 LN:19369704
HWI-EAS000_1:5:71:14037:15683:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC IIIIIIIIIIGIIIIIHIIIIIGIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
HWI-EAS000_1:5:80:6679:2759:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC HGIIIIHIHHIIIHIIIIIIIIHIIIIIIIIIIIIH X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
HWI-EAS000_1:5:71:14037:15683:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIIIIIIIHIHIHIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
HWI-EAS000_1:5:80:6679:2759:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIHIIIHIIIIIIIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
cbl is offline   Reply With Quote
Old 02-01-2011, 03:25 PM   #2
d17
Member
 
Location: United States

Join Date: Sep 2008
Posts: 27
Default

According to the SAM format specification (http://samtools.sourceforge.net/SAM1.pdf) , read names must be the same for read pairs. Your read pairs do not seem to follow this rule, so it appears they are not being recognized as pairs by MarkDuplicates.
d17 is offline   Reply With Quote
Old 07-22-2011, 11:07 AM   #3
shruti
Member
 
Location: Milan

Join Date: Mar 2010
Posts: 35
Default

Hi,

I'm also facing the same problem here is case in question. The read names are same.

HWI-ST220_63:6:2108:11340:115090 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIHGIGJIJJIJHGIHGGJJJJJJJJIJJJIJJJIJGHHHHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:2108:11340:115090 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA HHJIJJIIIJJJJJJJJIHDJJJIJIJJJJJIIJJJJHHHHHDFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:1108:12937:121403 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIGJIHGJJJJIHGJJJIJJJJJJIJHJIJIJJJIJHHHFHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:1108:12937:121403 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA JJIIJIJIGJJGIIJJIGE;JJJJIIJJJJIIIJJJJHFHHHFFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

Any pointers would be most appreciated.
shruti is offline   Reply With Quote
Old 07-22-2011, 11:28 AM   #4
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Need to include the command you used, etc.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote
Old 07-24-2011, 01:46 AM   #5
shruti
Member
 
Location: Milan

Join Date: Mar 2010
Posts: 35
Default

I had 1st sorted the file
samtools sort BothEndsMapped.bam BothEndsMapped.sorted
and then rmdup..
samtools rmdup BothEndsMapped.sorted.bam BothEndsMapped.sorted.unique.bam
I guess by default it does for paired end reads.
shruti is offline   Reply With Quote
Old 07-24-2011, 07:19 AM   #6
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi cbl,

Can you explain to us why duplicate removal, especially removing the pairs that map to different chr. is critical to your analysis? I always treat duplicate removal as nice-to-have but not must-to-have.

Thank you,
Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 07-24-2011, 08:24 AM   #7
shruti
Member
 
Location: Milan

Join Date: Mar 2010
Posts: 35
Default

Well I'm looking for cases of rearrangements hence, reads mapping in different chromosomes is important.
As far as removing duplicates is considered, its quite subjective to the need as well as coverage.
To start with being strict on the data I'm removing the duplicates. Cases of different reads showing same rearrangement would be say 90% chance that they are case of rearrangement (provided they are not into repeats/segmental duplicates). In case of low depth of coverage it would be quite difficult distinguish rearrangements due to the real data and PCR duplicates, so have to be little strict on that front.
shruti is offline   Reply With Quote
Old 07-24-2011, 08:43 AM   #8
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi shruti,

I understand your points. For me, I use a different approach in rearrangement analysis. I rely on a very tiny subset of the raw reads to infer the rearrangement. So at the beginning I use very nonrestrictive criteria to locate candidate reads that indicate potential rearrangement. Then I will dive into these reads. In the context of this post, including duplicate reads at the beginning does not hurt your chance finding/validating the rearrangement. When you have a small subset of candidate reads, you may look into duplication, most likely with custom scripts.

So my suggestion here is not to dwell too much on removing duplicated reads at the beginning as some general-purpose tools (e.g., samtools) may not be suitable for your specific needs.

Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 07-24-2011, 09:06 AM   #9
shruti
Member
 
Location: Milan

Join Date: Mar 2010
Posts: 35
Default

Well..yes I guess one can start with being lenient. Just curious, if you start with these reads, then how do you filter at the end. or do you also experimentally validate these cases. how much of a false positive do you observe.
shruti is offline   Reply With Quote
Old 07-24-2011, 11:45 AM   #10
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi shruti,

I usually use custom scripts to screen for 1) whether they map to the same chr. 2) how far apart and in which orientation 3) mapping quality and ambiguity 4) read counts (independent or PCR-duplication?). I am reluctant to put a hard number on false positive but I agree with you that it is critical to experimentally follow up and validate the models obtained from next-gen data.

Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 07-24-2011, 03:45 PM   #11
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by shruti View Post
I had 1st sorted the file
samtools sort BothEndsMapped.bam BothEndsMapped.sorted
and then rmdup..
samtools rmdup BothEndsMapped.sorted.bam BothEndsMapped.sorted.unique.bam
I guess by default it does for paired end reads.
Use Picard MarkDuplicates instead and get back to us. Samtools rmdup is not recommended.

If you add more responses, copy and paste the exact command. Remove file names if they're sensitive.

Quote:
Originally Posted by DZhang View Post
Can you explain to us why duplicate removal, especially removing the pairs that map to different chr. is critical to your analysis? I always treat duplicate removal as nice-to-have but not must-to-have.
There has never been a demonstrated case where leaving duplicates in is advantageous.

In contrast, numerous analyses have demonstrated beneficial effects of removing duplicates.

Many discussions of this have been done on this site in the past. http://seqanswers.com/forums/showthread.php?t=6854 for example.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote
Old 07-24-2011, 04:12 PM   #12
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi Michael.James.Clark,

I looked at the thread you recommended - http://seqanswers.com/forums/showthread.php?t=6854. It appears to me that you were the only one who positively and absolutely advocate for duplicate removal. Others were either not sure or took a balanced approach as to the potential pros and cons. I certainly recognize the benefits of duplicate removal, due to the limitations of current laboratory technology and bioinformatic methods, I would not put it as must-have as long as the user considers the impact of duplicate when interpreting the analysis results.

Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 07-24-2011, 04:55 PM   #13
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by DZhang View Post
I looked at the thread you recommended - http://seqanswers.com/forums/showthread.php?t=6854. It appears to me that you were the only one who positively and absolutely advocate for duplicate removal. Others were either not sure or took a balanced approach as to the potential pros and cons.
I think you ought to read that thread more carefully then, because that is not what I said, and I was not the only one to explain that duplicate removal benefits many downstream analyses. (Oh, by the way, I chose that thread because it is in my history since I replied to it, but there are many others on the same topic that you could read up on that might interest you.)

I will clarify my statement for you: Were we talking about ChIP-seq or RNA-seq here, I might agree that duplicate removal is a questionable step to take. Certainly if we were talking about single-end data, I would say that it's probably not necessary (and, in fact, is likely throwing the baby out with the bathwater in many cases).

However, it seems this is paired WGS (or some other DNA-seq) with the goal of variant detection. In that case, de-dupping has been shown to be beneficial for variant calling (SNVs, indels, SVs, CNVs).

With regards to SVs, de-dupping is particularly useful for long mate-paired data. For simple PE data at adequate depth, it may have little impact. However, I've not seen anyone demonstrate it hurts SV detection in those cases, therefore why not do it? Especially if you'll be using that same data for SNV calling, where it has demonstrably significant effects.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote
Old 07-24-2011, 05:21 PM   #14
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi Michael.James.Clark,

Thank you for your additional comment. I would not characterize my opinion on duplicate removal as "questionable" step. I completely recognize the benefits of doing it at the beginning of the pipeline, but would not take it as a mandatory step - "must-have". In this particular case, cbl had executed the duplicate removal. My suggestion to cbl was to go ahead with the downstream analysis instead of halting the project progress. The mere fact that both programs, samtools and picard, have built-in functions to remove duplicates but one may work and the other may not work in a particular case exactly illustrates there is a judgment call on duplicate removal.

It of course has value to find a solution that can remove the duplicates that are mapped to diffferent chr. and I hope your suggestion of picard will resolve that.

Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 07-24-2011, 05:32 PM   #15
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by DZhang View Post
The mere fact that both programs, samtools and picard, have built-in functions to remove duplicates but one may work and the other may not work in a particular case exactly illustrates there is a judgment call on duplicate removal.
Samtools rmdup and Picard MarkDuplicates are not different due to "judgment calls".

See here: http://sourceforge.net/apps/mediawik...tools_rmdup.3F

"samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates."

That's because of the way samtools rmdup works... it looks for successive reads mapping to the same spot after sorting without considering the pair. It's not because the authors intended it to miss those from what I understand--it's because way back when they wrote it, they didn't consider that case.

(By the way, that also explains why shruti is seeing interchromosomal duplicates in her results, and why I told her to use Picard rather than samtools in the first place. )

Also, I'm not sure actually if samtools rmdup considers the second end at all. I think it does, but I think it simply doesn't work on interchromosomals. Either way, it's safest to just use Picard from what I've seen.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]

Last edited by Michael.James.Clark; 07-24-2011 at 05:35 PM.
Michael.James.Clark is offline   Reply With Quote
Old 07-24-2011, 07:46 PM   #16
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi Michael.James.Clark,

Thank you for the clarfication. I did not know actually the difference was clearly documented. I learn a few new things in this field every day!

This may be suited for a new thread but what's your view on duplicate removal for single-end reads?

Regards,
Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 07-25-2011, 02:25 AM   #17
shruti
Member
 
Location: Milan

Join Date: Mar 2010
Posts: 35
Default

Hi Michael,
Thanks for the info. I was not aware of this limitation of rmdup.
Will try Picard and see. (and next time I would be cautious of the filenames I write)

Hi Dzhang,
Thanks for the info on your pipeline.
shruti is offline   Reply With Quote
Old 01-16-2015, 09:13 AM   #18
Glouglou14
Junior Member
 
Location: FRANCE

Join Date: Jan 2011
Posts: 2
Default

Quote:
Originally Posted by shruti View Post
Hi,

I'm also facing the same problem here is case in question. The read names are same.

HWI-ST220_63:6:2108:11340:115090 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIHGIGJIJJIJHGIHGGJJJJJJJJIJJJIJJJIJGHHHHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:2108:11340:115090 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA HHJIJJIIIJJJJJJJJIHDJJJIJIJJJJJIIJJJJHHHHHDFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:1108:12937:121403 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIGJIHGJJJJIHGJJJIJJJJJJIJHJIJIJJJIJHHHFHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
HWI-ST220_63:6:1108:12937:121403 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA JJIIJIJIGJJGIIJJIGE;JJJJIIJJJJIIIJJJJHFHHHFFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

Any pointers would be most appreciated.
I had the same issue than shruti. Here is how i fixed it (in case is usefull for someone else...).
Samtools rmdup was having trouble with "supplementary alignment". I removed them (samtools view -b -F 0x0800 foo.bam > foo.noSupAln.bam) and everything went fine.
Glouglou14 is offline   Reply With Quote
Old 09-17-2015, 05:34 AM   #19
Momina
Member
 
Location: uk

Join Date: Aug 2015
Posts: 19
Default

hi
from which file we have to remove duplicates either from file.bam or file.sorted.bam?
Momina is offline   Reply With Quote
Old 09-17-2015, 12:01 PM   #20
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It's simplest to remove duplicates from a sorted BAM file (in fact, most tools require that the input is sorted).
dpryan is offline   Reply With Quote
Reply

Tags
duplicates, picard, samtools, samtools picard

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO