![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
example for using Picard removing duplicate reads? | fabrice | Bioinformatics | 9 | 10-18-2013 03:32 AM |
Problem to use Bedtools after filtering uniquely mapped reads with samtools | eilosei | Bioinformatics | 2 | 12-21-2011 04:51 PM |
Removing similar sequence reads | loba17 | Bioinformatics | 4 | 10-17-2011 08:31 AM |
Removing duplicate reads for tophat? | hong_sunwoo | RNA Sequencing | 2 | 10-09-2010 01:46 AM |
Removing duplicate reads from multigig .csfasta | Bueller_007 | Bioinformatics | 7 | 06-26-2010 04:07 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: California Join Date: Feb 2011
Posts: 1
|
![]()
Hello,
I am having trouble removing duplicate reads with samtools and picard. I have a sam/bam file that is of paired-end illumina reads mapped to the genome (minimal file below). I have coordinate-sorted the sam/bam file and tried both: java -jar picard-tools-1.38/MarkDuplicates.jar I=problemSorted.bam O=out.sam M=log.txt REMOVE_DUPLICATES=true -and- samtools rmdup problemSorted.bam out.bam and neither seem to remove the lines that I believe are duplicates. The two read pairs have the exact same sequence and mapping, so I believe I should only have one pair after removing duplicates. I have read that samtools may not work in this case because the two ends of the reads map to different chromosomes, but I am surprised that picard is not removing them. The duplicate reads are removed by: samtools rmdup problemSorted.bam out.bam -S but I am worried that treating all my paired-end reads as single reads when removing duplicates may lead to other issues. Thanks for your help! Minimal input file is below: @HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:28195914 @SQ SN:chr2 LN:19369704 HWI-EAS000_1:5:71:14037:15683:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC IIIIIIIIIIGIIIIIHIIIIIGIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U HWI-EAS000_1:5:80:6679:2759:0:1:1 113 chr1 11732000 37 36M chr2 54321 0 CTCCCATCTCTATTCCATTTCCTCTGCCATGTATTC HGIIIIHIHHIIIHIIIIIIIIHIIIIIIIIIIIIH X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U HWI-EAS000_1:5:71:14037:15683:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIIIIIIIHIHIHIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U HWI-EAS000_1:5:80:6679:2759:0:2:1 177 chr2 54321 37 36M chr1 11732000 0 GTATGTACTGTATTATCTGAGTTTTTTATTCACAAG IIIIHIIIHIIIIIIIIIIIIIIIIIIIIIIIIIII X0:i:1 X1:i:0 MD:Z:36 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U |
![]() |
![]() |
![]() |
#2 |
Member
Location: United States Join Date: Sep 2008
Posts: 27
|
![]()
According to the SAM format specification (http://samtools.sourceforge.net/SAM1.pdf) , read names must be the same for read pairs. Your read pairs do not seem to follow this rule, so it appears they are not being recognized as pairs by MarkDuplicates.
|
![]() |
![]() |
![]() |
#3 |
Member
Location: Milan Join Date: Mar 2010
Posts: 35
|
![]()
Hi,
I'm also facing the same problem here is case in question. The read names are same. HWI-ST220_63:6:2108:11340:115090 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIHGIGJIJJIJHGIHGGJJJJJJJJIJJJIJJJIJGHHHHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 HWI-ST220_63:6:2108:11340:115090 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA HHJIJJIIIJJJJJJJJIHDJJJIJIJJJJJIIJJJJHHHHHDFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 HWI-ST220_63:6:1108:12937:121403 177 chr1 14480159 37 50M chr5 84273185 0 GCTACCACTCAGAAACTTGGAGAAATCACAGAGAGCCATGGACTTTATTT JIIGJIHGJJJJIHGJJJIJJJJJJIJHJIJIJJJIJHHHFHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:3X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 HWI-ST220_63:6:1108:12937:121403 113 chr5 84273185 37 50M chr1 14480159 0 AACAGAATTCCAGTGAGATACCTTCTTCCTTCTCAGGATTTTACATGTTA JJIIJIJIGJJGIIJJIGE;JJJJIIJJJJIIIJJJJHFHHHFFFFFCCC XT:A:U NM:i:0 SM:i:3AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 Any pointers would be most appreciated. |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Palo Alto Join Date: Apr 2009
Posts: 213
|
![]()
Need to include the command you used, etc.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog] Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post] Projects: U87MG whole genome sequence [Website] [Paper] |
![]() |
![]() |
![]() |
#5 |
Member
Location: Milan Join Date: Mar 2010
Posts: 35
|
![]()
I had 1st sorted the file
samtools sort BothEndsMapped.bam BothEndsMapped.sorted and then rmdup.. samtools rmdup BothEndsMapped.sorted.bam BothEndsMapped.sorted.unique.bam I guess by default it does for paired end reads. |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: East Coast, US Join Date: Jun 2010
Posts: 177
|
![]()
Hi cbl,
Can you explain to us why duplicate removal, especially removing the pairs that map to different chr. is critical to your analysis? I always treat duplicate removal as nice-to-have but not must-to-have. Thank you, Douglas www.contigexpress.com |
![]() |
![]() |
![]() |
#7 |
Member
Location: Milan Join Date: Mar 2010
Posts: 35
|
![]()
Well I'm looking for cases of rearrangements hence, reads mapping in different chromosomes is important.
As far as removing duplicates is considered, its quite subjective to the need as well as coverage. To start with being strict on the data I'm removing the duplicates. Cases of different reads showing same rearrangement would be say 90% chance that they are case of rearrangement (provided they are not into repeats/segmental duplicates). In case of low depth of coverage it would be quite difficult distinguish rearrangements due to the real data and PCR duplicates, so have to be little strict on that front. |
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: East Coast, US Join Date: Jun 2010
Posts: 177
|
![]()
Hi shruti,
I understand your points. For me, I use a different approach in rearrangement analysis. I rely on a very tiny subset of the raw reads to infer the rearrangement. So at the beginning I use very nonrestrictive criteria to locate candidate reads that indicate potential rearrangement. Then I will dive into these reads. In the context of this post, including duplicate reads at the beginning does not hurt your chance finding/validating the rearrangement. When you have a small subset of candidate reads, you may look into duplication, most likely with custom scripts. So my suggestion here is not to dwell too much on removing duplicated reads at the beginning as some general-purpose tools (e.g., samtools) may not be suitable for your specific needs. Douglas www.contigexpress.com |
![]() |
![]() |
![]() |
#9 |
Member
Location: Milan Join Date: Mar 2010
Posts: 35
|
![]()
Well..yes I guess one can start with being lenient. Just curious, if you start with these reads, then how do you filter at the end. or do you also experimentally validate these cases. how much of a false positive do you observe.
|
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: East Coast, US Join Date: Jun 2010
Posts: 177
|
![]()
Hi shruti,
I usually use custom scripts to screen for 1) whether they map to the same chr. 2) how far apart and in which orientation 3) mapping quality and ambiguity 4) read counts (independent or PCR-duplication?). I am reluctant to put a hard number on false positive but I agree with you that it is critical to experimentally follow up and validate the models obtained from next-gen data. Douglas www.contigexpress.com |
![]() |
![]() |
![]() |
#11 | ||
Senior Member
Location: Palo Alto Join Date: Apr 2009
Posts: 213
|
![]() Quote:
If you add more responses, copy and paste the exact command. Remove file names if they're sensitive. Quote:
In contrast, numerous analyses have demonstrated beneficial effects of removing duplicates. Many discussions of this have been done on this site in the past. http://seqanswers.com/forums/showthread.php?t=6854 for example.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog] Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post] Projects: U87MG whole genome sequence [Website] [Paper] |
||
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: East Coast, US Join Date: Jun 2010
Posts: 177
|
![]()
Hi Michael.James.Clark,
I looked at the thread you recommended - http://seqanswers.com/forums/showthread.php?t=6854. It appears to me that you were the only one who positively and absolutely advocate for duplicate removal. Others were either not sure or took a balanced approach as to the potential pros and cons. I certainly recognize the benefits of duplicate removal, due to the limitations of current laboratory technology and bioinformatic methods, I would not put it as must-have as long as the user considers the impact of duplicate when interpreting the analysis results. Douglas www.contigexpress.com |
![]() |
![]() |
![]() |
#13 | |
Senior Member
Location: Palo Alto Join Date: Apr 2009
Posts: 213
|
![]() Quote:
![]() I will clarify my statement for you: Were we talking about ChIP-seq or RNA-seq here, I might agree that duplicate removal is a questionable step to take. Certainly if we were talking about single-end data, I would say that it's probably not necessary (and, in fact, is likely throwing the baby out with the bathwater in many cases). However, it seems this is paired WGS (or some other DNA-seq) with the goal of variant detection. In that case, de-dupping has been shown to be beneficial for variant calling (SNVs, indels, SVs, CNVs). With regards to SVs, de-dupping is particularly useful for long mate-paired data. For simple PE data at adequate depth, it may have little impact. However, I've not seen anyone demonstrate it hurts SV detection in those cases, therefore why not do it? Especially if you'll be using that same data for SNV calling, where it has demonstrably significant effects.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog] Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post] Projects: U87MG whole genome sequence [Website] [Paper] |
|
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: East Coast, US Join Date: Jun 2010
Posts: 177
|
![]()
Hi Michael.James.Clark,
Thank you for your additional comment. I would not characterize my opinion on duplicate removal as "questionable" step. I completely recognize the benefits of doing it at the beginning of the pipeline, but would not take it as a mandatory step - "must-have". In this particular case, cbl had executed the duplicate removal. My suggestion to cbl was to go ahead with the downstream analysis instead of halting the project progress. The mere fact that both programs, samtools and picard, have built-in functions to remove duplicates but one may work and the other may not work in a particular case exactly illustrates there is a judgment call on duplicate removal. It of course has value to find a solution that can remove the duplicates that are mapped to diffferent chr. and I hope your suggestion of picard will resolve that. Douglas www.contigexpress.com |
![]() |
![]() |
![]() |
#15 | |
Senior Member
Location: Palo Alto Join Date: Apr 2009
Posts: 213
|
![]() Quote:
See here: http://sourceforge.net/apps/mediawik...tools_rmdup.3F "samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates." That's because of the way samtools rmdup works... it looks for successive reads mapping to the same spot after sorting without considering the pair. It's not because the authors intended it to miss those from what I understand--it's because way back when they wrote it, they didn't consider that case. (By the way, that also explains why shruti is seeing interchromosomal duplicates in her results, and why I told her to use Picard rather than samtools in the first place. ![]() Also, I'm not sure actually if samtools rmdup considers the second end at all. I think it does, but I think it simply doesn't work on interchromosomals. Either way, it's safest to just use Picard from what I've seen.
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog] Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post] Projects: U87MG whole genome sequence [Website] [Paper] Last edited by Michael.James.Clark; 07-24-2011 at 05:35 PM. |
|
![]() |
![]() |
![]() |
#16 |
Senior Member
Location: East Coast, US Join Date: Jun 2010
Posts: 177
|
![]()
Hi Michael.James.Clark,
Thank you for the clarfication. I did not know actually the difference was clearly documented. I learn a few new things in this field every day! This may be suited for a new thread but what's your view on duplicate removal for single-end reads? Regards, Douglas www.contigexpress.com |
![]() |
![]() |
![]() |
#17 |
Member
Location: Milan Join Date: Mar 2010
Posts: 35
|
![]()
Hi Michael,
Thanks for the info. I was not aware of this limitation of rmdup. Will try Picard and see. (and next time I would be cautious of the filenames I write) Hi Dzhang, Thanks for the info on your pipeline. |
![]() |
![]() |
![]() |
#18 | |
Junior Member
Location: FRANCE Join Date: Jan 2011
Posts: 2
|
![]() Quote:
Samtools rmdup was having trouble with "supplementary alignment". I removed them (samtools view -b -F 0x0800 foo.bam > foo.noSupAln.bam) and everything went fine. |
|
![]() |
![]() |
![]() |
#19 |
Member
Location: uk Join Date: Aug 2015
Posts: 19
|
![]()
hi
from which file we have to remove duplicates either from file.bam or file.sorted.bam? |
![]() |
![]() |
![]() |
#20 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
It's simplest to remove duplicates from a sorted BAM file (in fact, most tools require that the input is sorted).
|
![]() |
![]() |
![]() |
Tags |
duplicates, picard, samtools, samtools picard |
Thread Tools | |
|
|