SEQanswers (
-   Bioinformatics (
-   -   some basic questions about duplicate removal ? (

a_mt 06-26-2014 07:04 PM

some basic questions about duplicate removal ?
Hi all,

I am using the GATK pipeline for pre processing bam files after alignment with bwa mem. The original bam files after alignment shows I have (samtools flagstat command) - 173,460,757 reads (this is deep sequening exome data captured with agilent sure select 50 mb).

But after removing duplicates with Picard, I am left with 14,651,238 reads !! Thats like mere 20X coverage.

1. I would like to know whether this is normal in exome seq to find such huge amount duplicates? And some of the threads on other forums say its not wise to remove duplicates from deep sequencing data. Can anyone provide me some suggestions on this, like how you guys proceed in such scenario ?

2. And what is the difference between marking duplicates and removing duplicates ? I know marking adds a tag instead of completely removing the read. But, if the duplicate marked reads are not used in any of the downstream steps (like SNP calling) why is it suggested to simply marking it instead of removing it?

3. And while calculating coverage do we have to consider duplicate reads as well (original bam) or the final bam file with dups removed ?

Thank you.

Brian Bushnell 06-27-2014 09:14 AM

If you had low library complexity due to insufficient DNA, overamplification, contamination, or highly-biased capture, a lot of duplicates will be present. Sounds like there were problems with your library prep and maybe it should be redone; that level is much higher than I'd expect. But, just run pileup and see if there is enough coverage for whatever you're doing, which depends on the fraction of the area covered to at least X depth rather than the average coverage.

PCR duplicates should be removed before calling variations. But I would suggest removing only exact duplicate reads, rather than anything mapping to the same location even if they have some different base calls. And just to clarify, are these paired reads that you're removing based on both reads mapping to the same location?

Removing duplicates rather than marking them is more efficient as downstream programs don't need to process as much data. But, you can use marked duplicates to generate consensus if you want, when reads are low quality.

I use the unique coverage when calling variations as it's more relevant.

All times are GMT -8. The time now is 05:33 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.