06-26-2014, 08:04 PM   #1
Location: C:/Program files/Google/Chrome

Join Date: Jul 2012
Posts: 34
some basic questions about duplicate removal?

Hi all,

I am using the GATK pipeline to pre-process BAM files after alignment with bwa mem. According to samtools flagstat, the original BAM file after alignment contains 173,460,757 reads (this is deep-sequencing exome data captured with Agilent SureSelect 50 Mb).

But after removing duplicates with Picard, I am left with 14,651,238 reads!! That's a mere 20X coverage or so.

1. Is it normal in exome sequencing to find such a huge number of duplicates? Some threads on other forums say it's not wise to remove duplicates from deep sequencing data. Can anyone offer suggestions on how you proceed in such a scenario?

2. What is the difference between marking duplicates and removing duplicates? I know marking adds a tag instead of completely removing the read. But if the duplicate-marked reads are not used in any of the downstream steps (like SNP calling), why is it suggested to simply mark them instead of removing them?

3. When calculating coverage, should we count duplicate reads as well (original BAM), or use the final BAM file with duplicates removed?

Thank you.
a_mt
06-27-2014, 10:14 AM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

If you had low library complexity due to insufficient DNA, overamplification, contamination, or highly biased capture, a lot of duplicates will be present. It sounds like there were problems with your library prep and maybe it should be redone; that level is much higher than I'd expect. But first, just run pileup and see if there is enough coverage for whatever you're doing. That depends on the fraction of the target area covered to at least X depth, rather than on the average coverage.
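To put numbers on the two points above, here is a minimal Python sketch. It computes the duplicate fraction from the read counts quoted in the first post, and shows on made-up data why average coverage can look fine while the fraction covered to at least X depth is poor. The function names and the toy `depths` list are illustrative assumptions, not output from pileup or any other tool.

```python
def duplicate_fraction(total_reads, reads_after_dedup):
    """Fraction of reads that were removed as duplicates."""
    return 1 - reads_after_dedup / total_reads


def mean_depth(depths):
    """Average depth over all positions."""
    return sum(depths) / len(depths) if depths else 0.0


def breadth_at_depth(depths, min_depth):
    """Fraction of positions covered to at least min_depth."""
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / len(depths)


# Read counts quoted in the first post (flagstat total vs. after Picard).
frac = duplicate_fraction(173_460_757, 14_651_238)
print(f"duplicate fraction: {frac:.1%}")

# Hypothetical per-position depths for a tiny 5-base target.
depths = [0, 0, 5, 40, 55]
print("mean depth:", mean_depth(depths))
print("fraction at >= 20X:", breadth_at_depth(depths, 20))
```

In the toy target the mean depth is 20X, yet only 2 of 5 positions actually reach 20X, which is why the breadth figure is the one to check.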

PCR duplicates should be removed before calling variations. But I would suggest removing only exact duplicate reads, rather than anything mapping to the same location even if they have some different base calls. And just to clarify, are these paired reads that you're removing based on both reads mapping to the same location?
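The difference between the two definitions of "duplicate" can be sketched in a few lines of Python. This is a toy illustration only: the tuple layout `(chrom, pos, strand, seq)` and both function names are assumptions made for the example, and it is not how Picard or any other specific tool decides which reads are duplicates.

```python
def dedup_by_position(reads):
    """Keep the first read seen at each (chrom, pos, strand)."""
    seen = set()
    kept = []
    for chrom, pos, strand, seq in reads:
        key = (chrom, pos, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, pos, strand, seq))
    return kept


def dedup_exact(reads):
    """Keep the first read seen for each (chrom, pos, strand, sequence)."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


# Hypothetical reads: three start at chr1:100, one has a different base call.
reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # exact duplicate of the first read
    ("chr1", 100, "+", "ACGA"),  # same location, different last base
    ("chr1", 200, "-", "TTTT"),
]

print(len(dedup_by_position(reads)))  # position-only definition
print(len(dedup_exact(reads)))        # exact-duplicate definition
```

The position-only version discards the read with the differing base call; the exact version keeps it.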

Removing duplicates rather than marking them is more efficient, as downstream programs don't need to process as much data. But you can use marked duplicates to generate a consensus if you want, when reads are low quality.
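For question 2, it may help to see what "marking" means at the record level. In the SAM format, bit 0x400 of the FLAG field means "PCR or optical duplicate". A minimal sketch, assuming reads are represented as hypothetical `(name, flag)` tuples rather than parsed from a real BAM file:

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def is_duplicate(flag):
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def mark_duplicate(flag):
    """Set the duplicate bit; the record itself is kept."""
    return flag | DUPLICATE_FLAG


def unmark_duplicate(flag):
    """Clear the duplicate bit, undoing a previous marking."""
    return flag & ~DUPLICATE_FLAG


def remove_duplicates(records):
    """Drop marked records entirely; this cannot be undone afterwards."""
    return [(name, flag) for name, flag in records if not is_duplicate(flag)]


# Hypothetical records: (read name, FLAG).
records = [("read1", 99), ("read2", mark_duplicate(99)), ("read3", 147)]

print([flag for _, flag in records])
print(remove_duplicates(records))
```

Marking only flips a bit, so the reads stay available (for a consensus, or for re-running with different settings), while removal shrinks the file but discards that option.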

I use the unique coverage when calling variations as it's more relevant.
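As a rough illustration of total versus unique coverage, the sketch below computes per-position depth twice: once counting every read, and once skipping reads whose duplicate bit (0x400) is set. The `(start, end, flag)` tuples use 0-based, half-open intervals and are made-up data; the function name is an assumption for this example.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def depth_per_position(reads, region_length, include_duplicates):
    """Per-position depth over [0, region_length) from (start, end, flag) reads."""
    depth = [0] * region_length
    for start, end, flag in reads:
        if not include_duplicates and flag & DUPLICATE_FLAG:
            continue  # skip reads marked as duplicates
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth


# Hypothetical reads: one original, two marked duplicates of it, one other read.
reads = [
    (0, 4, 0),
    (0, 4, DUPLICATE_FLAG),
    (0, 4, DUPLICATE_FLAG),
    (2, 6, 0),
]

print("total: ", depth_per_position(reads, 6, include_duplicates=True))
print("unique:", depth_per_position(reads, 6, include_duplicates=False))
```

The total depth is inflated at the positions where the duplicates pile up, while the unique depth reflects the independent observations available for variant calling.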

coverage, markduplicates, pcr duplicates

All times are GMT -8.