Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| duplicate reads removal | vasvale | Bioinformatics | 19 | 01-08-2015 12:59 AM |
| Trinity-Duplicate removal | reema | Bioinformatics | 2 | 02-27-2014 01:49 AM |
| Duplicate removal without alignment to reference genome | curious.genome | Illumina/Solexa | 8 | 10-24-2013 10:41 PM |
| I need basic help with basic questions re: analysis | rd69 | General | 3 | 02-16-2012 03:11 PM |
| threshold for duplicate removal? | mard | Bioinformatics | 2 | 03-21-2010 03:45 PM |
#1
Member
Location: C:/Program files/Google/Chrome
Join Date: Jul 2012
Posts: 34
Hi all,
I am using the GATK pipeline for pre-processing BAM files after alignment with bwa mem. According to samtools flagstat, the original BAM after alignment has 173,460,757 reads (this is deep-sequencing exome data captured with Agilent SureSelect 50 Mb). But after removing duplicates with Picard, I am left with only 14,651,238 reads! That's a mere ~20X coverage.

1. Is it normal in exome-seq to find such a huge proportion of duplicates? Some threads on other forums say it is not wise to remove duplicates from deep-sequencing data. Can anyone suggest how to proceed in such a scenario?

2. What is the difference between marking duplicates and removing duplicates? I know marking sets a flag instead of completely removing the read. But if the duplicate-marked reads are not used in any downstream steps (like SNP calling), why is marking them recommended over removing them?

3. When calculating coverage, should I count duplicate reads as well (the original BAM), or use the final BAM with duplicates removed?

Thank you.
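For reference, before/after read counts like those above can be obtained along these lines (the BAM filenames here are placeholders):

```bash
# Read counts before and after Picard duplicate removal
samtools flagstat aligned.bam    # original alignment: 173,460,757 reads
samtools flagstat dedup.bam      # after duplicate removal: 14,651,238 reads

# If duplicates are only *marked*, count the non-duplicate reads directly
# (1024 = 0x400, the SAM "PCR or optical duplicate" flag)
samtools view -c -F 1024 marked.bam
```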
#2
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
If the library has low complexity, due to insufficient input DNA, overamplification, contamination, or highly biased capture, a lot of duplicates will be present. That level is much higher than I'd expect, so it sounds like there were problems with your library prep, and maybe it should be redone. But first, just run a pileup and see whether there is enough coverage for whatever you're doing; that depends on the fraction of the target covered to at least some depth X, rather than on the average coverage.
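As a sketch, the fraction of target bases covered to at least 20X (an arbitrary threshold; targets.bed and dedup.bam are placeholders for the SureSelect capture intervals and the deduplicated BAM) can be computed like this:

```bash
# Fraction of capture-target bases covered to >= 20X
# (-a reports every target position, including those with zero depth)
samtools depth -a -b targets.bed dedup.bam \
    | awk '$3 >= 20 {n++} END {print n / NR}'
```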
PCR duplicates should be removed before calling variants. But I would suggest removing only exact duplicate reads, rather than everything mapping to the same location even when some base calls differ. And just to clarify: are these paired reads that you are removing based on both reads mapping to the same location?

Removing duplicates rather than just marking them is more efficient, since downstream programs don't have to process as much data. But if you keep the duplicates marked, you can still use them to generate a consensus when read quality is low.

For coverage, I use the unique (deduplicated) coverage when calling variants, as it's more relevant.
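To illustrate the mark-versus-remove distinction: with Picard's MarkDuplicates, a single option controls whether duplicates are kept (flagged) or dropped entirely (the jar path and filenames are placeholders):

```bash
# Mark only: duplicates stay in the BAM with flag 0x400 set, and
# duplicate-aware callers such as GATK ignore them by default
java -jar picard.jar MarkDuplicates \
    I=aligned.bam O=marked.bam M=dup_metrics.txt

# Remove: duplicates are dropped from the output BAM entirely
java -jar picard.jar MarkDuplicates \
    I=aligned.bam O=dedup.bam M=dup_metrics.txt \
    REMOVE_DUPLICATES=true
```

Either way, the metrics file reports the duplication rate, which is worth checking against the flagstat numbers.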
Tags: coverage, markduplicates, pcr duplicates