#1 |
Senior Member
Location: SEA Join Date: Nov 2009
Posts: 203
Hi, is 26% duplicates an extraordinarily high number for single-end SureSelect targeted SOLiD reads?
Also, I presume the duplicates are counted among the mapped reads? I used Picard's MarkDuplicates to arrive at the rmdup BAM.

```
## METRICS CLASS        net.sf.picard.sam.DuplicationMetrics
LIBRARY          UNPAIRED_READS_EXAMINED  READ_PAIRS_EXAMINED  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES  READ_PAIR_DUPLICATES  READ_PAIR_OPTICAL_DUPLICATES  PERCENT_DUPLICATION  ESTIMATED_LIBRARY_SIZE
Unknown Library  61303170                 0                    40652492        26844757                  0                     0                             0.437902
```

samtools flagstat:

```
101955662 in total
0 QC failure
26844757 duplicates
61303170 mapped (60.13%)
0 paired in sequencing
0 read1
0 read2
0 properly paired (nan%)
0 with itself and mate mapped
0 singletons (nan%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)
```
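(The 26% and the 43.8% in the metrics are both consistent with these numbers; they just use different denominators. A quick check, using only the figures quoted above, the arithmetic is mine:)

```python
# Reconcile the two duplicate percentages from the Picard/flagstat output.
# All three counts are taken verbatim from the metrics above.
unpaired_examined = 61_303_170   # mapped reads Picard examined
unpaired_dups     = 26_844_757   # reads flagged as duplicates
total_reads       = 101_955_662  # all reads, mapped + unmapped

# Picard's PERCENT_DUPLICATION divides by the *examined* (mapped) reads:
picard_pct = unpaired_dups / unpaired_examined
print(f"{picard_pct:.6f}")   # ~0.437902, matching the metrics line

# Dividing by *all* reads (including the 40M unmapped) gives the ~26% figure:
overall_pct = unpaired_dups / total_reads
print(f"{overall_pct:.3f}")  # ~0.263
```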
__________________
http://kevin-gattaca.blogspot.com/
Last edited by KevinLam; 08-16-2010 at 10:36 PM.
#2 |
Senior Member
Location: Stockholm, Sweden Join Date: Feb 2008
Posts: 319
I wouldn't say it's extraordinary, although it is quite high. I've certainly seen higher. It depends a bit on the sample too: if your coverage is very high relative to the amount of distinct DNA represented in the sample, you will get many duplicates (you start sequencing the same fragments over and over again).
#3 |
Senior Member
Location: SEA Join Date: Nov 2009
Posts: 203
Hi kopi-o,
From my understanding, PCR duplicates are marked by exact sequence, and optical duplicates by the physical proximity of the beads as encoded in the platform-specific read names. I could accept the extra duplicates if they were just a consequence of random oversampling, but I am concerned that I may need to optimise the emulsion PCR step. Or should I forget about removing duplicates altogether, since the tool is really marking positional duplicates rather than true PCR duplicates?
__________________
http://kevin-gattaca.blogspot.com/
#4 |
Senior Member
Location: Stockholm, Sweden Join Date: Feb 2008
Posts: 319
I really wouldn't dare to suggest a specific course of action ... it depends on the application you have (standard answer!). You might want to check how many of the duplicates have the exact same sequence (by using Unix sort, for example) and how many just map to the same locations (with sequence differences). That would at least tell you something.
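(One minimal way to do the exact-sequence check suggested above, sketched in Python rather than Unix sort; `reads.fastq` is a placeholder path, and this assumes a plain uncompressed 4-line-per-record FASTQ:)

```python
# Count reads whose sequence is an exact copy of another read's sequence,
# independent of mapping position. Comparing this fraction to Picard's
# position-based PERCENT_DUPLICATION shows how many "duplicates" actually
# differ in sequence.
from collections import Counter

def exact_duplicate_fraction(fastq_path):
    counts = Counter()
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                    # sequence lines only
                counts[line.rstrip("\n")] += 1
    total = sum(counts.values())
    # every copy beyond the first of a given sequence is an exact duplicate
    dups = total - len(counts)
    return dups / total if total else 0.0

# fraction = exact_duplicate_fraction("reads.fastq")  # placeholder filename
```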
#5 |
Senior Member
Location: Phoenix, AZ Join Date: Mar 2010
Posts: 279
Probably not ridiculous if this is only a 50 bp SE frag run, which after you remove duplicates means you can get at most 50x coverage. If you apply the birthday problem to this kind of probability situation, to infer the chance that a mapped read encompassing a given base is unique, you will find it gets extremely discouraging after you achieve 20x unique coverage. Unfortunately, this is a situation where PE runs make a huge difference to the number/percentage of duplicates.
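(The saturation argument above can be sketched with the classic occupancy/birthday calculation: for single-end reads, two reads sharing a 5' start position are called duplicates, so a target of G bases offers only about G distinct forward-strand start sites. The target size and read counts below are illustrative, not from this thread:)

```python
# Expected unique coverage after N single-end reads land uniformly on a
# target. Distinct start sites follow the standard occupancy formula
# E[unique] = S * (1 - (1 - 1/S)^N), where S is the number of start sites.
def expected_unique_coverage(n_reads, target_bp, read_len):
    sites = target_bp                        # ~one start site per base (one strand)
    unique = sites * (1 - (1 - 1 / sites) ** n_reads)
    return unique * read_len / target_bp     # coverage from unique reads only

# e.g. a hypothetical 5 Mb SureSelect target with 50 bp SE reads:
for n in (1_000_000, 5_000_000, 20_000_000):
    print(n, round(expected_unique_coverage(n, 5_000_000, 50), 1))
```

Running this shows the diminishing returns: unique coverage climbs quickly at first, then asymptotes toward the 50x ceiling no matter how many more reads you sequence, which is why everything past ~20x unique coverage comes so expensively.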