SEQanswers

08-16-2010, 08:06 PM   #1
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203
26% duplicates marked by Picard from bowtie alignment

Hi, is 26% duplicates an extraordinarily high number for single-end SureSelect targeted SOLiD reads?
Also, I presume the duplicates are counted as part of the mapped reads as well?

I used Picard's MarkDuplicates to arrive at the rmdup BAM.

## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
LIBRARY                        Unknown Library
UNPAIRED_READS_EXAMINED        61303170
READ_PAIRS_EXAMINED            0
UNMAPPED_READS                 40652492
UNPAIRED_READ_DUPLICATES       26844757
READ_PAIR_DUPLICATES           0
READ_PAIR_OPTICAL_DUPLICATES   0
PERCENT_DUPLICATION            0.437902
ESTIMATED_LIBRARY_SIZE         (not reported)

101955662 in total
0 QC failure
26844757 duplicates
61303170 mapped (60.13%)
0 paired in sequencing
0 read1
0 read2
0 properly paired (nan%)
0 with itself and mate mapped
0 singletons (nan%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)
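
The 26% in the thread title and Picard's PERCENT_DUPLICATION of 0.437902 use different denominators. A minimal sketch of the arithmetic, using only the counts reported above (the variable and function names are illustrative, not part of Picard or samtools):

```python
# Counts copied from the Picard metrics and flagstat output above.
total_reads = 101955662      # "in total" from flagstat
mapped_reads = 61303170      # UNPAIRED_READS_EXAMINED / "mapped"
unmapped_reads = 40652492    # UNMAPPED_READS
duplicates = 26844757        # UNPAIRED_READ_DUPLICATES / "duplicates"


def fraction(part, whole):
    """Return part / whole as a float."""
    return part / whole


# Duplicates relative to every read in the file (mapped + unmapped).
dup_of_total = fraction(duplicates, total_reads)

# Duplicates relative to mapped reads only; this is the ratio that
# agrees with the PERCENT_DUPLICATION value reported above.
dup_of_mapped = fraction(duplicates, mapped_reads)

mapped_of_total = fraction(mapped_reads, total_reads)

print(f"duplicates / total reads:  {dup_of_total:.2%}")
print(f"duplicates / mapped reads: {dup_of_mapped:.2%}")
print(f"mapped / total reads:      {mapped_of_total:.2%}")
```

So roughly 26% of all reads, but roughly 44% of the mapped reads, are flagged as duplicates.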

Last edited by KevinLam; 08-16-2010 at 10:36 PM.
08-17-2010, 11:41 AM   #2
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319

I wouldn't say it's extraordinary, although it is quite high. I've certainly seen higher. Depends a bit on the sample too - if you have coverage that is very high compared to the DNA represented in the sample, you will get many duplicates (you will start to sequence the same things over and over again).
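
To illustrate the point about sequencing the same things over and over, here is a rough sketch of how the expected duplicate fraction grows as the number of reads approaches and exceeds the number of distinct molecules in the library. It assumes every molecule is equally likely to be sampled (a Poisson approximation), which real capture libraries will not satisfy exactly; the function name and the library size of one million are made up for the example.

```python
import math


def expected_duplicate_fraction(n_reads, library_size):
    """Expected fraction of reads that are duplicates when n_reads are
    drawn uniformly at random from library_size distinct molecules.

    Uses the Poisson approximation: the expected number of distinct
    molecules seen is library_size * (1 - exp(-n_reads / library_size)).
    """
    if n_reads <= 0:
        return 0.0
    expected_unique = library_size * (1.0 - math.exp(-n_reads / library_size))
    return 1.0 - expected_unique / n_reads


# Hypothetical library of one million distinct molecules.
library_size = 1_000_000
for ratio in (0.1, 0.5, 1.0, 2.0, 5.0):
    n_reads = int(ratio * library_size)
    dup = expected_duplicate_fraction(n_reads, library_size)
    print(f"reads = {ratio:>3} x library size -> ~{dup:.1%} duplicates")
```

Under these assumptions the duplicate fraction depends only on the ratio of reads to distinct molecules, so a high rate can come from deep sequencing of a low-complexity library without any problem in the PCR step.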
08-17-2010, 07:22 PM   #3
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203

Hi Kopi-o,
From my understanding, PCR duplicates are marked based on exact sequence and the physical proximity of the beads, as inferred from the platform-specific read names.

I can understand it if this is additional coverage due to randomness, but I am concerned that I may need to optimise the emulsion PCR step.
Or should I forget about removing duplicates altogether (since it is not actually marking PCR duplicates, just duplicates)?
08-18-2010, 11:42 AM   #4
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319

I really wouldn't dare to suggest a specific course of action ... it depends on the application you have (standard answer!). You might want to check how many of the duplicates have the exact same sequence (by using Unix sort, for example) and how many just map to the same locations (with sequence differences). That would at least tell you something.
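
One way the suggested check could be sketched in Python rather than with Unix sort: group reads by mapping position and strand, then count how many of the extra reads in each group repeat a sequence already seen versus only sharing the location. The tuple layout, function name, and toy reads below are hypothetical; real input would have to be parsed from the BAM file first.

```python
from collections import defaultdict


def classify_duplicates(reads):
    """Split position duplicates into exact-sequence and position-only.

    reads: iterable of (chrom, pos, strand, seq) tuples.
    The first read at each (chrom, pos, strand) is treated as the
    original; every later read there counts as a duplicate.
    Returns (exact_sequence_duplicates, position_only_duplicates).
    """
    groups = defaultdict(list)
    for chrom, pos, strand, seq in reads:
        groups[(chrom, pos, strand)].append(seq)

    exact = 0
    position_only = 0
    for seqs in groups.values():
        seen = {seqs[0]}
        for seq in seqs[1:]:
            if seq in seen:
                exact += 1          # same location, identical sequence
            else:
                position_only += 1  # same location, sequence differs
                seen.add(seq)
    return exact, position_only


# Toy data, invented purely for illustration.
toy_reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGA"),
    ("chr1", 200, "-", "TTTT"),
    ("chr2", 50, "+", "GGCC"),
    ("chr2", 50, "+", "GGCC"),
]
exact, position_only = classify_duplicates(toy_reads)
print(f"exact-sequence duplicates: {exact}")
print(f"position-only duplicates:  {position_only}")
```

A large share of position-only duplicates with sequence differences would suggest sequencing errors or genuine variation at heavily covered positions rather than identical copies of one molecule.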
08-19-2010, 05:20 PM   #5
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

Probably not ridiculous if this is only a 50 bp SE fragment run, which means that after you remove duplicates you can only get 50x coverage at most. If you apply the birthday problem to this situation to infer the chance that a mapped read covering a given base is unique, you will find it gets extremely discouraging after you achieve 20x unique coverage. Unfortunately, this is a situation where PE runs make a huge difference to the number/percentage of duplicates.
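
A rough sketch of the birthday-problem reasoning, under simplifying assumptions that are mine rather than the poster's: a base can be covered by single-end reads starting at only read_length distinct positions (one strand considered), start positions are equally likely, and two reads count as duplicates whenever they share a start position. The function names are illustrative.

```python
def expected_unique_coverage(raw_depth, read_length=50):
    """Expected number of distinct start positions hit after raw_depth
    reads covering a base, out of read_length possible start positions.
    This is the expected coverage left after duplicate removal."""
    miss = 1.0 - 1.0 / read_length  # chance one read misses a given start
    return read_length * (1.0 - miss ** raw_depth)


def chance_next_read_is_unique(raw_depth, read_length=50):
    """Probability that one more read lands on a start position that
    none of the previous raw_depth reads has used."""
    return (1.0 - 1.0 / read_length) ** raw_depth


for depth in (10, 20, 50, 100, 200):
    unique = expected_unique_coverage(depth)
    p_new = chance_next_read_is_unique(depth)
    print(f"raw {depth:>3}x -> ~{unique:4.1f}x unique, "
          f"next read unique with p = {p_new:.2f}")
```

In this model the unique coverage can never exceed the read length, and each additional read is increasingly likely to be discarded as a duplicate. Paired-end reads multiply the number of distinguishable fragments per base, because the mate position varies with insert size, which is why they reduce the duplicate rate so much.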
All times are GMT -8.

