SEQanswers

03-17-2010, 03:30 AM   #1
bair
Member
Location: London
Join Date: Jan 2010
Posts: 65
MarkDuplicates in Picard

Hello all,

What does the metrics file output by Picard's MarkDuplicates contain?

Can I find out from this file how many reads were marked as duplicates?


Thanks
03-18-2010, 10:40 PM   #2
mard
Member
Location: Melbourne
Join Date: Jan 2010
Posts: 21

Yes, it tells you the number of reads that have been marked as duplicates, as well as the total number of reads examined. But note that reads that Picard marks as duplicates do not necessarily have identical sequences; they just map to the same chromosomal location.
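
To illustrate the point about position versus sequence, here is a toy sketch in Python. It is not Picard's code: the read tuples, the `mark_duplicates` helper, and the rule of keeping the read with the highest summed base quality are all assumptions made for illustration. Picard itself works on read pairs and also takes clipping and library into account.

```python
from collections import defaultdict

# Hypothetical toy reads:
# (name, chromosome, 5' position, strand, sequence, sum of base qualities)
reads = [
    ("r1", "chr1", 1000, "+", "ACGTACGT", 280),
    ("r2", "chr1", 1000, "+", "ACGTACGA", 310),  # same position as r1, one mismatch
    ("r3", "chr1", 1000, "-", "ACGTACGT", 300),  # same position, opposite strand
    ("r4", "chr2", 5000, "+", "TTGACCAA", 290),
]

def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates (illustrative rule only)."""
    groups = defaultdict(list)
    for read in reads:
        name, chrom, pos, strand, _seq, _qual = read
        # Group by mapping position and strand; the sequence is never compared.
        groups[(chrom, pos, strand)].append(read)

    duplicates = set()
    for group in groups.values():
        # Assumption: keep the read with the highest summed base quality.
        best = max(group, key=lambda r: r[5])
        duplicates.update(r[0] for r in group if r is not best)
    return duplicates

print(mark_duplicates(reads))
```

Here r1 is flagged even though its sequence differs from r2 by one base, because both map to the same position on the same strand.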
03-19-2010, 02:36 AM   #3
bair
Member
Location: London
Join Date: Jan 2010
Posts: 65

Quote:
Originally Posted by mard
Yes, it tells you the number of reads that have been marked as duplicates, as well as the total number of reads examined. But note that reads that Picard marks as duplicates do not necessarily have identical sequences; they just map to the same chromosomal location.

Thanks. How do I pick which duplicates to remove? Should I keep the one with the best alignment if they do not have identical sequences?

Here is what I got from Picard:


## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 27221401 548559917 190908169 14563968 58165860 0 0.11642 2400441897

## HISTOGRAM java.lang.Double
BIN VALUE
1.0 1
2.0 1.795707
3.0 2.428856
4.0 2.932657
5.0 3.333535
6.0 3.652516
7.0 3.906332
8.0 4.108295

What is this histogram about?

My original BAM file has 657624702 read pairs, so 2 * 657624702 = 1315249404 reads in total. After removing duplicates, the BAM file has 1184353716 reads in total. So presumably
2 * 657624702 - 1184353716 = 130895688 reads were removed.

I couldn't get this number from the Picard metrics (M) file. Any help?

Thanks
12-23-2010, 12:00 PM   #4
psm3426
Junior Member
Location: Boston, MA
Join Date: Dec 2010
Posts: 1

The histogram is explained in one of the FAQs on their wiki:
http://sourceforge.net/apps/mediawik...kDuplicates.3F
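
As far as I understand it, the histogram is a return-on-investment estimate: BIN is a multiple of the current amount of sequencing, and VALUE is the expected multiple of unique read pairs you would get at that depth. The sketch below is my own reconstruction, not Picard's code. It assumes a saturation model in which the expected number of unique pairs is L * (1 - exp(-x * N / L)), where L is ESTIMATED_LIBRARY_SIZE and N is READ_PAIRS_EXAMINED. The function name `roi` is made up for this example.

```python
import math

# Values copied from the metrics row posted above.
read_pairs_examined = 548559917
read_pair_duplicates = 58165860
estimated_library_size = 2400441897
# READ_PAIR_OPTICAL_DUPLICATES is 0 in this run, so it is ignored here.

def roi(multiple):
    """Expected unique pairs at `multiple` x current sequencing, relative to now.

    Assumed model (not taken from Picard's source):
    unique(x) = L * (1 - exp(-x * N / L))
    """
    unique_now = read_pairs_examined - read_pair_duplicates
    expected_unique = estimated_library_size * (
        1.0 - math.exp(-multiple * read_pairs_examined / estimated_library_size)
    )
    return expected_unique / unique_now

for x in range(1, 9):
    print(f"{x:.1f} {roi(x):.6f}")
```

Under that assumption the printed values track the VALUE column above closely, for example about 1.7957 at 2x. Read that way, doubling the sequencing of this library would give roughly 1.8 times the unique pairs rather than 2 times.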

The reason you couldn't get that number is that READ_PAIR_DUPLICATES counts duplicate pairs, not individual reads, so each one stands for two reads. In your case, 2 * 58165860 (the value under READ_PAIR_DUPLICATES) + 14563968 (the value under UNPAIRED_READ_DUPLICATES) = 130895688, which is the number of duplicates you were missing. =)
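
A quick way to check the accounting against the metrics row posted above is the small Python sketch below. The variable names are mine. It assumes that pair-level columns count pairs (two reads each) and that PERCENT_DUPLICATION is duplicate reads divided by examined mapped reads.

```python
# Values copied from the metrics row posted above.
unpaired_reads_examined = 27221401
read_pairs_examined = 548559917
unmapped_reads = 190908169
unpaired_read_duplicates = 14563968
read_pair_duplicates = 58165860

# Pair-level columns count pairs, so multiply by 2 to get reads.
total_reads = unpaired_reads_examined + 2 * read_pairs_examined + unmapped_reads
duplicate_reads = unpaired_read_duplicates + 2 * read_pair_duplicates
remaining_reads = total_reads - duplicate_reads

# Fraction of examined (mapped) reads that were marked as duplicates.
percent_duplication = duplicate_reads / (
    unpaired_reads_examined + 2 * read_pairs_examined
)

print("total reads:      ", total_reads)
print("duplicate reads:  ", duplicate_reads)
print("reads after dedup:", remaining_reads)
print("percent dup:      ", round(percent_duplication, 5))
```

The total comes out to 2 * 657624702, the duplicate count to 130895688, and the remainder to 1184353716, matching the numbers in post #3. The duplication fraction rounds to the reported 0.11642.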