SEQanswers

06-12-2010, 11:12 AM   #1
cliff
Member
Location: USA
Join Date: Oct 2009
Posts: 41
How to use Picard's MarkDuplicates

I just tried using Picard to remove PCR duplicates, with test_sorted.bam (produced by samtools sort) as the input file. The following command

java -jar MarkDuplicates.jar test_sorted.bam test_rmdup.bam

gave me an error

ERROR: Invalid argument 'test_sorted.bam'.

Does anybody know what I did wrong?

Thanks for all your help in advance.

06-12-2010, 05:25 PM   #2
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

Try running it without any arguments to see how to specify the input and output files. The command syntax is different from samtools.

06-12-2010, 07:15 PM   #3
cliff
Member
Location: USA
Join Date: Oct 2009
Posts: 41

I tried again

java -Xmx2g -jar ~/picard-tools-1.21/MarkDuplicates.jar INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true

And I got this error:

[Sat Jun 12 22:11:22 EDT 2010] net.sf.picard.sam.MarkDuplicates INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 TMP_DIR=/tmp/cliff VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
INFO 2010-06-12 22:11:22 MarkDuplicates Start of doWork freeMemory: 31062256; totalMemory: 31588352; maxMemory: 1908932608
INFO 2010-06-12 22:11:22 MarkDuplicates Reading input file and constructing read end information.
INFO 2010-06-12 22:11:22 MarkDuplicates Will retain up to 7575129 data points before spilling to disk.
[Sat Jun 12 22:11:23 EDT 2010] net.sf.picard.sam.MarkDuplicates done.
Runtime.totalMemory()=152829952
Exception in thread "main" net.sf.picard.PicardException: test_sorted.bam is not coordinate sorted.
at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:250)
at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:112)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:96)


It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

where did I do wrong?..

06-12-2010, 10:03 PM   #4
kmcarr
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by cliff
It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

where did I do wrong?..
Nowhere, this is samtools' fault. The SAM specification lists a header (HD) tag for sort order (SO). The three permissible values for this tag are "unsorted", "coordinate", indicating that the entries have been sorted by chromosome and start position, and "queryname", meaning the file is sorted by the read IDs. When you sort the file with samtools it does not update the SO tag to reflect the fact the file has been sorted. According to the author of samtools, the SAM specification does not require this so it is not a bug (see this thread). Perhaps not but it's damned annoying.

You can view the header information for your bam file with the command
Code:
samtools view -H test_sorted.bam
Picard reads the SO tag to determine whether or not the file is sorted. This is obviously much easier and more efficient than actually checking every line of the file to determine whether or not it has been sorted.

Before you can use Picard to remove duplicates you will have to fix the SO tag. Fortunately Picard has a command to do this, ReplaceSamHeader. Alternatively you could use Picard's SortSam instead of samtools sort. (For the record, I don't know for sure whether Picard SortSam properly updates the SO tag.)
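
For anyone who wants to check or patch the header by hand, here is a rough sketch of the idea in plain shell. It only works on the header text. The function names sam_sort_order and set_so_coordinate are made up for this example and are not part of samtools or Picard, and the VN:1.0 value written when no @HD line exists is an assumption. It is only appropriate if the records really are coordinate sorted.

```shell
#!/bin/sh
# Sketch only: inspect and patch the SO field of a SAM header given as text.
# Function names are hypothetical helpers, not samtools or Picard commands.

# Print the SO value from the @HD line on stdin, or "missing" if absent.
sam_sort_order() {
    awk -F '\t' '
        /^@HD/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1; exit }
        }
        END { if (!found) print "missing" }
    '
}

# Rewrite the header on stdin so that @HD declares SO:coordinate.
# Assumption: the records really are coordinate sorted already.
set_so_coordinate() {
    awk -F '\t' -v OFS='\t' '
        /^@HD/ {
            seen = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { $i = "SO:coordinate"; seen = 1 }
            if (!seen) $0 = $0 OFS "SO:coordinate"
            print
            next
        }
        # No @HD on the first line: add one (VN:1.0 is an assumed version).
        NR == 1 { print "@HD" OFS "VN:1.0" OFS "SO:coordinate" }
        { print }
    '
}

# Possible use with a real file (not run here):
#   samtools view -H test_sorted.bam | sam_sort_order
#   samtools view -H test_sorted.bam | set_so_coordinate > header.sam
#   samtools reheader header.sam test_sorted.bam > test_fixed.bam
```

The last commented line relies on samtools reheader, which may not be present in older samtools builds; Picard's ReplaceSamHeader mentioned above is the alternative.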

06-12-2010, 10:35 PM   #5
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

You can also add the "AS=true" option (short for ASSUME_SORTED=true) to make Picard assume that the input is sorted.

06-23-2010, 06:38 AM   #6
bosTau2
Member
Location: Antwerp, BE or Cambridge, UK
Join Date: Nov 2008
Posts: 12

Thanks. I had exactly the same problem...

07-20-2010, 11:54 AM   #7
mmuratet
Member
Location: Huntsville AL
Join Date: Jul 2008
Posts: 13
Definition of 'coordinate sorted'?

Greetings
I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When the MarkDuplicates method says it wants 'coordinate sorted' data are they referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
Thanks
Mike

07-20-2010, 01:17 PM   #8
Lee Sam
Member
Location: Ann Arbor, MI
Join Date: Oct 2008
Posts: 57

The simple solution is to sort the file with samtools first. I've been using the Picard tool MergeSamFiles.jar to both merge and sort, because I typically have multiple lanes of data for each sample.

Mike, I don't think it will work without being aligned because I believe that Picard works by looking at the mappings.

07-20-2010, 02:41 PM   #9
kmcarr
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by mmuratet
Greetings
I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When the MarkDuplicates method says it wants 'coordinate sorted' data are they referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
Thanks
Mike
Coordinate sorted means sorted by their genomic alignment coordinates. Picard identifies duplicates as those reads mapping to the identical coordinates on the genome; obviously this task is made immensely easier if the alignments are already sorted.

Yes, you could find duplicates without reference to a genome. You would have to perform an all vs. all search, which would require a huge amount of time and RAM when you are talking about tens or hundreds of millions of reads.
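
To illustrate why sorting helps, here is a deliberately simplified sketch. It is not Picard's algorithm, which looks at more than this (for example both ends of a pair). It just counts mapped records that share reference name, start position and strand with an earlier record. Because the input is assumed to be coordinate sorted, only the counts for the current position need to be kept in memory. The function name count_same_position_reads is made up for this example.

```shell
#!/bin/sh
# Simplified illustration, not Picard's algorithm: count mapped records that
# share reference name, start position and strand with an earlier record.
# Assumes coordinate-sorted SAM text on stdin; the function name is made up.
count_same_position_reads() {
    awk -F '\t' '
        /^@/ { next }                      # header lines
        int($2 / 4) % 2 == 1 { next }      # FLAG bit 0x4: read is unmapped
        {
            pos = $3 SUBSEP $4
            if (pos != current) {          # new position: reset both strands
                current = pos
                seen[0] = 0
                seen[1] = 0
            }
            strand = int($2 / 16) % 2      # FLAG bit 0x10: reverse strand
            if (seen[strand]++ > 0) extra++
        }
        END { print extra + 0 }
    '
}

# Possible use with a real file (not run here):
#   samtools view -h test_sorted.bam | count_same_position_reads
```

On unsorted input the same approach would need to remember every position seen so far, which is where the time and memory cost comes from.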

08-12-2010, 05:32 AM   #10
thomasvangurp
Member
Location: Wageningen
Join Date: Jan 2009
Posts: 11

I would also like to use Picard duplicate removal. However, I ran into some trouble using a SAM file output by CLC bio Genomics Workbench. Does anyone have an idea how to fix this issue?

Code:
root@thomasg-desktop:/home/thomasg/Downloads/\tmp/picard-tools-1.27# java -jar MergeSamFiles.jar I=/home/thomasg/RF_7.fastq\ trimmed\ \(paired\)\ mapping\ \(11205\ references\).sam SO=coordinate AS=false O=/home/thomasg/out.sam
[Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles OUTPUT=/home/thomasg/out.sam SORT_ORDER=coordinate ASSUME_SORTED=false    MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false TMP_DIR=/tmp/root VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
INFO	2010-08-12 14:30:53	MergeSamFiles	Sorting input files using temp directory /tmp/root
[Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles done.
Runtime.totalMemory()=379322368
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Paired read should be marked as first of pair or second of pair.; File /home/thomasg/RF_7.fastq trimmed (paired) mapping (11205 references).sam; Line 11208
Line: RF_43280	25	Contig_1	1	60	50M	*	0	0	ACAGCGACTCAACCAAAGGAATCCTATATAGAAATGCTATTAGGAATCCC	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	NH:i:1
	at net.sf.samtools.SAMTextReader.reportErrorParsingLine(SAMTextReader.java:220)
	at net.sf.samtools.SAMTextReader.access$500(SAMTextReader.java:40)
	at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:424)
	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268)
	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240)
	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:609)
	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:587)
	at net.sf.picard.util.PeekableIterator.advance(PeekableIterator.java:71)
	at net.sf.picard.util.PeekableIterator.<init>(PeekableIterator.java:41)
	at net.sf.picard.sam.ComparableSamRecordIterator.<init>(ComparableSamRecordIterator.java:51)
	at net.sf.picard.sam.MergingSamRecordIterator.addIterator(MergingSamRecordIterator.java:93)
	at net.sf.picard.sam.MergingSamRecordIterator.startIterationIfRequired(MergingSamRecordIterator.java:102)
	at net.sf.picard.sam.MergingSamRecordIterator.hasNext(MergingSamRecordIterator.java:117)
	at net.sf.picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:190)
	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
	at net.sf.picard.sam.MergeSamFiles.main(MergeSamFiles.java:83)

08-12-2010, 07:20 AM   #11
mmuratet
Member
Location: Huntsville AL
Join Date: Jul 2008
Posts: 13
Picard duplicate removal problem

I had a similar problem with sam files derived from Illumina output. The problem was the mate IDs that Illumina uses, i.e., index:pairN:filterFlag. I believe the tools expect pair IDs in the form /1 and /2. Check the output from the workbench to see how they identify pairs.
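
It may also help to look at the FLAG column of the record quoted in the error message in post #10, which is 25. Below is a small sketch that decodes a SAM FLAG value into the bit names from the SAM specification. The function name decode_sam_flag and the short labels it prints are made up for this example.

```shell
#!/bin/sh
# Print the names of the bits set in a decimal SAM FLAG value, one per line.
# decode_sam_flag and the label spellings are hypothetical, for illustration.
decode_sam_flag() {
    flag=$1
    for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                 16:reverse_strand 32:mate_reverse_strand \
                 64:first_in_pair 128:second_in_pair \
                 256:not_primary 512:fails_qc 1024:duplicate
    do
        bit=${entry%%:*}     # numeric value before the colon
        name=${entry#*:}     # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "$name"
        fi
    done
}

decode_sam_flag 25
```

For 25 (16 + 8 + 1) this prints paired, mate_unmapped and reverse_strand. Neither first_in_pair (64) nor second_in_pair (128) is set, which matches the "Paired read should be marked as first of pair or second of pair" message.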

12-22-2010, 05:47 AM   #12
scientifica
Junior Member
Location: The Netherlands
Join Date: Jan 2010
Posts: 3

Dear all,

For my sequencing project I would also like to remove duplicates. Did any of you already work with the CLC Assembly Cell to remove them?
I have no idea where to start.

01-26-2015, 11:56 PM   #13
shanlan.mo
Junior Member
Location: Beijing.China
Join Date: Jan 2015
Posts: 1

Quote:
Originally Posted by cliff
It said "test_sorted.bam is not coordinate sorted.", but I got this test_sorted.bam after I used "samtools sort" actually...

where did I do wrong?..
The BAM can be sorted with Picard tools, such as:
java -jar $softwave/SamFormatConverter.jar I=$I/HFHm001_1_Tri.fastq_bismark_bt2_pe.sam o=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam
java -jar $softwave/SortSam.jar I=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam O=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.sorted.bam sort_order=coordinate