  • cliff
    Member
    • Oct 2009
    • 41

    How to use Picard's MarkDuplicates

    I just tried using Picard to remove PCR duplicates, with test_sorted.bam (obtained from samtools sort) as the input file. The following command

    java -jar MarkDuplicates.jar test_sorted.bam test_rmdup.bam

    gave me an error

    ERROR: Invalid argument 'test_sorted.bam'.

    Does anybody know what I did wrong?

    Thanks for all your help in advance.
  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    #2
    Try running it without any arguments to see how to specify the input and output files. The command syntax is different from samtools.


    • cliff
      Member
      • Oct 2009
      • 41

      #3
      I tried again

      java -Xmx2g -jar ~/picard-tools-1.21/MarkDuplicates.jar INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true

      And I got this error:

      [Sat Jun 12 22:11:22 EDT 2010] net.sf.picard.sam.MarkDuplicates INPUT=test_sorted.bam OUTPUT=test_rmdup.bam METRICS_FILE=PCR_duplicates REMOVE_DUPLICATES=true ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 TMP_DIR=/tmp/cliff VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
      INFO 2010-06-12 22:11:22 MarkDuplicates Start of doWork freeMemory: 31062256; totalMemory: 31588352; maxMemory: 1908932608
      INFO 2010-06-12 22:11:22 MarkDuplicates Reading input file and constructing read end information.
      INFO 2010-06-12 22:11:22 MarkDuplicates Will retain up to 7575129 data points before spilling to disk.
      [Sat Jun 12 22:11:23 EDT 2010] net.sf.picard.sam.MarkDuplicates done.
      Runtime.totalMemory()=152829952
      Exception in thread "main" net.sf.picard.PicardException: test_sorted.bam is not coordinate sorted.
      at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:250)
      at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:112)
      at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
      at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:96)


      It says "test_sorted.bam is not coordinate sorted.", but I actually got this test_sorted.bam from "samtools sort"...

      What did I do wrong?


      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Originally posted by cliff
        It says "test_sorted.bam is not coordinate sorted.", but I actually got this test_sorted.bam from "samtools sort"...

        What did I do wrong?
        Nowhere; this is samtools' fault. The SAM specification defines a sort order (SO) tag on the header (HD) line. The three permissible values for this tag are "unsorted"; "coordinate", indicating that the entries have been sorted by chromosome and start position; and "queryname", meaning the file is sorted by read ID. When you sort the file with samtools, it does not update the SO tag to reflect the fact that the file has been sorted. According to the author of samtools, the SAM specification does not require this, so it is not a bug (see this thread). Perhaps not, but it's damned annoying.

        You can view the header information for your BAM file with the command
        Code:
        samtools view -H test_sorted.bam
        Picard reads the SO tag to determine whether or not the file is sorted. This is obviously much easier and more efficient than actually checking every line of the file to determine whether or not it has been sorted.

        Before you can use Picard to remove duplicates you will have to fix the SO tag. Fortunately, Picard has a command to do this, ReplaceSamHeader. Alternatively, you could use Picard's SortSam instead of samtools sort. (For the record, I don't know for sure whether Picard SortSam properly updates the SO tag.)
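
To make the SO check concrete, here is a minimal Python sketch of reading and rewriting the sort-order field in SAM header text (the kind of text that "samtools view -H" prints). The function names get_sort_order and set_sort_order are hypothetical, the example header is made up, and this is not how Picard itself is implemented; it only illustrates what the SO tag on the @HD line looks like.

```python
def get_sort_order(header_text):
    """Return the SO value from the @HD line, or None if it is absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def set_sort_order(header_text, sort_order="coordinate"):
    """Return header text whose @HD line declares the given sort order."""
    lines = header_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("@HD"):
            # Drop any existing SO field, then append the new one.
            fields = [f for f in line.split("\t") if not f.startswith("SO:")]
            fields.append("SO:" + sort_order)
            lines[i] = "\t".join(fields)
            break
    else:
        # No @HD line at all: add one at the top (the VN value is an assumption).
        lines.insert(0, "@HD\tVN:1.0\tSO:" + sort_order)
    return "\n".join(lines) + "\n"


# Made-up header for illustration only.
example_header = "@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n"
fixed_header = set_sort_order(example_header)
print(get_sort_order(example_header))
print(get_sort_order(fixed_header))
```

The rewritten header text could then be saved to a file and handed to a header-replacement tool such as the ReplaceSamHeader command mentioned above. Note that this only changes the declared sort order; it does not sort anything, so it is only appropriate when the records really are coordinate sorted.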


        • nilshomer
          Nils Homer
          • Nov 2008
          • 1283

          #5
          You can also add the "AS=true" (ASSUME_SORTED=true) option to make Picard assume that the input is sorted.


          • bosTau2
            Member
            • Nov 2008
            • 12

            #6
            Thanks. I had exactly the same problem...


            • mmuratet
              Member
              • Jul 2008
              • 13

              #7
              Definition of 'coordinate sorted'?

              Greetings
              I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When the MarkDuplicates method says it wants 'coordinate sorted' data are they referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
              Thanks
              Mike


              • Lee Sam
                Member
                • Oct 2008
                • 57

                #8
                The simple solution is to use samtools to sort the file first. I've been using the Picard tool MergeSamFiles.jar to both merge and sort, because I typically have multiple lanes of data for each sample.

                Mike, I don't think it will work without being aligned because I believe that Picard works by looking at the mappings.


                • kmcarr
                  Senior Member
                  • May 2008
                  • 1181

                  #9
                  Originally posted by mmuratet
                  Greetings
                  I'm having the same problem. I used the command line argument to assume it was sorted but I'm getting screwy results. When the MarkDuplicates method says it wants 'coordinate sorted' data are they referring to tile-x-y or a genomic alignment? It seems one could find duplicates without reference to a genome. If it's tile-x-y then is it lexical or numeric?
                  Thanks
                  Mike
                  Coordinate sorted means sorted by their genomic alignment coordinates. Picard identifies duplicates as those reads mapping to the identical coordinates on the genome; obviously this task is made immensely easier if the alignments are already sorted.

                  Yes, you could find duplicates without reference to a genome. You would have to perform an all vs. all search, which would require a huge amount of time and RAM when you are talking about tens or hundreds of millions of reads.
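
A minimal Python sketch of why sorting helps, assuming each alignment is reduced to a hypothetical (read_name, chromosome, start, strand) tuple: once the records are sorted by coordinate, reads with identical coordinates sit next to each other, so a single linear pass finds them. This is a simplification for illustration, not Picard's actual algorithm, and the records are made up.

```python
from itertools import groupby


def find_duplicate_groups(alignments):
    """Group read names that share chromosome, start and strand.

    alignments: iterable of (read_name, chromosome, start, strand) tuples.
    Returns a list of lists of read names, one per position holding > 1 read.
    """
    def key(record):
        return record[1:]  # (chromosome, start, strand)

    ordered = sorted(alignments, key=key)  # the "coordinate sort" step
    groups = []
    # After sorting, identical coordinates are adjacent, so groupby suffices.
    for _, records in groupby(ordered, key=key):
        names = [r[0] for r in records]
        if len(names) > 1:
            groups.append(names)
    return groups


# Made-up records for illustration only.
reads = [
    ("read1", "chr1", 100, "+"),
    ("read2", "chr2", 50, "-"),
    ("read3", "chr1", 100, "+"),
    ("read4", "chr1", 100, "-"),
    ("read5", "chr2", 50, "-"),
]
print(find_duplicate_groups(reads))
# prints [['read1', 'read3'], ['read2', 'read5']]
```

Without the alignment coordinates there is no such sort key, which is why the genome-free approach falls back to comparing reads against each other.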


                  • thomasvangurp
                    Member
                    • Jan 2009
                    • 12

                    #10
                    I would also like to use Picard for duplicate removal. However, I ran into some trouble using a SAM file output by CLC bio Genomics Workbench. Does anyone have an idea how to fix this issue?

                    Code:
                    root@thomasg-desktop:/home/thomasg/Downloads/\tmp/picard-tools-1.27# java -jar MergeSamFiles.jar I=/home/thomasg/RF_7.fastq\ trimmed\ \(paired\)\ mapping\ \(11205\ references\).sam SO=coordinate AS=false O=/home/thomasg/out.sam
                    [Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles OUTPUT=/home/thomasg/out.sam SORT_ORDER=coordinate ASSUME_SORTED=false    MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false TMP_DIR=/tmp/root VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000
                    INFO	2010-08-12 14:30:53	MergeSamFiles	Sorting input files using temp directory /tmp/root
                    [Thu Aug 12 14:30:53 CEST 2010] net.sf.picard.sam.MergeSamFiles done.
                    Runtime.totalMemory()=379322368
                    Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Paired read should be marked as first of pair or second of pair.; File /home/thomasg/RF_7.fastq trimmed (paired) mapping (11205 references).sam; Line 11208
                    Line: RF_43280	25	Contig_1	1	60	50M	*	0	0	ACAGCGACTCAACCAAAGGAATCCTATATAGAAATGCTATTAGGAATCCC	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	NH:i:1
                    	at net.sf.samtools.SAMTextReader.reportErrorParsingLine(SAMTextReader.java:220)
                    	at net.sf.samtools.SAMTextReader.access$500(SAMTextReader.java:40)
                    	at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:424)
                    	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268)
                    	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240)
                    	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:609)
                    	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:587)
                    	at net.sf.picard.util.PeekableIterator.advance(PeekableIterator.java:71)
                    	at net.sf.picard.util.PeekableIterator.<init>(PeekableIterator.java:41)
                    	at net.sf.picard.sam.ComparableSamRecordIterator.<init>(ComparableSamRecordIterator.java:51)
                    	at net.sf.picard.sam.MergingSamRecordIterator.addIterator(MergingSamRecordIterator.java:93)
                    	at net.sf.picard.sam.MergingSamRecordIterator.startIterationIfRequired(MergingSamRecordIterator.java:102)
                    	at net.sf.picard.sam.MergingSamRecordIterator.hasNext(MergingSamRecordIterator.java:117)
                    	at net.sf.picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:190)
                    	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
                    	at net.sf.picard.sam.MergeSamFiles.main(MergeSamFiles.java:83)


                    • mmuratet
                      Member
                      • Jul 2008
                      • 13

                      #11
                      Picard duplicate removal problem

                      I had a similar problem with SAM files derived from Illumina output. The problem was the mate IDs that Illumina uses, i.e., index:pairN:filterFlag. I believe the tools expect pair IDs in the form /1 and /2. Check the output from the workbench to see how it identifies pairs.
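
The FLAG column of the rejected record can also be decoded to see what the parser is complaining about. Below is a minimal Python sketch using the bit values defined in the SAM specification; the function names decode_flag and pair_marking_ok are hypothetical. The record quoted in the error message above has FLAG 25.

```python
# FLAG bits as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
]


def decode_flag(flag):
    """Return the names of the bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def pair_marking_ok(flag):
    """A paired read should carry the first-in-pair or second-in-pair bit."""
    if not flag & 0x1:
        return True  # unpaired reads need neither bit
    return bool(flag & (0x40 | 0x80))


print(decode_flag(25))
# prints ['paired', 'mate unmapped', 'read reverse strand']
print(pair_marking_ok(25))
# prints False
```

FLAG 25 sets the paired bit but neither the first-in-pair (0x40) nor the second-in-pair (0x80) bit, which is consistent with the message "Paired read should be marked as first of pair or second of pair."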


                      • scientifica
                        Junior Member
                        • Jan 2010
                        • 3

                        #12
                        Dear all,

                        For my sequencing project I would also like to remove duplicates. Have any of you already used the CLC Assembly Cell to remove them?
                        I have no idea where to start.


                        • shanlan.mo
                          Junior Member
                          • Jan 2015
                          • 1

                          #13
                          The BAM can be sorted by Picard tools, such as:

                          java -jar $softwave/SamFormatConverter.jar I=$I/HFHm001_1_Tri.fastq_bismark_bt2_pe.sam o=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam
                          java -jar $softwave/SortSam.jar I=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.bam O=$O/HFHm001_1_Tri.fastq_bismark_bt2_pe.sorted.bam sort_order=coordinate
