  • Picard's MarkDuplicates -> OutOfMemoryError

    Hi folks,

    here comes my first question for you. I'm trying to remove duplicates from a large, sorted, merged BAM file (~270 GB) with Picard's MarkDuplicates, but I keep running into OutOfMemoryErrors. I'm fairly new to real-world sequencing work and would appreciate any help you can give me.

    That's the command I'm using:

    Code:
    /usr/lib/jvm/java-1.6.0-ibm-1.6.0.8.x86_64/jre/bin/java -jar -Xmx40g /illumina/tools/picard-tools-1.45/MarkDuplicates.jar 
    INPUT=BL14_sorted_merged.bam 
    OUTPUT=BL14_sorted_merged_deduped.bam 
    METRICS_FILE=metrics.txt 
    REMOVE_DUPLICATES=true 
    ASSUME_SORTED=true 
    VALIDATION_STRINGENCY=LENIENT 
    TMP_DIR=/illumina/runs/temp/
    The error message usually looks like this after running for around 8 hours:

    Code:
    Exception in thread "main" java.lang.OutOfMemoryError
    at net.sf.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:101)
    at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:443)
    at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:115)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:158)
    at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:97)
    The machine I'm running it on has 48275 MB RAM and 2000 MB Swap.

    Please tell me if you need more info, if I'm doing something completely wrong, or if the amount of memory just isn't enough to get a result. Thanks in advance.

  • #2
    It seems I've finally found a working set of arguments! After more than 14 hours it's still running. Fingers crossed that it keeps going and eventually finishes successfully.


    • #3
      Do you mind posting your working set of arguments? I'm in a very similar situation with this error.


      • #4
        Originally posted by oiiio
        Do you mind posting your working set of arguments? I'm in a very similar situation with this error.
        Sorry, I forgot to post my solution here. It has already solved a similar problem for someone on the samtools/picard mailing list:

        [Samtools-help] Picard MarkDuplicates memory error on very large file


        In short: less heap makes Picard more stable. -Xmx4g seems optimal.


        • #5
          Hi, I am having a lot of trouble with MarkDuplicates on some of my BAM files. It was throwing the same error as shown in this thread:

          Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
          I have tried the following with no success:
          1. -Xmx2g (the most my cluster allows me, for some reason): this let the program run longer, but it still throws the same error
          2. MAX_RECORDS_IN_RAM=5000000: this gave me a different error (below)

          Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
          at net.sf.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
          at net.sf.samtools.BAMRecord.decodeAttributes(BAMRecord.java:308)
          at net.sf.samtools.BAMRecord.getAttribute(BAMRecord.java:288)
          at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1601)
          at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:540)
          at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
          at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
          at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:672)
          at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:650)
          at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:386)
          at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150)
          at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
          at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134)
          I don't really know where to go from here. Has anyone else had the above error thrown and been able to solve it?


          • #6
            Originally posted by dGho
            Hi, I am having a lot of trouble with MarkDuplicates on some of my BAM files. It was throwing the same error as shown in this thread:

            I have tried the following with no success:
            1. -Xmx2g (the most my cluster allows me, for some reason): this let the program run longer, but it still throws the same error
            2. MAX_RECORDS_IN_RAM=5000000: this gave me a different error (below)

            I don't really know where to go from here. Has anyone else had the above error thrown and been able to solve it?
            And to add insult to injury, I have tried adding -XX:-UseGCOverheadLimit to my command, which now looks like this:

            java -Xmx2g -XX:-UseGCOverheadLimit -jar /usr/local/picard/1.84/MarkDuplicates.jar INPUT="$f1"a1.clean.bam OUTPUT="$f1"a1.ddup.bam METRICS_FILE="$f1"a1.ddup.metrics REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/scratch/apaciork_group/tmp TMP_DIR=/scratch/dghoneim/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=5000000
            and now I am getting the original error again!
            Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.util.ArrayList.<init>(Unknown Source)
            at java.util.ArrayList.<init>(Unknown Source)
            at net.sf.samtools.SAMRecord.getAlignmentBlocks(SAMRecord.java:1370)
            at net.sf.samtools.SAMRecord.validateCigar(SAMRecord.java:1413)
            at net.sf.samtools.BAMRecord.getCigar(BAMRecord.java:247)
            at net.sf.samtools.SAMRecord.getUnclippedStart(SAMRecord.java:472)
            at net.sf.picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:463)
            at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:402)
            at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150)
            at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
            at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134)
            hmmm...I am going in circles...anyone have a clue what is going on?


            • #7
              Are you sure you are running 64-bit Java? (I wonder if that is why it only lets you allocate 2 GB of heap space.) Both 32-bit and 64-bit Java may be installed on your cluster.

              Can you post the output of

              Code:
              $ java -version


              • #8
                Thank you so much Geno
                I am using java 7


                java version "1.7.0_11"
                Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
                Java HotSpot(TM) Server VM (build 23.6-b04, mixed mode)


                • #9
                  Originally posted by dGho
                  Thank you so much Geno
                  I am using java 7


                  java version "1.7.0_11"
                  Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
                  Java HotSpot(TM) Server VM (build 23.6-b04, mixed mode)

                  Can you check the following to see if you get an error?

                  Code:
                  $ java -d64 -version


                  • #10
                    Error: This Java instance does not support a 64-bit JVM.

                    is what I get. So I guess I am running 32-bit. Could this be my problem?
                    I am a little confused because I had no trouble running MarkDuplicates on my old BAM files; it only fails on our most recent ones.


                    • #11
                      Originally posted by dGho
                      Error: This Java instance does not support a 64-bit JVM.

                      is what I get. So I guess I am running 32-bit. Could this be my problem?
                      I am a little confused because I had no trouble running MarkDuplicates on my old BAM files; it only fails on our most recent ones.
                      You are running 32-bit Java. That explains why you have not been able to allocate more heap memory.

                      Can you look around to see if a 64-bit version of Java is available on your cluster?

                      Are these BAM files larger than the previous ones?
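
The bitness can also be read off the plain `java -version` output. A minimal sketch, assuming a HotSpot/OpenJDK-style JVM that prints "64-Bit" in its VM line (other vendors may word this differently, so treat a "32" result with some caution); the function name `jvm_bitness` is made up for illustration:

```shell
# Read `java -version` output on stdin and print 64 or 32.
# Assumption: a 64-bit HotSpot/OpenJDK JVM includes "64-Bit" in its VM line;
# anything without that marker is reported as 32.
jvm_bitness() {
    if grep -q '64-Bit'; then
        echo 64
    else
        echo 32
    fi
}

# Usage (java -version writes to stderr, hence the 2>&1):
#   java -version 2>&1 | jvm_bitness
```

Fed the `java -version` output posted above (VM line "Java HotSpot(TM) Server VM"), this prints 32, which matches the `-d64` error.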


                      • #12
                        Yes, these BAM files are slightly larger. I will see if I can use 64-bit Java on our cluster. Thank you so much, Geno, for your suggestion!


                        • #13
                          So, I tried using 64-bit Java with the -Xmx4g option. This allowed MarkDuplicates to run longer (72 min), but then it ran out of memory again. Any thoughts?


                          • #14
                            Are you sure the process ran out of RAM or did it run out of temp space on disk? How big is the BAM file?
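
One rough way to tell the two apart is to compare the free space under TMP_DIR with the size of the BAM file. A sketch using standard `df` and `du`; the helper names `avail_kb` and `file_kb` are hypothetical, and how much temp space MarkDuplicates really needs for a given file is not something this check can tell you:

```shell
# Print the available space, in KB, on the filesystem that holds directory $1.
# -P forces the one-line-per-filesystem POSIX layout; field 4 is "Available".
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print the disk usage of file $1, in KB.
file_kb() {
    du -k "$1" | awk '{ print $1 }'
}

# Usage (substitute your own TMP_DIR and BAM file):
#   echo "temp space free: $(avail_kb /path/to/tmp) KB"
#   echo "BAM size:        $(file_kb input.bam) KB"
```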


                            • #15
                              Originally posted by GenoMax
                              Are you sure the process ran out of RAM or did it run out of temp space on disk? How big is the BAM file?
                              Thank you Geno. I guess it was just a problem with RAM. I am working on a cluster that has "unlimited" disk space, so I was pretty sure that was not the problem. I wanted to post the solution that worked for me: -Xmx4g was not enough for the data set I am working on, although 2 GB had been enough for all the past exomes.

                              In my case the solution was -Xmx8g, which ran fine, so I guess -Xmx4g is not always optimal. Thank you Geno for all your help; 64-bit Java was definitely the way to go.
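
Since the heap size that works seems to depend on the data set, one option is to step through a few sizes. A minimal sketch, assuming any non-zero exit status means "try more heap" (which is not always true, and each failed attempt can cost an hour or more); `try_heaps` and `mark_duplicates` are hypothetical helpers, and the file names are placeholders:

```shell
# Run a command with successively larger heap sizes until one succeeds.
# $1: space-separated heap sizes, e.g. "2g 4g 8g"
# $2: name of a function or command that takes the heap size as its argument
# Prints the first heap size that worked; returns 1 if none did.
try_heaps() {
    sizes=$1
    runner=$2
    for heap in $sizes; do
        echo "Trying -Xmx$heap" >&2
        if "$runner" "$heap"; then
            echo "$heap"
            return 0
        fi
    done
    return 1
}

# Hypothetical runner: the jar path is the one used earlier in this thread,
# the BAM and metrics file names are placeholders.
mark_duplicates() {
    java -Xmx"$1" -jar /usr/local/picard/1.84/MarkDuplicates.jar \
        INPUT=in.bam OUTPUT=out.bam METRICS_FILE=out.metrics \
        ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT
}

# Usage:
#   try_heaps "2g 4g 8g" mark_duplicates
```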
