SEQanswers

Old 05-25-2011, 08:45 AM   #1
elgor
Junior Member
 
Location: Heidelberg

Join Date: May 2011
Posts: 8
Picard's MarkDuplicates -> OutOfMemoryError

Hi folks,

Here comes my first question for you. I'm trying to remove duplicates from a big sorted, merged BAM file (~270 GB) with Picard's MarkDuplicates, but I keep running into OutOfMemoryErrors. I'm fairly new to real-world sequencing and would appreciate any help you can give me.

That's the command I'm using:

Code:
/usr/lib/jvm/java-1.6.0-ibm-1.6.0.8.x86_64/jre/bin/java -Xmx40g -jar /illumina/tools/picard-tools-1.45/MarkDuplicates.jar \
    INPUT=BL14_sorted_merged.bam \
    OUTPUT=BL14_sorted_merged_deduped.bam \
    METRICS_FILE=metrics.txt \
    REMOVE_DUPLICATES=true \
    ASSUME_SORTED=true \
    VALIDATION_STRINGENCY=LENIENT \
    TMP_DIR=/illumina/runs/temp/
The error message usually looks like this after about eight hours of running:

Code:
Exception in thread "main" java.lang.OutOfMemoryError
at net.sf.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:101)
at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:443)
at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:115)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:158)
at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:97)
The machine I'm running it on has 48,275 MB of RAM and 2,000 MB of swap.

Please tell me if you need more info, whether I'm doing something completely wrong, or whether this amount of memory just isn't enough to get a result. Thanks in advance.
Old 05-26-2011, 12:56 AM   #2
elgor
Junior Member
 
Location: Heidelberg

Join Date: May 2011
Posts: 8

It seems I've finally found a working set of arguments! After more than 14 hours it's still running. Fingers crossed that it keeps going and eventually finishes successfully.
Old 06-06-2011, 09:03 AM   #3
oiiio
Senior Member
 
Location: USA

Join Date: Jan 2011
Posts: 105

Do you mind posting your working set of arguments? I'm in a very similar situation with this error.
Old 06-07-2011, 12:15 AM   #4
elgor
Junior Member
 
Location: Heidelberg

Join Date: May 2011
Posts: 8

Quote:
Originally Posted by oiiio View Post
Do you mind posting your working set of arguments? I'm in a very similar situation with this error.
Sorry, I had forgotten to post my solution here. It had already solved a similar problem for someone on the samtools/picard mailing list:

[Samtools-help] Picard MarkDuplicates memory error on very large file


In short: less heap makes Picard more stable. -Xmx4g seems optimal.
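For reference, here is a sketch of the invocation that worked for me (paths are from my setup; adjust to yours). Counterintuitively, the heap is much smaller than the 40 GB I started with: as far as I can tell, Picard's SortingLongCollection sizes an internal buffer from the maximum heap, so a huge -Xmx makes that single allocation itself fail.

```shell
# Same arguments as in my first post, but with a 4 GB heap instead of 40 GB.
java -Xmx4g -jar /illumina/tools/picard-tools-1.45/MarkDuplicates.jar \
    INPUT=BL14_sorted_merged.bam \
    OUTPUT=BL14_sorted_merged_deduped.bam \
    METRICS_FILE=metrics.txt \
    REMOVE_DUPLICATES=true \
    ASSUME_SORTED=true \
    VALIDATION_STRINGENCY=LENIENT \
    TMP_DIR=/illumina/runs/temp/
```

With less RAM for in-memory sorting, Picard spills more to TMP_DIR, so make sure that directory has plenty of free space.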
Old 07-18-2013, 10:54 AM   #5
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

Hi, I am having a lot of trouble with MarkDuplicates on some of my BAM files. It throws the same error shown earlier in this thread:

Quote:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I have tried the following with no success:
1. -Xmx2g (the most my cluster allows me, for some reason): the program ran longer but still threw the same error.
2. MAX_RECORDS_IN_RAM=5000000: this gave me a different error (below).

Quote:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at net.sf.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
at net.sf.samtools.BAMRecord.decodeAttributes(BAMRecord.java:308)
at net.sf.samtools.BAMRecord.getAttribute(BAMRecord.java:288)
at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1601)
at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:540)
at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:672)
at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:650)
at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:386)
at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134)
I don't really know where to go from here. Has anyone else hit this error and managed to solve it?
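As an aside, when a cluster refuses a larger -Xmx it is sometimes a per-process virtual memory limit rather than anything Java-specific; checking `ulimit` is one hypothetical thing to try (your job scheduler may impose its own caps on top of this):

```shell
# Show this shell's per-process virtual memory cap in kilobytes;
# "unlimited" means the shell itself is not restricting the JVM's heap.
ulimit -v
```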
Old 07-18-2013, 11:22 AM   #6
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

And to add insult to injury, I have attempted to add -XX:-UseGCOverheadLimit to my command, which now looks like this:

Quote:
java -Xmx2g -XX:-UseGCOverheadLimit -jar /usr/local/picard/1.84/MarkDuplicates.jar \
    INPUT="$f1"a1.clean.bam \
    OUTPUT="$f1"a1.ddup.bam \
    METRICS_FILE="$f1"a1.ddup.metrics \
    REMOVE_DUPLICATES=false \
    ASSUME_SORTED=true \
    VALIDATION_STRINGENCY=LENIENT \
    TMP_DIR=/scratch/apaciork_group/tmp \
    TMP_DIR=/scratch/dghoneim/tmp \
    CREATE_INDEX=true \
    MAX_RECORDS_IN_RAM=5000000
and now I am getting the original error again!
Quote:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.ArrayList.<init>(Unknown Source)
at java.util.ArrayList.<init>(Unknown Source)
at net.sf.samtools.SAMRecord.getAlignmentBlocks(SAMRecord.java:1370)
at net.sf.samtools.SAMRecord.validateCigar(SAMRecord.java:1413)
at net.sf.samtools.BAMRecord.getCigar(BAMRecord.java:247)
at net.sf.samtools.SAMRecord.getUnclippedStart(SAMRecord.java:472)
at net.sf.picard.sam.MarkDuplicates.buildReadEnds(MarkDuplicates.java:463)
at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:402)
at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134)
Hmm... I am going in circles. Does anyone have a clue what is going on?
Old 07-18-2013, 11:36 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077

Are you sure you are running 64-bit Java? (I wonder if that is why it only lets you allocate 2 GB of heap space.) Both 32-bit and 64-bit Java may be installed on your cluster.

Can you post the output of

Code:
$ java -version
Old 07-18-2013, 11:43 AM   #8
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

Thank you so much, Geno. I am using Java 7:


java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) Server VM (build 23.6-b04, mixed mode)
Old 07-18-2013, 11:51 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077

Quote:
Originally Posted by dGho View Post
Thank you so much Geno
I am using java 7


java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) Server VM (build 23.6-b04, mixed mode)

Can you check the following to see if you get an error?

Code:
$ java -d64 -version
Old 07-18-2013, 12:15 PM   #10
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

Error: This Java instance does not support a 64-bit JVM.

is what I get. So I guess I am running 32-bit. Could this be my problem?
I am a little confused because I had no trouble running MarkDuplicates on my old BAM files until now, just on our most recent ones.
Old 07-18-2013, 12:18 PM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077

Quote:
Originally Posted by dGho View Post
Error: This Java instance does not support a 64-bit JVM.

is what I get. So I guess I am running 32 bit. Could this be my problem?
I am a little confused bc I don't have trouble running MarkDuplicates on my old bam files until now, just our most recent ones.
You are running 32-bit Java. That explains why you have not been able to allocate more heap memory.

Can you look around to see if there is a 64-bit version of Java available on your cluster?

Are these BAM files larger than the previous ones?
Old 07-18-2013, 12:25 PM   #12
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

Yes, these BAM files are slightly larger. I will see if I can use 64-bit Java on our cluster. Thank you so much, Geno, for your suggestion!
Old 07-31-2013, 06:33 AM   #13
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

So, I tried 64-bit Java with the -Xmx4g option. This let MarkDuplicates run longer (72 min), but then it ran out of memory again. Any thoughts?
Old 07-31-2013, 09:09 PM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077

Are you sure the process ran out of RAM or did it run out of temp space on disk? How big is the BAM file?
Old 08-01-2013, 05:26 AM   #15
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

Quote:
Originally Posted by GenoMax View Post
Are you sure the process ran out of RAM or did it run out of temp space on disk? How big is the BAM file?
Thank you, Geno. So I guess it was just a RAM problem after all. I am working on a cluster with effectively unlimited disk space, so I was fairly sure temp space was not the issue. To post the solution that worked for me: -Xmx4g was not enough for this data set, although 2 GB had been enough for all our past exomes.

In my case the fix was -Xmx8g, which ran fine. So I guess -Xmx4g is not always optimal. Thank you, Geno, for all your help; 64-bit Java was definitely the way to go.
Old 08-05-2013, 06:37 AM   #16
gt1
Junior Member
 
Location: Cambridge, UK

Join Date: Jul 2013
Posts: 9

The biobambam package contains a tool called bammarkduplicates. It produces results that should be quite similar to Picard's MarkDuplicates while avoiding the sometimes high memory requirements of the Java implementation. The source code is on GitHub at https://github.com/gt1/biobambam ; binaries for some versions of Linux are at ftp://ftp.sanger.ac.uk/pub/users/gt1/biobambam and on Launchpad at https://launchpad.net/biobambam . It was developed at the Sanger Institute because the Java tool failed with out-of-memory errors on a number of (at least locally high-depth) BAM files. Those failures required manual intervention: rerunning compute jobs with higher memory settings, which blocks otherwise free CPU cores because a single job is using all the RAM. If anyone is interested in the algorithmic background, there is a preprint at http://arxiv.org/abs/1306.0836 .
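For anyone trying it, a minimal sketch of an invocation might look like the following (file names are placeholders; run `bammarkduplicates --help` for the full and current option list):

```shell
# Mark (not remove) duplicates. The key=value arguments mirror Picard's
# style: I = input BAM, O = output BAM, M = metrics file.
bammarkduplicates I=sample_sorted.bam O=sample_markdup.bam M=metrics.txt \
    tmpfile=/scratch/tmp_markdup \
    rmdup=0
```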