SEQanswers

GATK CountCovariates running very slow

#1 indapa, 06-28-2011, 09:09 AM

Hi,

I tried posting this question on the GetSatisfaction GATK forum but kept getting an invalid request error in Firefox, so I thought I would give SEQanswers a try (this is my first post here).

I am trying to recalibrate quality scores with GATK CountCovariates and it is running extremely slow:

java -Xmx64000m -jar GenomeAnalysisTK.jar -R $REF_BIN/$REF \
    --DBSNP $DBSNP_BIN/$DBSNP -l INFO -T CountCovariates -I my.bam \
    --max_reads_at_locus 20000 \
    -cov ReadGroupCovariate -cov QualityScoreCovariate \
    -cov CycleCovariate -cov DinucCovariate \
    -recalFile $CSV > $CSV.stdout 2> $NODE_DIR/$OUTPUT.stderr

Initially, GATK throws an EOFException while reading a *.rod.idx file:

INFO 08:51:15,032 TribbleRMDTrackBuilder - Loading Tribble index from
disk for file /scratch/indapa/dbsnp_129_b37.rod
ERROR 08:51:19,710 LinearIndex - Error reading index file:
/scratch/indapa/dbsnp_129_b37.rod.idx
java.io.EOFException

But it then proceeds to the CovariateCounterWalker and starts reporting the number of sites traversed (the BAM file I want to recalibrate has ~150M reads and is 11 GB in size):

INFO 08:59:30,757 CovariateCounterWalker - The covariates being used here:
INFO 08:59:30,758 CovariateCounterWalker - ReadGroupCovariate
INFO 08:59:30,758 CovariateCounterWalker - QualityScoreCovariate
INFO 08:59:30,758 CovariateCounterWalker - CycleCovariate
INFO 08:59:30,759 CovariateCounterWalker - DinucCovariate
INFO 09:00:25,452 TraversalEngine - [PROGRESS] Traversed to 1:10001,
processing 1 sites in 545.65 secs (545645000.00 secs per 1M sites)
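
The progress line above already shows the problem: one site in roughly 545 seconds. As a rough sketch (not part of GATK), a few lines of Python can pull the numbers out of a TraversalEngine progress message and recompute the rate. The regular expression is an assumption based only on the log format shown in this post, and the function names are hypothetical.

```python
import re

# Pattern inferred from the single progress line quoted in this thread;
# other GATK versions may format the message differently (assumption).
PROGRESS_RE = re.compile(
    r"Traversed to (?P<locus>\S+),\s+processing (?P<sites>[\d,]+) sites "
    r"in (?P<secs>[\d.]+) secs"
)


def parse_progress(text):
    """Return (locus, sites, seconds) from a progress message, or None."""
    # Log lines are often wrapped, so collapse all whitespace first.
    flat = " ".join(text.split())
    match = PROGRESS_RE.search(flat)
    if match is None:
        return None
    sites = int(match.group("sites").replace(",", ""))
    secs = float(match.group("secs"))
    return match.group("locus"), sites, secs


def secs_per_million(sites, secs):
    """Seconds needed per one million sites at the observed rate."""
    return secs / sites * 1_000_000


if __name__ == "__main__":
    line = (
        "INFO 09:00:25,452 TraversalEngine - [PROGRESS] Traversed to 1:10001,\n"
        "processing 1 sites in 545.65 secs (545645000.00 secs per 1M sites)"
    )
    locus, sites, secs = parse_progress(line)
    print(locus, sites, secs, secs_per_million(sites, secs))
```

At that rate a whole chromosome would never finish, which points to something wrong in setup (here, the index) rather than a run that is merely slow.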

It has been traversing human chromosome 1 for more than 2 days. I was initially getting an out-of-memory exception, so I allocated much more memory to the Java heap than I had in the past. I'm not sure why this is taking so much longer than previous BAM files of similar size that I've recalibrated with GATK. Has anyone experienced similar behavior with CountCovariates?

#2 indapa, 06-30-2011, 06:46 AM

Figured it out: the rod file index was corrupted. Downloaded a new version of GATK along with the resource bundle: http://www.broadinstitute.org/gsa/wi...esource_bundle with the dbSNP VCF, and it works much better.
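
For anyone hitting the same EOFException, a quick sanity check on the index file before launching a long run may save some time. The following is only a heuristic sketch, not a GATK feature: it assumes the index sits next to the data file with an ".idx" suffix (as in the log above), and the function name is hypothetical. It flags an index that is empty or older than its data file; it cannot detect every kind of corruption, for example a truncated but non-empty index.

```python
import os


def index_looks_suspect(data_path, index_suffix=".idx"):
    """Heuristically check the index next to data_path.

    Returns (suspect, reason). The ".idx" suffix convention is an
    assumption taken from the file names in the log above.
    """
    index_path = data_path + index_suffix
    if not os.path.exists(index_path):
        return False, "no index present"
    if os.path.getsize(index_path) == 0:
        return True, "index is empty"
    if os.path.getmtime(index_path) < os.path.getmtime(data_path):
        return True, "index is older than data file"
    return False, "index passes basic checks"


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        suspect, reason = index_looks_suspect(path)
        print(f"{path}: {'SUSPECT' if suspect else 'ok'} ({reason})")
```

If the check flags a file, re-downloading the resource (as done here) is the safe route.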

Tags
base quality, gatk, recalibration
