SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Cufflinks Runtime ksiowa Bioinformatics 10 04-27-2012 06:52 AM
Galaxy workflow for GATK pipeline [Work in progress] Carlos Borroto Bioinformatics 3 11-30-2011 08:41 AM
Approximate botwie runtime BioSlayer Bioinformatics 6 05-09-2011 06:49 AM
Euler-SR de novo assembly error during runtime allenyu Bioinformatics 10 08-31-2009 02:12 PM

Reply
 
Thread Tools
Old 11-03-2011, 11:30 PM   #1
alexbmp
Member
 
Location: Seoul, South Korea

Join Date: Oct 2011
Posts: 30
Default GATK pipeline runtime

I wonder if I'm doing the things right; it takes so long.

I'm going to find SNPs for about 100 giga bps for the human genome.

After alignment, I processed my reads with samtools, picard and GATK.


1. Converting (divided for alignment efficiency) SAM to BAM format. [samtools]
~500 min.

2. Sorting the BAM format. [samtools]
~12 h.

3. Merging divided BAM formats to 1 BAM format per sample-run. [picard MergeSamFiles.jar]

4. Adding read groups which I didn't include in the SAM files. [picard AddOrReplaceReadGroups.jar]

5. Indexing sorted & merged BAM output. [samtools]

6. Create suspicious intervals for realignment. [GATK RealignTargetCreator]

7. Realign to remove false-positive SNPs and correct for false-negative indels - if this is the right explanation. [GATK IndelRealigner]

8. Marking plausible PCR duplicates. [picard MarkDuplicates.jar]

9. Re-indexing sorted-merged-realigned-deduplicated BAM. [samtools]

10. Recalibrating base quality [GATK CountCovariates, TableRecalibration]

11. Re-indexing sorted-merged-realigned-deduplicated-recalibrated BAM. [samtools]

12. Calling SNPs and indels [-glm BOTH] with GATK Bayesian caller. [GATK UnifiedGenotyper.jar]
~3.2 hours for my toy; ~1/500 of my real sample size.

13. Filter SNVs ## I haven't reached here yet.


Is it normal to take this long for pre-processing map/alignment results?

Before I go into troubleshooting I came here to ask for comments.

Hope you guys give me some comments.

Have a great day!!

Last edited by alexbmp; 11-03-2011 at 11:33 PM.
alexbmp is offline   Reply With Quote
Old 11-04-2011, 02:13 AM   #2
raonyguimaraes
Member
 
Location: Belo Horizonte - Brazil

Join Date: Jun 2010
Posts: 38
Default

No it shoudln't take that long ... but this is proportional to your computer's capabilities ... What is your config (specially processors)? Are you using threads in GATK ? How many ?
raonyguimaraes is offline   Reply With Quote
Old 11-04-2011, 02:44 AM   #3
alexbmp
Member
 
Location: Seoul, South Korea

Join Date: Oct 2011
Posts: 30
Default

Quote:
Originally Posted by raonyguimaraes View Post
No it shoudln't take that long ... but this is proportional to your computer's capabilities ... What is your config (specially processors)? Are you using threads in GATK ? How many ?
model name : AMD Opteron(tm) Processor 6172
cpu MHz : 2100.034

I think samtools sort takes so long probably because I divided my .bam alignment output into 500 pieces... (before alignment and the output is still divided)

There are 48 cores; I gave -nt value 24, but it seems to utilize about 6 cores.
alexbmp is offline   Reply With Quote
Old 11-04-2011, 05:26 PM   #4
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

We use similar machines and we find BWA scales perfectly, in that 48 cores enable mapping nearly 48 times faster than 1 core. We load data directly to the computer's internal RAID because otherwise the I/O is unbearably slow. We regularly map human genomes and memory has not been a bottleneck for us. We don't divide our SAM files. Are you making SAM files and then converting to BAM? If so, that can be sped up by directly streaming the mapper to samtools for conversion to BAM...

Last edited by adaptivegenome; 11-04-2011 at 05:26 PM. Reason: typos
adaptivegenome is offline   Reply With Quote
Old 11-04-2011, 06:19 PM   #5
alexbmp
Member
 
Location: Seoul, South Korea

Join Date: Oct 2011
Posts: 30
Default

@genericforms

You mean 1) mappers like bwa, gsnap etc. can stream its output to samtools,

or 2) we could add another line after bwa aln, samse/pe to stream its output to samtools?

It really is a pain treating divided files...

Simple merging by picard takes more than 6 hours.
alexbmp is offline   Reply With Quote
Old 11-04-2011, 06:24 PM   #6
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

Yes, I meant #1. This saves on having to first write a SAM file and then write it again as a BAM file. I don't think you should split the files and again I would recommend running on a local drive.
adaptivegenome is offline   Reply With Quote
Old 11-04-2011, 06:36 PM   #7
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

It would be something like this:
bwa sampe [parameters] | samtools view -bt [reference] > out.bam
adaptivegenome is offline   Reply With Quote
Old 11-04-2011, 10:41 PM   #8
alexbmp
Member
 
Location: Seoul, South Korea

Join Date: Oct 2011
Posts: 30
Default

Thank you all I'm trying it the other way.

The splicing seems to have taken most of the time.
alexbmp is offline   Reply With Quote
Old 11-05-2011, 06:18 PM   #9
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

Good luck. Post an update on whether things work better for you.
adaptivegenome is offline   Reply With Quote
Old 11-10-2011, 12:24 AM   #10
alexbmp
Member
 
Location: Seoul, South Korea

Join Date: Oct 2011
Posts: 30
Default

Not using splitted files really make the speed.

Still, it seems too slow to follow the pipeline GATK suggests...

I've changed my pipeline a little bit.

I reckoned that almost all the time indices are required.

My process stopped when MarkDuplicates.jar make an error:
(Any ideas about this error?)

Code:
[Thu Nov 10 15:08:17 KST 2011] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 328.99 minutes.
Runtime.totalMemory()=3538878464
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
        at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:228)
        at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
        at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:290)
        at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:117)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
        at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:101)
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.write(Native Method)
        at org.xerial.snappy.SnappyOutputStream.writeInt(SnappyOutputStream.java:105)
        at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:126)
        at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
        at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
        at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:219)
        ... 5 more
1. Sorting the BAM format. [samtools]
~12 h.

2. Adding read groups which I didn't include in the SAM files. [picard AddOrReplaceReadGroups.jar]
~6 h.

3. Indexing BAM output. [samtools]
~32 min.

4. Reordering BAM; my reads didn't seem to be in karyotypic order. [picard ReorderSam.jar]
~6.5 h.

5. Indexing BAM output. [samtools]
~24 min.

*STOPPED HERE*6. Marking plausible PCR duplicates. [picard MarkDuplicates.jar]
~5.5 h.

7. Indexing BAM output. [samtools]

8. Create suspicious intervals for realignment. [GATK RealignTargetCreator]

9. Realign to remove false-positive SNPs and correct for false-negative indels - if this is the right explanation. [GATK IndelRealigner]

10. Indexing BAM output. [samtools]

11. Recalibrating base quality [GATK CountCovariates, TableRecalibration]

12. Indexing BAM output. [samtools]

13. Calling SNPs and indels [-glm BOTH] with GATK Bayesian caller. [GATK UnifiedGenotyper.jar]
~3.2 hours for my toy; ~1/500 of my real sample size.

14. Filter SNVs [GATK VariantFiltration]


I'm looking for a way to solve the error given by picard.

Anyways, I wanted to update my work here, so others don't get in to the same trouble I did.

If you have any ideas about my plan or my question, I'd be most welcome.

Last edited by alexbmp; 11-10-2011 at 04:38 AM.
alexbmp is offline   Reply With Quote
Old 11-10-2011, 01:27 AM   #11
vang42
Member
 
Location: Denmark

Join Date: Mar 2010
Posts: 10
Default

Quote:
Originally Posted by alexbmp View Post
My process stopped when MarkDuplicates.jar make an error:
(Any ideas about this error?)
Have you set TMP_DIR to a place with enough disk space. You can set it using something lijke this: TMP_DIR=`pwd`/tmp or whatever you like.
vang42 is offline   Reply With Quote
Old 11-10-2011, 01:41 AM   #12
alexbmp
Member
 
Location: Seoul, South Korea

Join Date: Oct 2011
Posts: 30
Default

@ vang42:
Actually, I haven't. That really is a good point! Thank you!

I'll try it on and update if it works
alexbmp is offline   Reply With Quote
Old 11-11-2011, 08:57 PM   #13
alexbmp
Member
 
Location: Seoul, South Korea

Join Date: Oct 2011
Posts: 30
Default

Well, it works! Thank you all!
alexbmp is offline   Reply With Quote
Reply

Tags
gatk, picard, preprocessing, runtime, samtools

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:31 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO