Seqanswers Leaderboard Ad

**raonyguimaraes** · 11-04-2011, 01:13 AM

No it shoudln't take that long ... but this is proportional to your computer's capabilities ... What is your config (specially processors)? Are you using threads in GATK ? How many ?

**alexbmp** · 11-04-2011, 01:44 AM

Originally posted by raonyguimaraes View Post

No it shoudln't take that long ... but this is proportional to your computer's capabilities ... What is your config (specially processors)? Are you using threads in GATK ? How many ?

model name : AMD Opteron(tm) Processor 6172
cpu MHz : 2100.034

I think samtools sort takes so long probably because I divided my .bam alignment output into 500 pieces... (before alignment and the output is still divided)

There are 48 cores; I gave -nt value 24, but it seems to utilize about 6 cores.

**adaptivegenome** · 11-04-2011, 04:26 PM

We use similar machines and we find BWA scales perfectly, in that 48 cores enable mapping nearly 48 times faster than 1 core. We load data directly to the computer's internal RAID because otherwise the I/O is unbearably slow. We regularly map human genomes and memory has not been a bottleneck for us. We don't divide our SAM files. Are you making SAM files and then converting to BAM? If so, that can be sped up by directly streaming the mapper to samtools for conversion to BAM...

**alexbmp** · 11-04-2011, 05:19 PM

@genericforms

You mean 1) mappers like bwa, gsnap etc. can stream its output to samtools,

or 2) we could add another line after bwa aln, samse/pe to stream its output to samtools?

It really is a pain treating divided files...

Simple merging by picard takes more than 6 hours.

**adaptivegenome** · 11-04-2011, 05:24 PM

Yes, I meant #1. This saves on having to first write a SAM file and then write it again as a BAM file. I don't think you should split the files and again I would recommend running on a local drive.

**adaptivegenome** · 11-04-2011, 05:36 PM

It would be something like this:
bwa sampe [parameters] | samtools view -bt [reference] > out.bam

**alexbmp** · 11-04-2011, 09:41 PM

Thank you all

I'm trying it the other way.

The splicing seems to have taken most of the time.

**adaptivegenome** · 11-05-2011, 05:18 PM

Good luck. Post an update on whether things work better for you.

**alexbmp** · 11-10-2011, 12:24 AM

Not using splitted files really make the speed.

Still, it seems too slow to follow the pipeline GATK suggests...

I've changed my pipeline a little bit.

I reckoned that almost all the time indices are required.

My process stopped when MarkDuplicates.jar make an error:
(Any ideas about this error?)

Code:

[Thu Nov 10 15:08:17 KST 2011] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 328.99 minutes.
Runtime.totalMemory()=3538878464
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
        at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:228)
        at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
        at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:290)
        at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:117)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
        at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:101)
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.write(Native Method)
        at org.xerial.snappy.SnappyOutputStream.writeInt(SnappyOutputStream.java:105)
        at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:126)
        at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
        at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
        at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:219)
        ... 5 more

1. Sorting the BAM format. [samtools]
~12 h.

2. Adding read groups which I didn't include in the SAM files. [picard AddOrReplaceReadGroups.jar]
~6 h.

3. Indexing BAM output. [samtools]
~32 min.

4. Reordering BAM; my reads didn't seem to be in karyotypic order. [picard ReorderSam.jar]
~6.5 h.

5. Indexing BAM output. [samtools]
~24 min.

*STOPPED HERE*6. Marking plausible PCR duplicates. [picard MarkDuplicates.jar]
~5.5 h.

7. Indexing BAM output. [samtools]

8. Create suspicious intervals for realignment. [GATK RealignTargetCreator]

9. Realign to remove false-positive SNPs and correct for false-negative indels - if this is the right explanation. [GATK IndelRealigner]

10. Indexing BAM output. [samtools]

11. Recalibrating base quality [GATK CountCovariates, TableRecalibration]

12. Indexing BAM output. [samtools]

13. Calling SNPs and indels [-glm BOTH] with GATK Bayesian caller. [GATK UnifiedGenotyper.jar]
~3.2 hours for my toy; ~1/500 of my real sample size.

14. Filter SNVs [GATK VariantFiltration]

I'm looking for a way to solve the error given by picard.

Anyways, I wanted to update my work here, so others don't get in to the same trouble I did.

If you have any ideas about my plan or my question, I'd be most welcome.

**vang42** · 11-10-2011, 01:27 AM

Originally posted by alexbmp View Post

My process stopped when MarkDuplicates.jar make an error:
(Any ideas about this error?)

Have you set TMP_DIR to a place with enough disk space. You can set it using something lijke this: TMP_DIR=`pwd`/tmp or whatever you like.

**alexbmp** · 11-10-2011, 01:41 AM

@ vang42:
Actually, I haven't. That really is a good point! Thank you!

I'll try it on and update if it works

**alexbmp** · 11-11-2011, 08:57 PM

Well, it works! Thank you all!

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

GATK pipeline runtime

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News