10-08-2012, 04:13 AM   #1
Location: Liverpool, UK

Join Date: Feb 2011
Posts: 30
Question Cufflinks timing out - computing power required?

I am analysing human transcriptome data (Illumina) with the TopHat -> Cufflinks pipeline (v2.0.2) using iGenomes references. My dataset comprises 14 patients and 6 controls, so I have 2 conditions to analyse, with 14 and 6 biological replicates respectively.

Until now I have been bypassing the full Cufflinks protocol and running cuffdiff directly against the reference GTF, as follows:

PHP Code:
cuffdiff -p 8 -o ./cuffdiff_out -b genome.fa genes.gtf P1.bam,P2.bam,P3.bam,P4.bam,P5.bam,P6.bam,P7.bam,P8.bam,P9.bam,P10.bam,P11.bam,P12.bam,P13.bam,P14.bam C1.bam,C2.bam,C3.bam,C4.bam,C5.bam,C6.bam
This operation runs across 8 cores of our server (4GB per core) in 11-12h.
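For reference, the two comma-separated replicate lists in that command can be generated rather than typed by hand. This is only a minimal sketch, assuming the BAM files are named P1.bam to P14.bam and C1.bam to C6.bam as above. `replicate_list` is a hypothetical helper, and the cuffdiff command is printed rather than run so it can be checked first:

```shell
# Hypothetical helper: build "P1.bam,P2.bam,...,PN.bam" for a file prefix
# and a replicate count (cuffdiff takes replicates as a comma-separated list).
replicate_list() {
    prefix=$1
    count=$2
    list=""
    i=1
    while [ "$i" -le "$count" ]; do
        # Append a comma only when the list already has an entry.
        list="${list:+$list,}${prefix}${i}.bam"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

patients=$(replicate_list P 14)
controls=$(replicate_list C 6)

# Print the command instead of executing it.
printf 'cuffdiff -p 8 -o ./cuffdiff_out -b genome.fa genes.gtf %s %s\n' "$patients" "$controls"
```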

However, I have been trying to run the full cufflinks -> cuffmerge -> cuffdiff protocol (as per the Nature Protocols publication) but have not yet been able to complete the entire process. My IT support team have been very helpful, but the final cuffdiff job requires HUGE amounts of computing power and time. I wonder what other people's experience of this is, or whether I am doing something wrong.

I have successfully run these operations:-

Cufflinks for each BAM file:
PHP Code:
cufflinks -p 8 -o ./output_dir -b genome.fa -g genes.gtf P1.bam
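Since this step is repeated for all 20 BAM files, a loop along these lines could generate the commands. It is a sketch only: the per-sample output directory `./cufflinks_out/<sample>` is an assumed layout (the command above shows just `./output_dir` for P1), and `cufflinks_cmd` is a hypothetical helper that prints each command rather than running it:

```shell
# Hypothetical helper: print the cufflinks command for one BAM file.
# The output directory ./cufflinks_out/<sample> is an assumed layout.
cufflinks_cmd() {
    bam=$1
    sample=$(basename "$bam" .bam)   # e.g. P1.bam -> P1
    printf 'cufflinks -p 8 -o ./cufflinks_out/%s -b genome.fa -g genes.gtf %s\n' "$sample" "$bam"
}

# Extend this list to all 20 BAM files.
for bam in P1.bam P2.bam C1.bam; do
    cufflinks_cmd "$bam"
done
```

Piping the printed lines to `sh` would run the jobs one after another.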
Then create assemblies.txt file:-
PHP Code:
Cuffmerge (this took 1h):
PHP Code:
cuffmerge -p 8 -o ./cuffmerge_out -g genes.gtf -s genome.fa assemblies.txt
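The assemblies.txt given to cuffmerge is a plain list with one assembly GTF path per line. As a rough illustration, it could be generated as below. Cufflinks writes its assembly as transcripts.gtf inside each output directory, but the `./cufflinks_out/<sample>` directory names here are an assumption rather than my actual paths, and `list_assemblies` is a hypothetical helper:

```shell
# Hypothetical helper: print one transcripts.gtf path per sample name.
# Assumes each sample's cufflinks output went to ./cufflinks_out/<sample>.
list_assemblies() {
    for sample in "$@"; do
        printf './cufflinks_out/%s/transcripts.gtf\n' "$sample"
    done
}

# Extend to all 20 samples and redirect the output to assemblies.txt.
list_assemblies P1 P2 C1
```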
Cuffdiff on the merged assembly (this is the step I cannot complete):
PHP Code:
cuffdiff -p 8 -o ./cuffdiff_out -b genome.fa -u merged.gtf P1.bam,P2.bam,P3.bam,P4.bam,P5.bam,P6.bam,P7.bam,P8.bam,P9.bam,P10.bam,P11.bam,P12.bam,P13.bam,P14.bam C1.bam,C2.bam,C3.bam,C4.bam,C5.bam,C6.bam
The last time I tried to run the cuffdiff step I was allocated 160GB across 8 cores for 5 days. The job timed out at the "Testing for differential expression and regulation in locus" step. It also only ever used ~30GB of the 160GB allocated.

Can anyone offer any advice / suggestions / or even let me know how much computing power / time they use for their runs?

Much appreciated
hlwright