I am analysing human transcriptome data (Illumina) via the TopHat -> Cufflinks pipeline (v2.0.2) using iGenomes references. My dataset comprises 14 patients and 6 controls, so I have 2 "conditions" to analyse, with 14 and 6 biological replicates respectively.
Until now I have been bypassing the full Cufflinks protocol and just running cuffdiff, providing a GTF, as follows:

Code:
cuffdiff -p 8 -o ./cuffdiff_out -b genome.fa genes.gtf P1.bam,P2.bam,P3.bam,P4.bam,P5.bam,P6.bam,P7.bam,P8.bam,P9.bam,P10.bam,P11.bam,P12.bam,P13.bam,P14.bam C1.bam,C2.bam,C3.bam,C4.bam,C5.bam,C6.bam
This operation runs across 8 cores of our server (4GB per core) in 11-12h.
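(In case it is relevant: as I read the cuffdiff documentation, the two conditions can also be given explicit names with the -L/--labels option. A minimal variant of the same command; "patient" and "control" are placeholder labels of mine:)

Code:
# same pooled run, but with the two conditions named via cuffdiff's -L option
cuffdiff -p 8 -L patient,control -o ./cuffdiff_out -b genome.fa genes.gtf \
    P1.bam,P2.bam,P3.bam,P4.bam,P5.bam,P6.bam,P7.bam,P8.bam,P9.bam,P10.bam,P11.bam,P12.bam,P13.bam,P14.bam \
    C1.bam,C2.bam,C3.bam,C4.bam,C5.bam,C6.bam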
However, I have been trying to run the full cufflinks -> cuffmerge -> cuffdiff protocol (as per the Nature Protocols publication), but as yet I have not been able to complete the entire process successfully. My IT support team have been very helpful, but the final cuffdiff job requires HUGE amounts of computing power and time, and I wonder what other people's experience of this is, or whether I am doing something wrong.
I have successfully run these operations:
Cufflinks for each BAM file:

Code:
cufflinks -p 8 -o ./output_dir -b genome.fa -g genes.gtf P1.bam
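(I repeat this for each of the 20 BAM files; a simple shell loop along these lines, assuming all the BAMs sit in the working directory and each sample gets its own output directory:)

Code:
# run cufflinks once per sample; output lands in ./cufflinks_<sample>/
for bam in P{1..14}.bam C{1..6}.bam; do
    sample=$(basename "$bam" .bam)
    cufflinks -p 8 -o ./cufflinks_"$sample" -b genome.fa -g genes.gtf "$bam"
done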
Then create an assemblies.txt file, listing each sample's Cufflinks transcripts.gtf:

Code:
./path/to/P1/transcripts.gtf
./path/to/P2/transcripts.gtf
...
etc
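(Rather than typing the paths by hand, the file can be generated; a sketch, assuming the per-sample output directories from the loop above:)

Code:
# each cufflinks run writes a transcripts.gtf; list them all for cuffmerge
ls -1 ./cufflinks_*/transcripts.gtf > assemblies.txt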
Cuffmerge (this took 1h):

Code:
cuffmerge -p 8 -o ./cuffmerge_out -g genes.gtf -s genome.fa assemblies.txt
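(Should it be useful for diagnosis: a rough way to compare the size of the merged annotation against the reference one, since I gather the number of transcripts drives cuffdiff's workload. Assumes GNU grep; merged.gtf as referenced in the commands below:)

Code:
# count distinct transcript IDs in each annotation
grep -o 'transcript_id "[^"]*"' genes.gtf  | sort -u | wc -l
grep -o 'transcript_id "[^"]*"' merged.gtf | sort -u | wc -l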
Cuffdiff:

Code:
cuffdiff -p 8 -o ./cuffdiff_out -b genome.fa -u merged.gtf P1.bam,P2.bam,P3.bam,P4.bam,P5.bam,P6.bam,P7.bam,P8.bam,P9.bam,P10.bam,P11.bam,P12.bam,P13.bam,P14.bam C1.bam,C2.bam,C3.bam,C4.bam,C5.bam,C6.bam
The last time I tried to run the cuffdiff step, I was allocated 160GB of RAM across 8 cores for 5 days. The job timed out at the "Testing for differential expression and regulation in locus" step, and it only ever used ~30GB of the 160GB allocated.
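(For completeness: the only cuffdiff options I have spotted in the documentation that look as though they might reduce the per-locus workload are --max-bundle-frags, which as I read it skips loci above a fragment cap and flags them HIDATA, and -M, which masks loci listed in a GTF, e.g. rRNA/mitochondrial genes. A variant I have not yet tried; mask.gtf and the cap of 100000 are placeholders of mine:)

Code:
# final cuffdiff command as above, plus a per-locus fragment cap and a mask file
# (mask.gtf is a hypothetical GTF of loci to exclude; 100000 is an arbitrary example cap)
cuffdiff -p 8 -o ./cuffdiff_out -b genome.fa -u \
    --max-bundle-frags 100000 -M mask.gtf merged.gtf \
    P1.bam,P2.bam,P3.bam,P4.bam,P5.bam,P6.bam,P7.bam,P8.bam,P9.bam,P10.bam,P11.bam,P12.bam,P13.bam,P14.bam \
    C1.bam,C2.bam,C3.bam,C4.bam,C5.bam,C6.bam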
Can anyone offer any advice or suggestions, or let me know how much computing power and time they use for their runs?
Much appreciated
Helen