Folks,
I am trying to run a CuffDiff analysis as indicated below. There are 19 single-end read files, and I am trying an all-vs-all comparison because that is what the users want to see.
The data has been through TopHat 2.0.8 with Bowtie 2.1.0 and then Cufflinks 2.0.2. I am using a small 16-node cluster where each compute node has 24 GB of RAM, 4 GB of swap, and a fairly large local drive, running 64-bit RHEL 6.3 Linux under the IBM/Platform LSF scheduler.
I run TopHat/Bowtie with -p 8. When I get to CuffDiff I change to -p 32 (which crashed last night, as shown below) or -p 64 (currently running).
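For reference, here is a simplified sketch of the LSF submission I am using for the CuffDiff step (the queue name, job name, and shortened BAM list are placeholders, not my exact script; the full command line appears in the log below):

    # simplified sketch of the bsub submission; queue/job names are placeholders
    bsub -q normal -J cuffdiff_all -n 32 -o cuffdiff.%J.out \
        cuffdiff -o April09_CD_TH8BT210VS \
            --max-bundle-frags=125000 \
            -b $BOWTIE_INDEXES/rn4_ENSEMBL_genome.fa \
            -p 32 \
            -L C21,C22,C23,C24,C31,C32,C33,C61,C62,C63,S21,S22,S23,S31,S32,S33,S61,S62,S63 \
            -u April09_TH8BT210VS_merged_asm/merged.gtf \
            <the 19 accepted_hits.bam files>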
The program runs for a long time and then gets killed with this exit code:
Exited with exit code 137.
Resource usage summary:
CPU time : 252041.44 sec.
Max Memory : 23188 MB
Max Swap : 28951 MB
Max Processes : 4
Max Threads : 38
> Default Std Dev: 80
[14:33:33] Calculating preliminary abundance estimates
> Processed 32363 loci. [*************************] 100%
[19:13:55] Learning bias parameters.
[19:57:59] Testing for differential expression and regulation in locus.
> Processing Locus 1:110727025-110845380 [* ] 5%/home/hazards/.lsbatch/1365530240.7033.shell: line 15: 24074 Killed cuffdiff -o April09_CD_TH8BT210VS --max-bundle-frags=125000 -b $BOWTIE_INDEXES/rn4_ENSEMBL_genome.fa -p 32 -L C21,C22,C23,C24,C31,C32,C33,C61,C62,C63,S21,S22,S23,S31,S32,S33,S61,S62,S63 -u April09_TH8BT210VS_merged_asm/merged.gtf NAc_E2_COCAINE_0429L1S2-C2sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_COCAINE_0525L1S2-C2sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_COCAINE_0614L5S2-C2sw.txt.gz.TH8BT210VS/accepted_hits.bam Blumer120222HiSeqRun_Sample_Sal-2.fastq.gz.TH8BT210VS/accepted_hits.bam NAc_E3_COCAINE_0429L4C3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_COCAINE_0505L5C3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_COCAINE_0505L6C3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_COCAINE_0429L6S6_C6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_COCAINE_0505L7S6-C6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_COCAINE_0614L6S6-C6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_SALINE_0429L2C2-S2sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_SALINE_0505L1C2-S2swa.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_SALINE_0505L2C2-S2swb.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_SALINE_0429L3S3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_SALINE_0505L3S3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_SALINE_0505L4S3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_SALINE_0429L7C6_S6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_SALINE_0505L8C6-S6sw.txt.gz.TH8BT210VS/accepted_hits.bam Blumer120222HiSeqRun_Sample_Coc-6.fastq.gz.TH8BT210VS/accepted_hits.bam
Tue Apr 9 13:57:24: Dispatched to 32 Hosts/Processors <8*compute006> <8*compute007> <8*compute008> <8*compute009>;
Tue Apr 9 23:10:02: Completed <exit>.
Accounting information about this job:
Share group charged </hazards>
CPU_T WAIT TURNAROUND STATUS HOG_FACTOR MEM SWAP
252041.44 4 33162 exit 7.6003 23188M 28951M
------------------------------------------------------------------------------
SUMMARY: ( time unit: second )
Total number of done jobs: 0 Total number of exited jobs: 1
Total CPU time consumed: 252041.4 Average CPU time consumed: 252041.4
Maximum CPU time of a job: 252041.4 Minimum CPU time of a job: 252041.4
Total wait time in queues: 4.0
Average wait time in queue: 4.0
Maximum wait time in queue: 4.0 Minimum wait time in queue: 4.0
Average turnaround time: 33162 (seconds/job)
Maximum turnaround time: 33162 Minimum turnaround time: 33162
Average hog factor of a job: 7.60 ( cpu time / turnaround time )
Maximum hog factor of a job: 7.60 Minimum hog factor of a job: 7.60
So as I understand it, the system is killing the program because it is demanding ~29-30 GB of swap, which the hardware does not have (exit code 137 is 128 + 9, i.e. the process got SIGKILL, which is what you typically see when the kernel's OOM killer steps in).
So I figured I'd double the number of processes to 64 (8 full nodes, each with 4 GB of swap).
Right now (about 5 hours into the run) the job is resident on ONE node only, in spite of the fact that it has been dispatched to 64 hosts/processors.
So, what does "-p 64" mean? The manual says:
"-p/--num-threads <int> Use this many threads to align reads. The default is 1."
The output says (for the case where -p 32 was used):
Max Processes : 4
Max Threads : 38
How does -p 32 turn into just 4 processes and 38 threads?
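My tentative reading is that cuffdiff is a single multithreaded process (pthreads), not an MPI program, so -p only sets the number of worker threads inside that one process on one host; the slots LSF hands out on the other nodes sit idle, and all of the memory still has to fit on the single node the process lands on. If that is right, something like the following would at least keep the slot request and the memory reservation on one host (the slot count and memory number are guesses for our nodes, and I am not sure whether rusage[mem=...] counts per slot or per host in our LSF configuration, so this is untested):

    # sketch: pin the job to a single host and reserve memory there (values are guesses, untested)
    bsub -n 8 -R "span[hosts=1] rusage[mem=20000]" \
        cuffdiff -p 8 -o April09_CD_TH8BT210VS ... <same options and BAM list as above>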
I CAN change the model to C2,C3,C6,S2,S3,S6 and that will finish (roughly the pooled layout sketched below), but the users of the data want to see all the replicates C21,C22,C23,C24,C31,C32,C33,C61,C62,C63,S21,S22,S23,S31,S32,S33,S61,S62,S63 at once.
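For clarity, the pooled model I mean follows the Cufflinks manual's convention that replicate BAMs of the same condition are given as one comma-separated list under a single label, so the 19 files collapse to six conditions, roughly like this (file names shortened here for readability; the real paths are the accepted_hits.bam directories in the log above):

    # pooled six-condition model: replicates of each condition are comma-separated (paths shortened)
    cuffdiff -o April09_CD_pooled -p 8 \
        -b $BOWTIE_INDEXES/rn4_ENSEMBL_genome.fa \
        -L C2,C3,C6,S2,S3,S6 \
        -u April09_TH8BT210VS_merged_asm/merged.gtf \
        C2_rep1.bam,C2_rep2.bam,C2_rep3.bam,C2_rep4.bam \
        C3_rep1.bam,C3_rep2.bam,C3_rep3.bam \
        C6_rep1.bam,C6_rep2.bam,C6_rep3.bam \
        S2_rep1.bam,S2_rep2.bam,S2_rep3.bam \
        S3_rep1.bam,S3_rep2.bam,S3_rep3.bam \
        S6_rep1.bam,S6_rep2.bam,S6_rep3.bam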
What I really want to do is get the program to finish. Got any suggestions?
Starr