
  • CuffDiff Threads vs Processes

    Folks,
    I am trying to run a CuffDiff analysis as indicated below. There are 19 single-end read files, and I am attempting an all-vs-all comparison because that is what the users wish to see.
    The data has been through TopHat 2.0.8 with Bowtie 2.1.0, then Cufflinks 2.0.2. I am using a small 16-node cluster where each compute node has 4 GB swap, 24 GB RAM, and a largish local drive, running 64-bit RHEL 6.3 Linux with the IBM/Platform LSF scheduler.

    I run TopHat/Bowtie with -p 8. When I get to CuffDiff I change to -p 32 (which crashed last night, as shown below) or -p 64 (currently running).

    The program runs for a long time and then gets killed with an exit code:

    Exited with exit code 137.

    Resource usage summary:

    CPU time : 252041.44 sec.
    Max Memory : 23188 MB
    Max Swap : 28951 MB

    Max Processes : 4
    Max Threads : 38

    > Default Std Dev: 80
    [14:33:33] Calculating preliminary abundance estimates
    > Processed 32363 loci. [*************************] 100%
    [19:13:55] Learning bias parameters.
    [19:57:59] Testing for differential expression and regulation in locus.
    > Processing Locus 1:110727025-110845380 [* ] 5%/home/hazards/.lsbatch/1365530240.7033.shell: line 15: 24074 Killed cuffdiff -o April09_CD_TH8BT210VS --max-bundle-frags=125000 -b $BOWTIE_INDEXES/rn4_ENSEMBL_genome.fa -p 32 -L C21,C22,C23,C24,C31,C32,C33,C61,C62,C63,S21,S22,S23,S31,S32,S33,S61,S62,S63 -u April09_TH8BT210VS_merged_asm/merged.gtf NAc_E2_COCAINE_0429L1S2-C2sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_COCAINE_0525L1S2-C2sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_COCAINE_0614L5S2-C2sw.txt.gz.TH8BT210VS/accepted_hits.bam Blumer120222HiSeqRun_Sample_Sal-2.fastq.gz.TH8BT210VS/accepted_hits.bam NAc_E3_COCAINE_0429L4C3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_COCAINE_0505L5C3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_COCAINE_0505L6C3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_COCAINE_0429L6S6_C6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_COCAINE_0505L7S6-C6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_COCAINE_0614L6S6-C6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_SALINE_0429L2C2-S2sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_SALINE_0505L1C2-S2swa.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E2_SALINE_0505L2C2-S2swb.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_SALINE_0429L3S3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_SALINE_0505L3S3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E3_SALINE_0505L4S3.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_SALINE_0429L7C6_S6sw.txt.gz.TH8BT210VS/accepted_hits.bam NAc_E6_SALINE_0505L8C6-S6sw.txt.gz.TH8BT210VS/accepted_hits.bam Blumer120222HiSeqRun_Sample_Coc-6.fastq.gz.TH8BT210VS/accepted_hits.bam



    Tue Apr 9 13:57:24: Dispatched to 32 Hosts/Processors <8*compute006> <8*compute007> <8*compute008> <8*compute009>;
    Tue Apr 9 23:10:02: Completed <exit>.

    Accounting information about this job:
    Share group charged </hazards>
    CPU_T      WAIT  TURNAROUND  STATUS  HOG_FACTOR  MEM     SWAP
    252041.44  4     33162       exit    7.6003      23188M  28951M
    ------------------------------------------------------------------------------

    SUMMARY: ( time unit: second )
    Total number of done jobs: 0 Total number of exited jobs: 1
    Total CPU time consumed: 252041.4 Average CPU time consumed: 252041.4
    Maximum CPU time of a job: 252041.4 Minimum CPU time of a job: 252041.4
    Total wait time in queues: 4.0
    Average wait time in queue: 4.0
    Maximum wait time in queue: 4.0 Minimum wait time in queue: 4.0
    Average turnaround time: 33162 (seconds/job)
    Maximum turnaround time: 33162 Minimum turnaround time: 33162
    Average hog factor of a job: 7.60 ( cpu time / turnaround time )
    Maximum hog factor of a job: 7.60 Minimum hog factor of a job: 7.60


    So, as I understand it, the system is killing the program because it's demanding ~29-30 GB of swap, which the hardware does not have.
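For what it's worth, exit code 137 decodes as 128 plus signal 9 (SIGKILL), which on Linux usually means the kernel OOM killer or the scheduler enforced a memory limit. A quick sketch of the arithmetic (the dmesg check would need to be run on the node that hosted the job):

```shell
# Decode the exit status: 137 = 128 + 9, i.e. the process died on SIGKILL.
code=137
sig=$((code - 128))
echo "killed by signal $sig"

# On the compute node, the kernel log usually confirms an OOM kill:
#   dmesg | grep -i -E 'out of memory|killed process'
```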

    So I figured as follows: I'll double the number of processes to 64 (8 full nodes, each with 4 GB of swap).

    Right now (about 5 hours into the run) the job is resident on ONE node only, in spite of the fact that it has been dispatched to 64 hosts.

    So, what does "-p 64" mean? The manual says:
    "-p/--num-threads <int> Use this many threads to align reads. The default is 1."
    The output says (for the case where -p 32):
    Max Processes : 4
    Max Threads : 38

    How does it turn -p 32 into just 4 processes and 38 threads?
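For context, -p is a thread count inside a single cuffdiff process; cuffdiff has no MPI layer, so it can never spread across hosts, and slots reserved on other nodes simply sit idle. A hedged sketch of an LSF submission that keeps the reservation on one host (core and memory numbers are illustrative, not verified against this cluster; -M units depend on LSF_UNIT_FOR_LIMITS):

```shell
# cuffdiff is one multithreaded process, so reserve all slots on ONE host
# and size -p to that host's core count. -M caps the memory limit so LSF
# kills the job cleanly instead of letting it thrash swap.
cmd='bsub -n 24 -R "span[hosts=1] rusage[mem=23000]" -M 23000 cuffdiff -p 24 ...'
echo "$cmd"
```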

    I CAN change the model to C2,C3,C6,S2,S3,S6 and that will finish, but the users of the data want to see all the replicates C21,C22,C23,C24,C31,C32,C33,C61,C62,C63,S21,S22,S23,S31,S32,S33,S61,S62,S63 at once.
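For scale: with 19 labels, an all-vs-all cuffdiff run tests every pair of conditions at every locus, i.e. C(19,2) = 171 pairwise comparisons, versus 15 for the 6-label collapse. A one-liner to check:

```shell
# Pairwise comparisons for n labels in an all-vs-all design: n*(n-1)/2
n=19
echo $(( n * (n - 1) / 2 ))   # → 171
```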

    What I really want to do is get the program to finish. Got any suggestions?

    Starr

  • #2
    Hi Starr,

    Did you manage to solve the issue? I am encountering the same problem.
    The system is killing the process at 99%, at exactly the location you mentioned.

    Code:
    Testing for differential expression and regulation in locus.
    Processing Locus chr7:141695678-142985271    [************************ ]  99% Killed
    I couldn't find any fix for this so far.

    Thanks!

    Comment


    • #3
      Your operating system is killing the program, look in the system logs for why. It's likely that you're running out of memory.

      Comment


      • #4
        Hi Devon,

        Thanks for the hint. I've checked my syslogs but didn't find anything related to the crash. In any case, I am running on a system with 24 cores and 128 GB RAM. When I encountered this error, I restarted the process with only 3 BAM files per condition (each ~500 MB). My command is below for reference. Do you see anything unusual?

        I've also tried running cuffdiff in verbose mode, and it has been stuck at this location for a long time now; I don't know what's happening.

        Code:
        Reduced 1 frags to 1 (0.000000 percent)
        Reduced 12 frags to 12 (0.000000 percent)
        Reduced 11 frags to 11 (0.000000 percent)
        Testing for differential expression and regulation in locus [chrY:59213961-59276439]
        Testing for differential expression and regulation in locus [chrY:15262680-15592550]
        I've also checked the memory consumption, but cuffdiff seems to occupy only ~40% of what's available.

        Code:
         PID   USER    PR  NI  VIRT   RES  SHR  S  %CPU  %MEM    TIME+    COMMAND
         1389  xxxxx   20   0  98.4g  97g  2628 S  200   38.8  4614:57    cuffdiff
        Many Thanks,
        Nikhil.

        Here is my run command:

        Code:
        ~/bin/tophat-cufflinks/cuffdiff \
        --library-type fr-secondstrand \
        --dispersion-method per-condition \
        -b ~/bin/tophat-cufflinks/indexes-bowtie/hg19_c.fa \
        ~/bin/tophat-cufflinks/indexes-transcriptome/hg19_all_mrna_chr1-22XYM.gtf \
        -o diff_out_17.vs.18 \
        -p 8 -L SP,nonSP \
        ./Library17/th_FC2_lane2/accepted_hits.bam,./Library17/th_FC2_lane3/accepted_hits.bam,./Library17/th_FC2_lane5/accepted_hits.bam \
        ./Library18/th_FC2_lane3/accepted_hits.bam,./Library18/th_FC2_lane4/accepted_hits.bam,./Library18/th_FC2_lane5/accepted_hits.bam
        Last edited by mallela; 03-15-2014, 06:54 AM.

        Comment


        • #5
          I guess a system with 128GB of RAM is unlikely to get maxed out by cufflinks

          I don't see anything obviously wrong there, unfortunately. If nothing else, try masking that region (add it to a mask.gtf file and then specify that with the -M option).

          Comment


          • #6
            Well, following the cufflinks FAQ [1], I tried masking (-M option) all the repeat regions using the rmsk table obtained from UCSC, combined with chrM from the UCSC mRNA table. But no success; it still hangs at 99%. I have now run out of options.

            Masking the regions where it hangs? That seems illogical to me, because it hangs at different regions on different runs!

            It should be noted that my problem is not with cufflinks but with cuffdiff.


            Ref:
            1. http://cufflinks.cbcb.umd.edu/faq.html#h99p

            Comment


            • #7
              Ah, I was under the impression that it was just hanging in one specific place. You might try the cufflinks google group, since you're more likely to get feedback from the original authors there.

              Comment


              • #8
                same problem here

                Have you guys had any luck? Both of my cuffdiff jobs are stuck at 99% and have been for 36-48 hours now. The process isn't terminating or giving an error message, and there is plenty of memory available.

                Here's the command I used:
                cuffdiff -b index_mouse.fa -p 20 -L M1,M2 --max-bundle-frags 500000 merged_asm/merged.gtf ./M1_1_3.th/accepted_hits.bam,./M1_2_13.th/accepted_hits.bam,./M1_2_14.th/accepted_hits.bam,./M1_2_15.th/accepted_hits.bam,./M1_2_17.th/accepted_hits.bam ./M2_1_2.th/accepted_hits.bam,./M2_2_8.th/accepted_hits.bam,./M2_2_23.th/accepted_hits.bam,./M2_2_24.th/accepted_hits.bam,./M2_2_26.th/accepted_hits.bam

                Comment


                • #9
                  cuffdiff being stuck at 99% is probably because of duplicates in the BAM file. When I marked the optical/sequence duplicates using Picard tools and then fed the resulting BAM files into cuffdiff, I didn't see this problem again.

                  Comment


                  • #10
                    Would you mind telling me exactly what that means? I'm brand new to data analysis, so this entire pipeline has been new to me. I'm looking at this site (https://broadinstitute.github.io/pic...-overview.html) to try and figure out what you mean by optical/sequence duplicates.

                    Is this a command I run prior to cuffdiff to remove and/or mark the duplicates?

                    Comment


                    • #11
                      Forget about the "optical/sequence" terminology. I was just being a little descriptive.
                      Just read the description on the site for MarkDuplicates and tailor your command according to the read names in your bam file. That's it.

                      For example if your read name is 611_653_1319 then the READ_NAME_REGEX=([0-9]+)_([0-9]+)_([0-9]+)

                      Read the description here, you will know what I mean.


                      And if Picard is throwing warnings and halting the process, use the option VALIDATION_STRINGENCY=SILENT.
                      But it's good to know why it is throwing those warnings, and one should know whether they are okay to ignore. For the rest of the options, I usually go with the defaults.

                      Comment


                      • #12
                        So, I think I understand the use of this command, but I don't know what the read name is. When I try to look at my BAM file, it comes back as a bunch of symbols.

                        Comment


                        • #13
                          Can you paste the output of the following command? I will then have a better understanding of what you mean.

                          samtools view accepted_hits.bam | head | cut -f1

                          Comment


                          • #14
                            Wow, I didn't realize I could view the sample with samtools so thank you for pointing that out! When I use that command, this is the output

                            D00356:163:C905VANXX:2:2104:15983:9631
                            D00356:163:C905VANXX:2:2108:20725:65407
                            D00356:163:C905VANXX:2:1311:11892:9460
                            D00356:163:C905VANXX:2:1303:17471:73446
                            D00356:163:C905VANXX:2:2301:2337:29197
                            D00356:163:C905VANXX:2:2304:6530:9328
                            D00356:163:C905VANXX:2:2201:5062:22265
                            D00356:163:C905VANXX:2:2106:9823:26894
                            D00356:163:C905VANXX:2:1303:17766:73767
                            D00356:163:C905VANXX:2:1102:10798:24451

                            So how would I formulate the MarkDuplicates command line with this information? Would it be something like:

                            java -jar picard.jar MarkDuplicates I=accepted_hits.bam O=marked_duplicates.bam M=marked_dup_metrics.txt

                            I'm not really sure where the Read_Name_Regex fits in. Thank you so much for all of your help!

                            Comment


                            • #15
                              These are your read names. If one has to write a regex pattern to match such names, it would be this:

                              Code:
                              "([A-Z0-9]+):([0-9]+):([A-Z0-9]+):([0-9]+):([0-9]+):([0-9]+):([0-9]+)"
                              Your Picard command would then be as below. Hope it works.

                              Code:
                              java -Xmx4G -jar picard.jar MarkDuplicates \
                              I=accepted_hits.bam \
                              O=accepted_hits_md.bam \
                              CREATE_INDEX=true \
                              VALIDATION_STRINGENCY=SILENT \
                              M=accepted_hits_md.txt \
                              READ_NAME_REGEX="([A-Z0-9]+):([0-9]+):([A-Z0-9]+):([0-9]+):([0-9]+):([0-9]+):([0-9]+)"
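Before committing to the long Picard run, the regex can be sanity-checked against one of the read names pasted above (grep -E is a reasonable stand-in here, though its dialect is not identical to Picard's Java regex engine):

```shell
# Count how many of the sample read names the proposed regex matches.
# Should print 1 for a matching name.
echo 'D00356:163:C905VANXX:2:2104:15983:9631' \
  | grep -Ec '^([A-Z0-9]+):([0-9]+):([A-Z0-9]+):([0-9]+):([0-9]+):([0-9]+):([0-9]+)$'
```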
                              Last edited by GenoMax; 05-09-2016, 10:15 AM. Reason: Adding [CODE] tags to improve readability

                              Comment
