
  • Cufflinks Runtime

    Hello, we're beginners who are using cufflinks for the first time to assemble a transcriptome. We have ~80 million tophat aligned reads and have been surprised by the significant amount of time that it has taken to assemble transcripts with cufflinks. We're not quite sure if it's an issue of computing power, something we've done incorrectly, or, since we're beginners, just the standard amount of time required. We are not using a reference annotation for assembly. At the pace the assembly has been moving, it looks like it will take 6-7 days to complete the assembly for this ~80 million read library. Is this normal? Any suggestions or comments would be greatly appreciated! Here are the specs:

    8 CPUs running at 2.83 GHz
    32 GB RAM
    8.8 TB free memory

  • #2
    We're having very similar issues. Our cluster gives us 96 hours to complete each job, but sometimes we hit this wall-time limit. Any tips for speeding up cufflinks would be highly appreciated.

    • #3
      At which step did you notice the program spending most of its time? Below is one entry from the Cufflinks FAQ:

      I'm trying to assemble a sample. Cufflinks is almost done, but it seems to be hanging at "99% complete". What's going on?

      Cufflinks spawns threads for each locus to assemble and quantitate the "bundle" of reads in that locus. Some loci may have more reads and more complicated alternative splicing than others, which requires more CPU cycles. These bundles can continue processing long after all others have completed, leading to this behavior. You may be able to decrease the number of such bundles by masking out ribosomal and mitochondrial RNA using the -M/--mask-file option described in the Manual.

      • #4
        Originally posted by DZhang
        At which step did you notice the program spending most of its time? Below is one entry from the Cufflinks FAQ:

        I'm trying to assemble a sample. Cufflinks is almost done, but it seems to be hanging at "99% complete". What's going on?

        Cufflinks spawns threads for each locus to assemble and quantitate the "bundle" of reads in that locus. Some loci may have more reads and more complicated alternative splicing than others, which requires more CPU cycles. These bundles can continue processing long after all others have completed, leading to this behavior. You may be able to decrease the number of such bundles by masking out ribosomal and mitochondrial RNA using the -M/--mask-file option described in the Manual.
        We noticed it hanging around 71% for a particularly long time one day, but since we had to leave it running over several nights, it's hard to say whether or not this was unusual.

        Also, it seems to take about the same amount of time regardless of how many threads we tell it to use (we've tried 1, 4, 7, and 8). However, that answer from the FAQ makes it sound like cufflinks spawns threads automatically, so we're wondering if maybe we misunderstood the -p option?
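One possible reading of the FAQ entry quoted above is that wall time is bounded below by the slowest single bundle, so extra threads help little once one locus dominates. The toy simulation below is only a sketch of that idea: the function `makespan`, the greedy scheduling rule, and the bundle costs are illustrative assumptions, not a description of how Cufflinks actually schedules its work.

```python
import heapq


def makespan(bundle_costs, n_threads):
    """Greedy schedule: each bundle goes to the thread that frees up first.

    Returns the wall time at which the last bundle finishes.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    # One entry per thread: the time at which that thread becomes free.
    free_at = [0.0] * n_threads
    heapq.heapify(free_at)
    for cost in bundle_costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical workload: one huge locus (e.g. rRNA) plus 1000 small loci.
costs = [100000.0] + [1.0] * 1000
for p in (1, 4, 8):
    print(p, makespan(costs, p))
```

With these made-up costs the total time barely changes between 1 and 8 threads, because the single large bundle cannot be split across threads in this model.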

        • #5
          I believe the -p option at least works for bowtie.

          Douglas

          • #6
            We have ~175 million aligned reads. I am running cufflinks version 1.0.3. It has been running for the last 8 days and has not yet completed. The cufflinks output files are only updated every 1 or 2 days. Is this normal?

            Here are the cufflinks options I used:
            cufflinks --GTF-guide refseq.gtf --frag-bias-correct /indexes --multi-read-correct -p 12

            The machine specs are:
            32 processors (2.4 GHz each)
            512 GB RAM

            Any thoughts would be greatly appreciated!

            Thanks!

            -Dhiral.

            • #7
              I think you should consider using a mask file (see post #3 above). cufflinks was also taking a long time to run on my data; when I had a look at the region where it was stalling, I could see that very many reads were aligning there. Creating a GFF file to mask these regions (with -M) solved the problem in my case.
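A rough way to locate such over-represented regions is to count alignment start positions in fixed-size bins and flag the bins above a threshold. The sketch below assumes the positions have already been extracted as `(chrom, pos)` pairs (for example from the RNAME and POS columns of SAM output); the function name `find_hotspots`, the bin size, and the threshold are all hypothetical choices, not values from this thread.

```python
from collections import Counter


def find_hotspots(read_starts, bin_size=1000, min_reads=10000):
    """Count reads per fixed-size bin and return bins with at least min_reads.

    read_starts: iterable of (chrom, pos) with 1-based alignment start positions.
    Returns a sorted list of (chrom, bin_start, bin_end, count), 1-based inclusive.
    """
    counts = Counter((chrom, (pos - 1) // bin_size) for chrom, pos in read_starts)
    hotspots = [
        (chrom, idx * bin_size + 1, (idx + 1) * bin_size, n)
        for (chrom, idx), n in counts.items()
        if n >= min_reads
    ]
    return sorted(hotspots)


# Toy input: five reads piled up in one bin and one read elsewhere.
reads = [("Chr2", 5800)] * 5 + [("Chr2", 50000)]
print(find_hotspots(reads, bin_size=1000, min_reads=3))
```

Flagged bins are only candidates; whether a region is safe to mask still needs checking against the annotation.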

              • #8
                Originally posted by thurisaz
                I think you should consider using a mask file (see post #3 above). cufflinks was also taking a long time to run on my data; when I had a look at the region where it was stalling, I could see that very many reads were aligning there. Creating a GFF file to mask these regions (with -M) solved the problem in my case.
                You created a mask file with the regions that it was stalling at? There could be valid transcripts in those regions if many reads were aligning there. Or did you mask out specific ribosomal RNA and mitochondrial RNA regions?

                • #9
                  Yes, I created a mask file to exclude the regions it was stalling at, since they were hugely over-represented and the analysis wouldn't finish otherwise. Comparing them now, however, I see that they do cover the annotated rRNA as well as some extra regions:

                  Code:
                  Problem areas in my run:
                  Chr2  TAIR10  exon  1900      10200     .  .  .  ID=Chr2_problem_area
                  Chr3  TAIR10  exon  14143000  14145000  .  .  .  ID=Chr3_problem_area1
                  Chr3  TAIR10  exon  14195800  14204100  .  .  .  ID=Chr3_problem_area2

                  Annotated rRNA:
                  Chr2  TAIR10  rRNA  5782      5945      .  +  .  ID=AT2G01020.1;Parent=AT2G01020;Name=AT2G01020.1;Index=1
                  Chr3  TAIR10  rRNA  14197677  14199484  .  +  .  ID=AT3G41768.1;Parent=AT3G41768;Name=AT3G41768.1;Index=1
                  Chr3  TAIR10  rRNA  14199753  14199916  .  +  .  ID=AT3G41979.1;Parent=AT3G41979;Name=AT3G41979.1;Index=1
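For anyone wanting to build a similar mask, the problem-area lines above could be generated from a list of regions roughly as follows. This is a minimal sketch: the helper `mask_gff_lines` and the default `source` value are made up for illustration, and the column layout simply mirrors the lines shown in this post.

```python
def mask_gff_lines(regions, source="mask", feature="exon"):
    """Format (chrom, start, end, name) tuples as tab-separated GFF lines."""
    lines = []
    for chrom, start, end, name in regions:
        if start > end:
            raise ValueError(f"start is greater than end for {name}")
        # Score, strand and phase are left as "." as in the lines above.
        fields = [chrom, source, feature, str(start), str(end),
                  ".", ".", ".", f"ID={name}"]
        lines.append("\t".join(fields))
    return lines


# Regions copied from the problem areas listed in this post.
regions = [
    ("Chr2", 1900, 10200, "Chr2_problem_area"),
    ("Chr3", 14143000, 14145000, "Chr3_problem_area1"),
    ("Chr3", 14195800, 14204100, "Chr3_problem_area2"),
]
for line in mask_gff_lines(regions):
    print(line)
```

The printed lines would be saved to a file and passed to cufflinks with the -M/--mask-file option mentioned in the FAQ entry.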

                  • #10
                    Did anyone find a way to make Cufflinks run faster on a large file without masking transcripts or simply letting it run for longer?

                    ***I wish I could just divide the file in half and then figure out a way to merge the FPKMs***

                    • #11
                      Originally posted by zorph
                      Did anyone find a way to make Cufflinks run faster on a large file without masking transcripts or simply letting it run for longer?

                      ***I wish I could just divide the file in half and then figure out a way to merge the FPKMs***
                      You could in theory divide the input BAMs by the chromosome the reads map to and then run a separate cufflinks process for each chromosome. You'd have to find some way to renormalise the FPKMs afterwards.
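As a first-order illustration of that renormalisation: if each per-chromosome FPKM is taken to be 10^9 * C / (N_chr * L), with C the fragments assigned to a transcript, L its length, and N_chr the mapped fragments in that chromosome's file, then multiplying by N_chr / N_total puts all values on the whole-library scale. The sketch below assumes exactly that simple formula; Cufflinks' real normalisation and its handling of multi-mapped reads may differ, and the function name and input dictionaries are hypothetical.

```python
def renormalise_fpkm(fpkm_by_chrom, mapped_by_chrom):
    """Rescale per-chromosome FPKMs to a whole-library denominator.

    fpkm_by_chrom:   {chrom: {transcript_id: fpkm}} from separate runs.
    mapped_by_chrom: {chrom: mapped fragment count used in that run}.
    """
    total = sum(mapped_by_chrom.values())
    if total <= 0:
        raise ValueError("total mapped fragment count must be positive")
    rescaled = {}
    for chrom, fpkms in fpkm_by_chrom.items():
        # FPKM_global = FPKM_chrom * N_chrom / N_total
        scale = mapped_by_chrom[chrom] / total
        rescaled[chrom] = {tid: value * scale for tid, value in fpkms.items()}
    return rescaled


# Made-up numbers purely for illustration.
fpkm = {"Chr1": {"t1": 10.0}, "Chr2": {"t2": 40.0}}
mapped = {"Chr1": 3_000_000, "Chr2": 1_000_000}
print(renormalise_fpkm(fpkm, mapped))
```

This only corrects the library-size denominator; any bias correction or multi-read correction computed within each split run would not be made comparable by this rescaling.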
