  • Cufflinks Runtime

    Hello, we're beginners who are using cufflinks for the first time to assemble a transcriptome. We have ~80 million tophat aligned reads and have been surprised by the significant amount of time that it has taken to assemble transcripts with cufflinks. We're not quite sure if it's an issue of computing power, something we've done incorrectly, or, since we're beginners, just the standard amount of time required. We are not using a reference annotation for assembly. At the pace the assembly has been moving, it looks like it will take 6-7 days to complete the assembly for this ~80 million read library. Is this normal? Any suggestions or comments would be greatly appreciated! Here are the specs:

    8 CPUs running at 2.83 GHz
    32 GB RAM
    8.8 TB free memory

  • #2
    We're having very similar issues. Our cluster gives us 96 hours to complete a job, but sometimes we hit this wall-time limit. Any tips for improving speed with cufflinks would be highly appreciated.


    • #3
      At which step did you notice the program spending most of its time? Below is one entry from the Cufflinks FAQ:

      I'm trying to assemble a sample. Cufflinks is almost done, but it seems to be hanging at "99% complete". What's going on?

      Cufflinks spawns threads for each locus to assemble and quantitate the "bundle" of reads in that locus. Some loci may have more reads and more complicated alternative splicing than others, which requires more CPU cycles. These bundles can continue processing long after all others have completed, leading to this behavior. You may be able to decrease the number of such bundles by masking out ribosomal and mitochondrial RNA using the -M/--mask-file option described in the Manual.
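
The FAQ answer above can be illustrated with a toy scheduling model. This is only a sketch under stated assumptions, not how Cufflinks is implemented: it assumes each bundle runs on a single thread and is handed to whichever thread frees up first, and the bundle costs are hypothetical numbers chosen for illustration.

```python
import heapq

def wall_time(bundle_costs, threads):
    """Toy model: each bundle goes to the thread that becomes free first."""
    finish = [0.0] * threads              # time at which each thread is free
    heapq.heapify(finish)
    for cost in bundle_costs:             # bundles are taken in input order
        start = heapq.heappop(finish)     # earliest available thread
        heapq.heappush(finish, start + cost)
    return max(finish)                    # run ends when the last thread ends

# Hypothetical workload: 1000 ordinary loci costing 1 unit each, plus one
# very deep (e.g. rRNA-like) bundle costing 5000 units.
costs = [1.0] * 1000 + [5000.0]
for p in (1, 4, 8):
    print(p, wall_time(costs, p))
# 1 6000.0
# 4 5250.0
# 8 5125.0

# With the heavy bundle masked out, the same 8 threads finish much sooner.
print(wall_time([1.0] * 1000, 8))         # 125.0
```

In this model the total time can never drop below the cost of the largest single bundle, however many threads are available. That would be consistent with run times that barely change with the thread count, and with masking helping more than adding threads.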


      • #4
        Originally posted by DZhang
        At which step did you notice the program spending most of its time? Below is one entry from the Cufflinks FAQ:

        I'm trying to assemble a sample. Cufflinks is almost done, but it seems to be hanging at "99% complete". What's going on?

        Cufflinks spawns threads for each locus to assemble and quantitate the "bundle" of reads in that locus. Some loci may have more reads and more complicated alternative splicing than others, which requires more CPU cycles. These bundles can continue processing long after all others have completed, leading to this behavior. You may be able to decrease the number of such bundles by masking out ribosomal and mitochondrial RNA using the -M/--mask-file option described in the Manual.
        We noticed it hanging around 71% for a particularly long time one day, but since we had to leave it running over several nights, it's hard to say whether or not this was unusual.

        Also, it seems to take about the same amount of time regardless of how many threads we tell it to use (we've tried 1, 4, 7, and 8). However, that answer from the FAQ makes it sound like cufflinks spawns threads automatically, so we're wondering if maybe we misunderstood the -p option?


        • #5
          I believe the -p option at least works for bowtie.

          Douglas


          • #6
            We have ~175 million aligned reads. I am running cufflinks version 1.0.3. It has been running for the last 8 days and has not yet completed. The cufflinks output files are only updated every 1 or 2 days. Is this normal?

            Here are the cufflinks options I used:
            cufflinks --GTF-guide refseq.gtf --frag-bias-correct /indexes --multi-read-correct -p 12

            The machine specs are:
            32 processors (2.4 GHz each)
            512 GB RAM

            Any thoughts would be greatly appreciated!

            Thanks!

            -Dhiral.


            • #7
              I think you should consider using a mask file (see post #3 above). cufflinks was also taking a long time to run on my data; when I had a look at the region where it was stalling, I could see that very many reads were aligning there. Creating a GFF file to mask these regions (with -M) solved the problem in my case.
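
One way to locate such regions before running the assembly is to count read start positions in fixed-size windows. The following is a minimal sketch, not a Cufflinks feature: the function name, the window size and the threshold are all illustrative assumptions, and the input is assumed to be (chromosome, position) pairs such as columns 3 and 4 of SAM-format alignment records.

```python
from collections import Counter

def dense_windows(alignments, window=1000, min_reads=100000):
    """Return (chrom, start, end, count) for windows with many read starts.

    alignments: iterable of (chrom, pos) pairs, pos being the 1-based
    leftmost mapped position of a read. Window size and threshold are
    arbitrary illustrative defaults and need tuning per dataset.
    """
    counts = Counter()
    for chrom, pos in alignments:
        counts[(chrom, (pos - 1) // window)] += 1
    hits = []
    for (chrom, idx), n in sorted(counts.items()):
        if n >= min_reads:
            # Convert the window index back to 1-based inclusive coordinates.
            hits.append((chrom, idx * window + 1, (idx + 1) * window, n))
    return hits

# Tiny made-up example with a low threshold so the result is easy to check.
demo = [("Chr2", 5800)] * 5 + [("Chr2", 20000)] + [("Chr3", 14197700)] * 3
print(dense_windows(demo, window=1000, min_reads=3))
# [('Chr2', 5001, 6000, 5), ('Chr3', 14197001, 14198000, 3)]
```

Windows reported this way are only candidates. They still need to be checked against the annotation before masking, since a deeply covered region may also contain a genuinely highly expressed transcript.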


              • #8
                Originally posted by thurisaz
                I think you should consider using a mask file (see post #3 above). cufflinks was also taking a long time to run on my data; when I had a look at the region where it was stalling, I could see that very many reads were aligning there. Creating a GFF file to mask these regions (with -M) solved the problem in my case.
                You created a mask file with the regions that it was stalling at? There could be valid transcripts in those regions if many reads were aligning there. Or did you mask out specific ribosomal RNA and mitochondrial RNA regions?


                • #9
                  Yes, I created a mask file to exclude the regions it was stalling at, since they were hugely over-represented and the analysis wouldn't finish otherwise. Comparing them now, however, I see that they do cover the annotated rRNA as well as some extra regions:

                  Code:
                  Problem areas in my run:
                  Chr2    TAIR10  exon    1900    10200   .       .       .       ID=Chr2_problem_area
                  Chr3    TAIR10  exon    14143000        14145000        .       .       .       ID=Chr3_problem_area1
                  Chr3    TAIR10  exon    14195800        14204100        .       .       .       ID=Chr3_problem_area2

                  Annotated rRNA:
                  Chr2  TAIR10  rRNA  5782  5945  . + . ID=AT2G01020.1;Parent=AT2G01020;Name=AT2G01020.1;Index=1
                  Chr3  TAIR10  rRNA  14197677  14199484  . + . ID=AT3G41768.1;Parent=AT3G41768;Name=AT3G41768.1;Index=1
                  Chr3  TAIR10  rRNA  14199753  14199916  . + . ID=AT3G41979.1;Parent=AT3G41979;Name=AT3G41979.1;Index=1
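
A mask file like the one listed above can also be generated from a list of regions rather than typed by hand. This is a minimal sketch: the helper name `mask_lines` is hypothetical, the coordinates are the three problem areas from the listing, and the output uses tab-separated nine-column GFF lines with score, strand and phase left as ".".

```python
def mask_lines(regions, source="TAIR10", feature="exon"):
    """Format (chrom, start, end, name) tuples as tab-separated GFF lines."""
    lines = []
    for chrom, start, end, name in regions:
        if start > end:
            raise ValueError(f"start > end for region {name}")
        fields = [chrom, source, feature, str(start), str(end),
                  ".", ".", ".", "ID=" + name]
        lines.append("\t".join(fields))
    return lines

# The three problem areas reported in the post.
regions = [
    ("Chr2", 1900, 10200, "Chr2_problem_area"),
    ("Chr3", 14143000, 14145000, "Chr3_problem_area1"),
    ("Chr3", 14195800, 14204100, "Chr3_problem_area2"),
]
for line in mask_lines(regions):
    print(line)
```

The printed lines can be redirected to a file and passed to cufflinks with the -M/--mask-file option mentioned earlier in the thread.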


                  • #10
                    Did anyone find a way to get Cufflinks to run faster on a large file without masking transcripts or letting cufflinks run for a longer period of time?

                    I wish I could just divide the file in half and then figure out a way to merge the FPKMs.


                    • #11
                      Originally posted by zorph
                      Did anyone find a way to get Cufflinks to run faster on a large file without masking transcripts or letting cufflinks run for a longer period of time?

                      I wish I could just divide the file in half and then figure out a way to merge the FPKMs.
                      You could in theory divide the input BAMs by the chromosome the reads map to and then run a separate cufflinks process for each chromosome. You'd have to find some way to renormalise the FPKMs afterwards.
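
As a rough sketch of what that renormalisation could look like: FPKM is fragments per kilobase of transcript per million mapped fragments, so if each per-chromosome run divides by the number of fragments mapped to that chromosome only, the value can be rescaled by the ratio of that count to the whole-library count. This is an assumption about how the per-chromosome runs normalise, and it ignores multi-mapped reads shared between chromosomes and any bias correction. The function name and the numbers are hypothetical.

```python
def renormalise_fpkm(fpkm_chr, mapped_chr, mapped_total):
    """Rescale an FPKM from a single-chromosome run to the whole library.

    Assumes fpkm_chr was computed as 1e9 * C / (mapped_chr * length), so the
    whole-library value 1e9 * C / (mapped_total * length) is obtained by
    multiplying by mapped_chr / mapped_total.
    """
    if mapped_total <= 0 or not 0 <= mapped_chr <= mapped_total:
        raise ValueError("need 0 <= mapped_chr <= mapped_total and mapped_total > 0")
    return fpkm_chr * mapped_chr / mapped_total

# Hypothetical numbers: a transcript reported at FPKM 50 in a run that saw
# 10 million of the library's 80 million mapped fragments.
print(renormalise_fpkm(50.0, 10_000_000, 80_000_000))  # 6.25
```

The per-chromosome and total mapped counts would have to be taken from the same alignment file that was split, otherwise the rescaled values are not comparable across chromosomes.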
