  • Running Tophat/Cufflinks on a cluster with *Multiple* nodes

    Hello,
    I have access to a computer cluster of 44 nodes. Each node has 12 cores and 48 GB of RAM. The main problem is that jobs have a maximum walltime of 6 h, after which they get killed. However, I can use as many nodes as I like, so a single job could use e.g. 10 nodes (120 cores and 480 GB of RAM) or more.
    So, to get a tophat job to finish within 6 h, I wanted to parallelize it across a very large number of cores, with appropriate RAM per core, and specify the number of cores through the -p parameter.

    My problem is that I can't get tophat to recognize the cores of multiple nodes, and I couldn't run tophat using openmpi (mpirun -np XX tophat etc).
    So I'm wondering whether tophat is multi-node capable. Is there a way to make it run on multiple nodes, or do I need a recompiled version? I could find hints of tophat being run through MPI (http://seqanswers.com/forums/archive...p/t-11472.html).

    Can anyone give me suggestions or alternatives?
    PS: a similar situation applies to cufflinks.

    Thanks,
    Best regards,

    Marco

  • #2
    Since tophat isn't written to use MPI, the instances run on each node will be blind to each other. There is no way around this without rewriting the program (actually, you'd need to rewrite bowtie as well). If you really want to use tophat, then just split your fastq files into small enough chunks and push a large number of tophat jobs onto the cluster (normally one would create a script to do this). If you're going that route anyway, just switch to STAR (you might have enough RAM, I'm not sure) and you'll get your results in a fraction of the time. The last alternative here would be Myrna, which seems to be designed with this sort of scenario in mind.
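
As a rough illustration of the "split your fastq files" step, here is a minimal Python sketch. It assumes plain (uncompressed) FASTQ with exactly four lines per record; the function name `split_fastq` and the `chunk_NNNN.fastq` naming are made up for this example. For paired-end data, both mate files would need to be split with the same `reads_per_chunk` so the chunks stay in sync.

```python
import itertools
from pathlib import Path


def split_fastq(fastq_path, out_dir, reads_per_chunk):
    """Split a FASTQ file into chunks of at most `reads_per_chunk` reads.

    Assumes every record is exactly 4 lines (no wrapped sequences).
    Returns the list of chunk paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(fastq_path) as handle:
        for index in itertools.count():
            # Read the next block of records (4 lines per read).
            lines = list(itertools.islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_path = out_dir / f"chunk_{index:04d}.fastq"
            chunk_path.write_text("".join(lines))
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be submitted as its own job on a single node, and the per-chunk alignments merged afterwards.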

    You'll have the same problems with cufflinks, namely that increasing the number of instances with MPI won't decrease runtime since the instances won't talk to each other. Your best bet there would be eXpress or to just use count-based methods.

    • #3
      @Marco: Only 6 h of walltime per job, odd indeed (hope there is some logic behind that restriction). Not sure what kind of queuing system your cluster uses but perhaps you can make a case for a separate queue for your jobs with a walltime of at least 24 h?

      • #4
        Thank you for the quick reply, it makes things much more clear to me.
        The cluster uses the PBS queuing system, and yes, there is another queue with a 24 h walltime that allows me to use a more limited number of nodes (not a constraint at this point, at least for tophat/cufflinks). But I couldn't finish a tophat run in 24 h on a single machine. I can split the FASTQ files to partially solve this first problem, but then I face the same problem with cufflinks.
        What happens if I split the bam files by chromosome? And then run 23 cufflinks jobs? Will I face serious problems in the quantification of the isoforms/normalization?

        • #5
          Like dpryan said, if you are time limited, use STAR (very, very fast, with the same or even better results than tophat).

          • #6
            Yes, thanks. I contacted the sysadmins to have a STAR module installed on the server, and I'll see how it works.

            • #7
              You don't need them to install anything, you can just do that yourself (just install it into your home directory).

              • #8
                On a shared cluster it is good practice (like Marco is doing) to ask the admins to install software. Under the "modules" system (which Marco's cluster is using) admins will automatically account for dependencies/conflicts with libraries etc. Software like STAR is widely useful, so having a single central install is preferable to having everyone install a local copy. Keeping genome indexes in a central location also saves disk space.

                That said, temporarily running STAR from your own directory (while the admins install a central copy) may be an option for the impatient :-)

                • #9
                  The STAR module was installed, and I'm downloading hg19 + annotations from their ftp server. If it works, in my understanding, I'll have to convert the output to bam, sort it and then index it to get a sorted, indexed bam file.
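
The convert/sort/index steps described here could be scripted roughly as follows. This is only a sketch: it assumes a samtools 1.x install on the PATH (where `samtools sort` accepts SAM input and writes BAM via `-o`), and the helper names `sam_to_sorted_bam_commands` and `run_commands` are invented for this example.

```python
import subprocess


def sam_to_sorted_bam_commands(sam_path, sorted_bam_path, threads=1):
    """Build the samtools commands that turn a SAM file into a sorted, indexed BAM."""
    # -@ sets the number of additional sorting threads; -o names the output file.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", sorted_bam_path, sam_path]
    # Indexing writes the index file next to the sorted BAM.
    index_cmd = ["samtools", "index", sorted_bam_path]
    return [sort_cmd, index_cmd]


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_commands(sam_to_sorted_bam_commands("Aligned.out.sam", "Aligned.sorted.bam", threads=12))` would sort and index on one 12-core node; the file names here are placeholders.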

                  At this stage I still face a problem with cufflinks. I will check whether I can get the job done on a single node (12 CPUs/48 GB) in 24 h (max walltime).
                  If not, I might split the bam file by chromosome and run 23 different jobs on as many nodes.
                  In that case, are there any available options in order to renormalize the FPKMs afterwards?
                  (I'm looking at eXpress in the meantime)
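
On the renormalization question, a back-of-the-envelope sketch only: assuming the textbook definition FPKM = 10^9 * C / (N * L), where C is the fragment count for a transcript, L its length and N the number of mapped fragments used for normalization, a per-chromosome run that normalized by that chromosome's fragments can be rescaled to the whole-library total. The function name `rescale_fpkm` is hypothetical, and Cufflinks' actual normalization (and any cross-chromosome multi-mapping effects) may differ, so this is an approximation rather than a recipe.

```python
def rescale_fpkm(fpkm_chrom, mapped_chrom, mapped_total):
    """Rescale an FPKM computed against one chromosome's mapped fragments.

    Under FPKM = 1e9 * C / (N * L), swapping N_chrom for N_total only
    changes the value by the factor N_chrom / N_total.
    """
    if mapped_total <= 0:
        raise ValueError("mapped_total must be positive")
    return fpkm_chrom * mapped_chrom / mapped_total
```

For instance, a transcript with 100 fragments and length 2000 on a chromosome holding 1 million of 20 million mapped fragments has a per-chromosome FPKM of 50, which rescales to 2.5 library-wide.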

                  • #10
                    Is 24h the longest time slot you have available?

                    • #11
                      Yes,
                      I have lots of available nodes, but unfortunately I only have 3 possible queues: debug (walltime 30 min), parallel (6 h) and longpar (24 h).
                      I will ask the sysadmins if they can create a 96 h queue for me, but I fear this is unlikely to happen. The cluster is also used for other, non-bioinformatics computations, and I think they don't want to reserve nodes for more than 24 h.

                      • #12
                        STAR worked flawlessly and completed the job in 30 minutes. Now I "only" need to solve my walltime problem with the quantification of the transcripts.
                        In the meantime, thank you all!

                        • #13
                          Look at featureCounts as an option for the quantification: http://bioinf.wehi.edu.au/featureCounts/
