  • kitinje
    Junior Member
    • Jan 2013
    • 7

    Running Tophat/Cufflinks on a cluster with *Multiple* nodes

    Hello,
    I have access to a computer cluster made of 44 nodes. Each node has 12 cores and 48 GB RAM. The main problem is that jobs have a maximum walltime of 6 h, after which they get killed, but I can use as many nodes as I like, which means that for a job I could use e.g. up to 10 nodes = 120 cores and 480 GB RAM, or more.
    So in order to make a tophat job finish in 6 hours, I wanted to parallelize it across a very large number of cores with the appropriate RAM per core, and specify the number of cores through the -p parameter.

    My problem is that I can't get tophat to recognize all the cores of the multiple nodes, and I couldn't run tophat using openmpi (mpirun -np XX tophat etc).
    So I'm wondering if tophat is multi-node capable or not? Is there a way to make it run on multiple nodes or do I need a recompiled version? I could find hints of tophat being run through MPI. (http://seqanswers.com/forums/archive...p/t-11472.html)

    Can anyone give me suggestions or alternatives?
    PS (similar situation applies to cufflinks)

    Thanks,
    Best regards,

    Marco
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Since tophat isn't written to use MPI, the instances run on each node will be blind to each other. There is no way around this without rewriting the program (actually, you'd need to rewrite bowtie as well). If you really want to use tophat, then just split your fastq files into small enough pieces and then push a large number of tophat jobs onto the cluster (normally one would create a script to do this). If you're going that route anyway, just switch to STAR (you might have enough RAM, I'm not sure) and you'll get your results in a fraction of the time. The last alternative here would be Myrna, which seems to be designed with this sort of scenario in mind.
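
For illustration, the "split your fastq files" step could be sketched in Python roughly as below. This is only a sketch under assumptions: the function name `split_fastq` and the chunk naming scheme are made up here, and it assumes the common layout of exactly 4 lines per read (no wrapped sequences).

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, reads_per_chunk, out_dir):
    """Split a FASTQ file into chunks of at most reads_per_chunk reads.

    Assumes 4 lines per record. Returns the list of chunk paths written.
    The name and the ".partNNNN" suffix are illustrative choices.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(path).stem
    chunks = []
    with open(path) as handle:
        while True:
            # Read the next block of complete records (4 lines each).
            lines = list(islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            if len(lines) % 4:
                raise ValueError("truncated FASTQ record in " + str(path))
            chunk = out_dir / f"{stem}.part{len(chunks):04d}.fastq"
            chunk.write_text("".join(lines))
            chunks.append(chunk)
    return chunks
```

For paired-end data, both mate files would need to be split with the same `reads_per_chunk` so that the pairs stay in sync across chunks.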

    You'll have the same problems with cufflinks, namely that increasing the number of instances with MPI won't decrease runtime since the instances won't talk to each other. Your best bet there would be eXpress or to just use count-based methods.


    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      @Marco: Only 6 h of walltime per job, odd indeed (hope there is some logic behind that restriction). Not sure what kind of queuing system your cluster uses but perhaps you can make a case for a separate queue for your jobs with a walltime of at least 24 h?


      • kitinje
        Junior Member
        • Jan 2013
        • 7

        #4
        Thank you for the quick reply, it makes things much clearer to me.
        The cluster uses the PBS queuing system, and yes, there is another queue with a 24 h walltime that allows me to use a more limited number of nodes (which is all I need at this point, at least for tophat/cufflinks). But I couldn't finish a tophat run in 24 h on a single machine. I can split the FASTQ files to partially solve this first problem. But then I face the same problem with cufflinks.
        What happens if I split the BAM files by chromosome and then run 23 cufflinks jobs? Will I face serious problems in the quantification of the isoforms/normalization?
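
As a rough sketch of the "one PBS job per FASTQ chunk" approach discussed above, a job script could be generated per chunk like this. The helper name `pbs_script`, the default queue name, and the file names are placeholders, and the `nodes=1:ppn=12` resource line assumes a Torque-style PBS; other PBS flavours use a different resource syntax.

```python
from pathlib import Path


def pbs_script(chunk, index_prefix, queue="parallel",
               walltime="06:00:00", cores=12):
    """Return the text of a PBS job script running tophat on one chunk.

    Queue name, walltime and core count are site-specific assumptions.
    """
    tag = Path(chunk).stem  # e.g. "reads.part0003"
    lines = [
        "#!/bin/bash",
        f"#PBS -N {tag}",
        f"#PBS -q {queue}",
        f"#PBS -l nodes=1:ppn={cores}",   # one node, all of its cores
        f"#PBS -l walltime={walltime}",
        'cd "$PBS_O_WORKDIR"',            # directory qsub was called from
        f"tophat -p {cores} -o tophat_{tag} {index_prefix} {chunk}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(pbs_script("reads.part0003.fastq", "hg19"))
```

Each generated script would then be written to a file and submitted with `qsub`, so every chunk runs as an independent single-node job.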


        • NicoBxl
          not just another member
          • Aug 2010
          • 264

          #5
          Like dpryan said, if you are time limited, use STAR (very, very fast, with the same or even better results than tophat).


          • kitinje
            Junior Member
            • Jan 2013
            • 7

            #6
            Yes, thanks. I contacted the sysadmins to have a STAR module installed on the server, and I will see how it works.


            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              You don't need them to install anything; you can just do that yourself (just install it into your home directory).


              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                On a shared cluster it is good practice (like Marco is doing) to ask the admins to install software. Under the "modules" system (which Marco's cluster is using) admins will automatically account for dependencies/conflicts with libraries etc. Software like STAR is widely useful, so having a single central install is preferable to having everyone install a local copy. Keeping genome indexes in a central location also saves disk space.

                That said, temporarily running STAR from your own directory (while the admins install a central copy) may be an option for the impatient :-)


                • kitinje
                  Junior Member
                  • Jan 2013
                  • 7

                  #9
                  The STAR module was installed, and I'm downloading hg19 + annotations from their ftp server. If it works, in my understanding, I'll have to convert the output to BAM, then sort and index it.
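
The convert/sort/index steps mentioned here could look roughly like the sketch below. It only builds the command lines; it assumes samtools 1.x option syntax (older 0.1.x releases used a different `sort` invocation), and the helper name `bam_commands` and the output naming are illustrative.

```python
def bam_commands(sam, threads=12):
    """Build samtools commands to turn a SAM file into a sorted, indexed BAM.

    Assumes samtools 1.x syntax. Returns a list of argument lists.
    """
    base = sam[:-4] if sam.endswith(".sam") else sam
    unsorted_bam = base + ".bam"
    sorted_bam = base + ".sorted.bam"
    return [
        # SAM -> BAM
        ["samtools", "view", "-b", "-o", unsorted_bam, sam],
        # coordinate sort, using extra threads
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, unsorted_bam],
        # writes the .bai index next to the sorted BAM
        ["samtools", "index", sorted_bam],
    ]


if __name__ == "__main__":
    for cmd in bam_commands("Aligned.out.sam"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` inside a job script so that a failing step stops the pipeline.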

                  At this stage I still face a problem with cufflinks. I will check whether I can get the job done using a single node (12 CPUs/48 GB) in 24 h (max walltime).
                  If not, I might split the bam file by chromosome and run 23 different jobs in as many nodes.
                  In that case, are there any available options in order to renormalize the FPKMs afterwards?
                  (I'm looking at eXpress in the meantime)
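
On the renormalization question, the arithmetic for the library-size term alone can be sketched as follows. This assumes the textbook definition FPKM = 10^9 x fragments / (transcript length in bp x total mapped fragments), and assumes each per-chromosome run used only that chromosome's mapped fragments as its denominator. It does not account for anything else Cufflinks does when estimating abundances, so treat it as a sanity check on the scaling, not a verified recipe. The function names are made up for this example.

```python
import math


def fpkm(fragments, length_bp, total_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return 1e9 * fragments / (length_bp * total_fragments)


def rescale_fpkm(fpkm_chr, chr_fragments, total_fragments):
    """Rescale an FPKM computed against one chromosome's fragment count
    so that it is expressed against the whole-library fragment count.

    Only corrects the library-size denominator (an assumption about how
    the per-chromosome value was computed).
    """
    return fpkm_chr * chr_fragments / total_fragments


if __name__ == "__main__":
    # Toy numbers: 100 fragments on a 2 kb transcript, 1M fragments on the
    # chromosome, 20M fragments in the whole library.
    per_chr = fpkm(100, 2000, 1_000_000)
    whole = rescale_fpkm(per_chr, 1_000_000, 20_000_000)
    assert math.isclose(whole, fpkm(100, 2000, 20_000_000))
    print(per_chr, whole)
```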


                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    Is 24h the longest time slot you have available?


                    • kitinje
                      Junior Member
                      • Jan 2013
                      • 7

                      #11
                      Yes,
                      I have lots of available nodes but unfortunately I only have 3 possible queues: debug (walltime 30 min), parallel (6 h), longpar (24 h).
                      I will ask the sysadmins if they can create a 96 h queue for me, but I fear this is unlikely to happen. The cluster is being used for other kinds of computation unrelated to bioinformatics, and I think that they don't want to reserve nodes for more than 24 h.


                      • kitinje
                        Junior Member
                        • Jan 2013
                        • 7

                        #12
                        STAR worked flawlessly and completed the job in 30 minutes. Now I "only" need to solve my walltime problem with the quantification of the transcripts.
                        In the meantime, thank you all!


                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          Look at featureCounts as an option for the quantification: http://bioinf.wehi.edu.au/featureCounts/
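
A typical invocation would be along the lines of `featureCounts -T 12 -a genes.gtf -o counts.txt aligned.bam` (file names are placeholders). As a small sketch, the resulting table could be read into Python as below. This assumes the usual tab-delimited layout: a leading comment line starting with `#`, then a header of Geneid, Chr, Start, End, Strand, Length followed by one count column per BAM file, with integer counts. The function name `read_counts` is made up here.

```python
import csv
import io


def read_counts(handle):
    """Parse featureCounts-style output into {gene_id: {sample: count}}.

    Assumes six annotation columns before the per-sample count columns
    and integer counts (no fractional counting).
    """
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[6:]
    counts = {}
    for fields in reader:
        counts[fields[0]] = dict(zip(samples, (int(x) for x in fields[6:])))
    return counts


if __name__ == "__main__":
    # Toy table, not real output.
    toy = io.StringIO(
        "# toy comment line\n"
        "Geneid\tChr\tStart\tEnd\tStrand\tLength\taligned.bam\n"
        "geneA\tchr1\t100\t900\t+\t801\t42\n"
        "geneB\tchr2\t50\t250\t-\t201\t7\n"
    )
    print(read_counts(toy))
```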
