#1
Junior Member
Location: Milan Join Date: Jan 2013
Posts: 7
Hello,
I have access to a compute cluster of 44 nodes; each node has 12 cores and 48 GB of RAM. The main problem is that jobs have a maximum walltime of 6 h, after which they are killed, but I can use as many nodes as I like, which means that for one job I could request e.g. 10 nodes = 120 cores and 480 GB of RAM, or more. So, to make a tophat job finish within 6 h, I wanted to parallelize it across a very large number of cores (with the appropriate RAM per core) and specify the number of cores through the -p parameter. My problem is that I can't get tophat to recognize the cores of multiple nodes, and I couldn't run tophat under Open MPI (mpirun -np XX tophat etc.). So I'm wondering: is tophat multi-node capable or not? Is there a way to make it run on multiple nodes, or do I need a recompiled version? I did find hints of tophat being run through MPI (http://seqanswers.com/forums/archive...p/t-11472.html). Can anyone give me suggestions or alternatives?
PS: a similar situation applies to cufflinks.
Thanks, best regards,
Marco
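For concreteness, a PBS job script along these lines (queue name and file names are assumptions) shows the mismatch: PBS will happily grant ten nodes, but tophat is a single multithreaded program, so -p can only use the cores of the node the process starts on.

```shell
#PBS -q parallel
#PBS -l nodes=10:ppn=12,walltime=06:00:00
cd "$PBS_O_WORKDIR"

# tophat runs as one multithreaded process on the *first* granted node,
# so anything beyond -p 12 here buys nothing:
tophat -p 120 -o tophat_out bowtie_index reads_1.fastq reads_2.fastq

# mpirun merely launches N identical, mutually unaware copies of tophat:
# mpirun -np 120 tophat ...    # does NOT distribute one alignment job
```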
#2
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
Since tophat isn't written to use MPI, the instances run on each node will be blind to each other. There is no way around this without rewriting the program (actually, you'd need to rewrite bowtie as well). If you really want to use tophat, then just split your fastq files into small enough pieces and push a large number of tophat jobs onto the cluster (normally one would write a script to do this). If you're going that route anyway, just switch to STAR (you might have enough RAM, I'm not sure) and you'll get your results in a fraction of the time. The last alternative here would be Myrna, which seems to be designed with this sort of scenario in mind.
You'll have the same problem with cufflinks, namely that increasing the number of instances with MPI won't decrease runtime, since the instances won't talk to each other. Your best bet there would be eXpress, or to just use count-based methods.
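The fastq-splitting approach can be sketched with coreutils alone. This is a toy example; the chunk size, file names, and the commented qsub line are illustrative assumptions:

```shell
# Toy FASTQ with 6 reads (a FASTQ record is exactly 4 lines).
for i in 1 2 3 4 5 6; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > sample.fastq

# Split into chunks of 2 reads = 8 lines each; on real data you would
# use something like -l 16000000 for 4-million-read chunks.
split -l 8 -d sample.fastq chunk_

ls chunk_*   # three files: chunk_00, chunk_01, chunk_02

# One tophat job per chunk (sketch only; index/queue are assumptions):
# for f in chunk_*; do
#   echo "tophat -p 12 -o out_$f bowtie_index $f" | \
#     qsub -l nodes=1:ppn=12,walltime=06:00:00
# done
```

The per-chunk BAMs would then be merged (e.g. with samtools merge) before any downstream step.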
#3
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,081
@Marco: Only 6 h of walltime per job is odd indeed (I hope there is some logic behind that restriction). I'm not sure what kind of queuing system your cluster uses, but perhaps you can make a case for a separate queue for your jobs with a walltime of at least 24 h?
#4
Junior Member
Location: Milan Join Date: Jan 2013
Posts: 7
Thank you for the quick reply; it makes things much clearer to me.
The cluster uses a PBS queuing system, and yes, there is another queue with a 24 h walltime that limits me to fewer nodes (not a problem at this point, at least for tophat/cufflinks). But I couldn't finish a tophat run in 24 h on a single machine. I can split the FASTQ files to partially solve this first problem, but then I face the same problem with cufflinks. What happens if I split the BAM file by chromosome and then run 23 cufflinks jobs? Will I face serious problems in the quantification of the isoforms/normalization?
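The mechanical part of the split is straightforward with samtools (file names here are assumptions); the open question is the statistics, since each per-chromosome cufflinks run normalizes against only the reads it sees:

```shell
# The BAM must be coordinate-sorted and indexed first.
samtools index accepted_hits.bam

# Extract one chromosome per file; region fetching needs the .bai index.
for chr in $(seq 1 22) X; do
  samtools view -b accepted_hits.bam chr${chr} > chr${chr}.bam
done

# Then one cufflinks job per piece (sketch):
# cufflinks -p 12 -o cuff_chr${chr} chr${chr}.bam
```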
#5
not just another member
Location: Belgium Join Date: Aug 2010
Posts: 264
Like dpryan said, if you are time-limited, use STAR (very, very fast, with the same or even better results than tophat).
#6
Junior Member
Location: Milan Join Date: Jan 2013
Posts: 7
Yes, thanks. I contacted the sysadmins to have a STAR module installed on the server, and I'll see how it works.
#7
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
You don't need them to install anything; you can just do that yourself (install it into your home directory).
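A per-user install needs no admin rights. A minimal sketch (the STAR download URL and version are assumptions; check the STAR site for the current release):

```shell
# Create a personal software tree and put it on PATH.
mkdir -p "$HOME/opt" "$HOME/bin"
export PATH="$HOME/bin:$PATH"

# Download and unpack STAR (commented out here; version is an assumption):
# cd "$HOME/opt"
# wget https://github.com/alexdobin/STAR/archive/refs/tags/2.7.11b.tar.gz
# tar xzf 2.7.11b.tar.gz
# STAR ships a precompiled Linux binary; link it onto PATH:
# ln -s "$HOME/opt/STAR-2.7.11b/bin/Linux_x86_64_static/STAR" "$HOME/bin/STAR"
```

Adding the export line to ~/.bashrc makes the personal bin directory available in every job.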
#8
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,081
On a shared cluster it is good practice (as Marco is doing) to ask the admins to install software. Under the "modules" system (which Marco's cluster is using) the admins will automatically account for dependencies/conflicts with libraries etc. Software like STAR is widely useful, so a single central install is preferable to everyone keeping a local copy. Keeping genome indexes in a central location also saves disk space.
That said, temporarily running STAR from your own directory (while the admins install a central copy) may be an option for the impatient :-)
#9
Junior Member
Location: Milan Join Date: Jan 2013
Posts: 7
The STAR module was installed; I'm downloading hg19 + annotations from their FTP server. If it works, as I understand it, I'll have to convert the output to BAM, sort it, and then create a sorted, indexed BAM file.
At this stage I still face a problem with cufflinks. I will check whether I can get the job done on a single node (12 cores/48 GB) within the 24 h maximum walltime. If not, I might split the BAM file by chromosome and run 23 different jobs on as many nodes. In that case, are there any available options to renormalize the FPKMs afterwards? (I'm looking at eXpress in the meantime.)
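The convert/sort/index steps map onto standard samtools commands (file names are assumptions; the -o syntax is for samtools 1.x and later). Newer STAR versions can also emit sorted BAM directly with --outSAMtype BAM SortedByCoordinate, skipping the first two steps:

```shell
# STAR writes Aligned.out.sam by default; convert, sort, index:
samtools view -b Aligned.out.sam > aligned.bam    # SAM -> BAM
samtools sort aligned.bam -o aligned.sorted.bam   # coordinate sort
samtools index aligned.sorted.bam                 # writes the .bai index
```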
#10
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,081
Is 24h the longest time slot you have available?
#11
Junior Member
Location: Milan Join Date: Jan 2013
Posts: 7
Yes,
I have lots of available nodes, but unfortunately only three possible queues: debug (30 min walltime), parallel (6 h), and longpar (24 h). I will ask the sysadmins if they can create a 96 h queue for me, but I fear this is unlikely to happen. The cluster is used for other, non-bioinformatics computations, and I think they don't want to reserve nodes for more than 24 h.
#12
Junior Member
Location: Milan Join Date: Jan 2013
Posts: 7
STAR worked flawlessly and completed the job in 30 minutes. Now I "only" need to solve my walltime problem for the quantification of the transcripts.
In the meantime, thank you all!
#13
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,081
Look at featureCounts as an option for the quantification: http://bioinf.wehi.edu.au/featureCounts/
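featureCounts is a single fast command per BAM, so the 24 h walltime should be a non-issue. A typical invocation (annotation and file names are assumptions; flag meanings per the featureCounts documentation):

```shell
# Count reads per gene against a GTF annotation:
#   -T sets threads, -p counts fragments for paired-end data.
featureCounts -T 12 -p -a genes.gtf -o counts.txt aligned.sorted.bam
```

The resulting count table feeds directly into count-based differential-expression tools instead of cufflinks FPKMs.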
Tags |
cluster, nodes, rnaseq, tophat |