SEQanswers

Old 03-31-2014, 02:39 AM   #1
kitinje
Junior Member
 
Location: Milan

Join Date: Jan 2013
Posts: 7
Running Tophat/Cufflinks on a cluster with *Multiple* nodes

Hello,
I have access to a computer cluster with 44 nodes; each node has 12 cores and 48 GB of RAM. The main constraint is that jobs have a maximum walltime of 6 h, after which they get killed, but I can use as many nodes as I like, so a single job could use e.g. up to 10 nodes = 120 cores and 480 GB of RAM, or more.
So, to make a tophat job finish within 6 h, I wanted to parallelize it across a very large number of cores (with the appropriate RAM per core), specifying the number of cores through the -p parameter.

My problem is that I can't get tophat to recognize the cores of multiple nodes, and I couldn't run tophat through OpenMPI (mpirun -np XX tophat etc.).
So I'm wondering: is tophat multi-node capable or not? Is there a way to make it run on multiple nodes, or do I need a recompiled version? I could find hints of tophat being run through MPI (http://seqanswers.com/forums/archive...p/t-11472.html).

Can anyone give me suggestions or alternatives?
PS (similar situation applies to cufflinks)

Thanks,
Best regards,

Marco
Old 03-31-2014, 02:51 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Since tophat isn't written to use MPI, the instances running on each node will be blind to each other. There is no way around this without rewriting the program (actually, you'd need to rewrite bowtie as well). If you really want to use tophat, then split your fastq files into small enough pieces and push a large number of tophat jobs onto the cluster (normally one would write a script to do this). If you're going that route anyway, just switch to STAR (you might have enough RAM, I'm not sure) and you'll get your results in a fraction of the time. The last alternative here would be Myrna, which seems to be designed with exactly this sort of scenario in mind.

You'll have the same problems with cufflinks, namely that increasing the number of instances with MPI won't decrease runtime since the instances won't talk to each other. Your best bet there would be eXpress or to just use count-based methods.
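For the split-and-scatter approach described above, plain coreutils is enough, since FASTQ records are exactly four lines. A minimal, self-contained sketch (the read count is tiny only so the example runs as-is; real chunks would be millions of reads, and paired R1/R2 files must be split identically):

```shell
#!/bin/sh
# Split a FASTQ into fixed-size chunks, keeping 4-line records intact.
# READS_PER_CHUNK is tiny here for the toy example; in practice use
# something like 10000000.
READS_PER_CHUNK=2

# Generate a small synthetic FASTQ (6 reads) to stand in for real data.
for i in 1 2 3 4 5 6; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > sample_R1.fastq

# -l N splits every N lines; a multiple of 4 never breaks a record.
# -d (GNU split) uses numeric suffixes: chunk_R1.00, chunk_R1.01, ...
split -l $(( READS_PER_CHUNK * 4 )) -d sample_R1.fastq chunk_R1.

ls chunk_R1.*
```

One tophat job would then be submitted per chunk, and the per-chunk BAMs merged afterwards (e.g. with samtools merge).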
Old 03-31-2014, 04:25 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,081

@Marco: Only 6 h of walltime per job, odd indeed (hope there is some logic behind that restriction). Not sure what kind of queuing system your cluster uses but perhaps you can make a case for a separate queue for your jobs with a walltime of at least 24 h?
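For reference, requesting a different queue under PBS is just a directive change in the job script. A sketch of such a script; the queue name, resource strings, index, and file names are placeholders that depend on the site configuration:

```shell
#PBS -q longpar            # longer queue (here: 24 h walltime)
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=12     # one full node: 12 cores
#PBS -N tophat_run

cd "$PBS_O_WORKDIR"
# -p matches the cores requested above; names are placeholders
tophat -p 12 -o tophat_out bowtie_index reads_1.fastq reads_2.fastq
```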
Old 03-31-2014, 06:37 AM   #4
kitinje
Junior Member
 
Location: Milan

Join Date: Jan 2013
Posts: 7
Thank you for the quick reply, it makes things much clearer to me.
The cluster uses a PBS queuing system, and yes, there is another queue with a 24 h walltime that allows me to use a more limited number of nodes (unnecessary at this point, at least for tophat/cufflinks). But I couldn't finish a tophat run in 24 h on a single machine. I can split the FASTQ files to partially solve that first problem, but then I face the same problem with cufflinks.
What happens if I split the BAM file by chromosome and then run 23 cufflinks jobs? Will I face serious problems in the quantification/normalization of the isoforms?
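Generating the 23 per-chromosome jobs can itself be scripted. A sketch, with queue, paths, and file names as placeholders; it assumes the BAM is coordinate-sorted and indexed so samtools can extract by region, and it only writes the job scripts rather than submitting them (pass them to qsub when ready). Note the normalization caveat is real: each cufflinks run would compute its FPKM denominators from its own chromosome's fragments only.

```shell
#!/bin/sh
# Write one PBS job script per human chromosome (chr1..chr22, chrX).
# Nothing is submitted here; run `qsub jobs/<script>` when ready.
mkdir -p jobs
for c in $(seq 1 22) X; do
  chrom="chr${c}"
  cat > "jobs/cufflinks_${chrom}.sh" <<EOF
#PBS -q longpar
#PBS -l walltime=24:00:00,nodes=1:ppn=12
cd \$PBS_O_WORKDIR
# Extract one chromosome (input BAM must be sorted and indexed)
samtools view -b accepted_hits.bam ${chrom} > ${chrom}.bam
samtools index ${chrom}.bam
cufflinks -p 12 -o cufflinks_${chrom} ${chrom}.bam
EOF
done
ls jobs | wc -l
```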
Old 03-31-2014, 07:11 AM   #5
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 264
Like dpryan said, if you are time-limited, use STAR (very, very fast, with results as good as or better than tophat's).
Old 03-31-2014, 07:20 AM   #6
kitinje
Junior Member
 
Location: Milan

Join Date: Jan 2013
Posts: 7
Yes, thanks. I contacted the sysadmins about installing a STAR module on the server, and I'll see how it works.
Old 03-31-2014, 07:22 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
You don't need them to install anything; you can just do it yourself (install it into your home directory).
Old 03-31-2014, 07:36 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,081
On a shared cluster it is good practice (as Marco is doing) to ask the admins to install software. Under the "modules" system (which Marco's cluster is using), admins will automatically account for dependencies/conflicts with libraries, etc. Software like STAR is widely useful, so a single central install is preferable to everyone keeping a local copy. Keeping genome indexes in a central location also saves disk space.

That said, temporarily running STAR from your own directory (while the admins install a central copy) may be an option for the impatient :-)
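With the Environment Modules system mentioned above, the per-user workflow once the admins have installed a package is just (exact module names vary by site, so check with `module avail` first):

```shell
module avail 2>&1 | grep -i star   # module avail traditionally prints to stderr
module load STAR                   # pulls the STAR binary into $PATH
STAR --version
```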
Old 03-31-2014, 09:14 AM   #9
kitinje
Junior Member
 
Location: Milan

Join Date: Jan 2013
Posts: 7
The STAR module was installed, and I'm downloading hg19 + annotations from their FTP server. If it works, as I understand it, I'll have to convert the output to BAM, sort it, and end up with a sorted, indexed BAM file.

At this stage I still face a problem with cufflinks; I will check whether I can get the job done on a single node (12 CPUs / 48 GB) within the 24 h max walltime.
If not, I might split the BAM file by chromosome and run 23 different jobs on as many nodes.
In that case, are there any available options for renormalizing the FPKMs afterwards?
(I'm looking at eXpress in the meantime.)
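The convert-sort-index step mentioned above is a short samtools pipeline. A job-script sketch, assuming STAR's default Aligned.out.sam output and samtools >= 1.3 flags (file names are placeholders; newer STAR versions can also emit a sorted BAM directly with --outSAMtype BAM SortedByCoordinate, skipping the sort step):

```shell
#PBS -q parallel
#PBS -l walltime=06:00:00,nodes=1:ppn=12

cd "$PBS_O_WORKDIR"
# samtools >= 1.3 reads SAM directly: convert + coordinate-sort in one step
samtools sort -@ 12 -o Aligned.sorted.bam Aligned.out.sam
samtools index Aligned.sorted.bam
```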
Old 03-31-2014, 09:36 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,081

Is 24h the longest time slot you have available?
Old 03-31-2014, 09:48 AM   #11
kitinje
Junior Member
 
Location: Milan

Join Date: Jan 2013
Posts: 7
Yes,
I have lots of available nodes, but unfortunately only 3 possible queues: debug (30 min walltime), parallel (6 h), and longpar (24 h).
I will ask the sysadmins if they can create a 96 h queue for me, but I fear this is unlikely to happen. The cluster is also used for other, non-bioinformatics computations, and I think they don't want to reserve nodes for more than 24 h.
Old 03-31-2014, 04:33 PM   #12
kitinje
Junior Member
 
Location: Milan

Join Date: Jan 2013
Posts: 7
STAR worked flawlessly and completed the job in 30 minutes. Now I "only" need to solve my walltime problem for the quantification of the transcripts.
In the meantime, thank you all!
Old 03-31-2014, 06:42 PM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,081
Look at featureCounts as an option for the quantification: http://bioinf.wehi.edu.au/featureCounts/
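A typical featureCounts invocation for reference; -p counts read pairs (paired-end data), -T sets threads, and the annotation/BAM file names here are placeholders:

```shell
# Count fragments per gene from the sorted BAM, using 12 threads
featureCounts -T 12 -p \
    -a genes.gtf \
    -o gene_counts.txt \
    Aligned.sorted.bam
```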

Tags
cluster, nodes, rnaseq, tophat
