#1
Senior Member
Location: Oregon | Join Date: Apr 2011
Posts: 205
I wonder if anyone has tried speeding up bioinformatics jobs with GNU parallel. Any experience, or just thoughts?
#2
Senior Member
Location: Denmark | Join Date: Apr 2009
Posts: 153
GNU parallel is brilliant for executing command-line tools in a Unix/Linux setup with multiple servers/CPUs. It works very well with Biopieces; see the HowTo.
#3
Senior Member
Location: Charlottesville, VA | Join Date: May 2011
Posts: 112
I use it all the time in place of xargs.
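For anyone who hasn't made that switch yet, here is a minimal sketch of the same per-file job done both ways (file names illustrative; gzip stands in for any per-file tool):

```shell
# Compress every .txt file, 4 jobs at a time.
# xargs needs -n 1 to pass one file per invocation; GNU parallel
# does that by default and substitutes each filename at {}.
ls *.txt | xargs -n 1 -P 4 gzip
ls *.txt | parallel -j 4 gzip {}
```

One practical difference: parallel buffers and groups each job's output by default, so lines from concurrent jobs don't get interleaved the way they can with xargs -P.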
#4
Senior Member
Location: Oregon | Join Date: Apr 2011
Posts: 205
Yep, Biopieces is one example, although the HowTo is careful to say it can be used only for some tasks. Since the published examples for parallel don't cover many bioinformatics tasks, I wonder whether there is a general idea of which tasks can benefit from it. More specifically, could compute-intensive, long-running jobs like BLAST, alignment, or de novo assembly benefit from parallel?
#5
Senior Member
Location: Charlottesville, VA | Join Date: May 2011
Posts: 112
Parallel won't parallelize an intrinsically serial job, but it very easily lets you launch many serial jobs in parallel. I use it all the time to run an operation on lots of files, e.g.:
Code:
find *.fq | parallel fastqc {} --outdir .   # run FastQC on all .fq files
find *.bam | parallel samtools index {}     # index all BAM files
#6
Senior Member
Location: Bethesda | Join Date: Feb 2009
Posts: 700
This looks perfect. I've got my own homebrewed program, which I called "tetris", that does the same thing, but I'll definitely switch to this.
Note the --max-procs parameter, which throttles the queued jobs to use only the specified number of CPUs. Has anybody hooked this up to GNU niceload? Any examples?
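A sketch of what that combination might look like, going by the parallel and niceload man pages (file names illustrative; niceload is distributed with GNU parallel):

```shell
# Run at most 8 samtools jobs at once, and let niceload suspend each
# job whenever the load average climbs above 8 (-l sets the load limit).
ls *.bam | parallel --max-procs 8 "niceload -l 8 samtools index {}"
```

I haven't benchmarked this, but the idea is that --max-procs caps concurrency up front while niceload backs off dynamically when the machine is busy with other work.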
#7
Senior Member
Location: Oregon | Join Date: Apr 2011
Posts: 205
Interesting. A little off topic, but I encountered a strange difference in CPU use with FastQC. I made a small script to process 10 files at once (the box has two quad-core processors with hyperthreading enabled, i.e. 8 physical cores and 16 threads), like:
Code:
fastqc -t 10 [file1 ... file10]
When I launched the script, user CPU only reached about 26% (us) in top, but when I pasted the same command directly on the command line, it jumped to about 85%. What could cause the difference? Have you noticed anything like that with parallel?
#8
Junior Member
Location: Denmark | Join Date: Feb 2013
Posts: 7
If you use it for research, please remember:
Code:
parallel --bibtex
#9
Senior Member
Location: Denmark | Join Date: Apr 2009
Posts: 153
@yaximik I see these major benefits of parallel:
1) Use parallel instead of a for ... do ... done loop to execute commands in parallel in a way that optimizes CPU usage (parallel cleverly waits for jobs to complete before starting new ones, without flooding the machine).
2) Use parallel --pipe to parallelize the processing of huge files.
3) Combine 1) and 2).
And then there are all the other things parallel can do for you.
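For point 2), a minimal --pipe sketch: count FASTA records chunk by chunk (big.fa and the block size are illustrative):

```shell
# Split stdin into ~10 MB chunks on FASTA record boundaries ('>' starts
# a record) and run grep on each chunk in parallel; -k keeps the
# per-chunk outputs in input order.
cat big.fa | parallel --pipe -k --recstart '>' --block 10M grep -c '^>'
```

Each chunk prints its own count, so you sum them for the total; the same pattern applies to any filter that works record by record.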
Last edited by maasha; 02-14-2013 at 10:53 AM.
#10
Member
Location: India | Join Date: Jun 2011
Posts: 26
I have been using it for the past 6-8 months. I'm very happy when I can run my jobs with parallel, because it saves a hell of a lot of time, and it makes the best use of the computational facilities you have.
Here is an example of the time I typically save: if I have to convert around 8 SAM files to BAM files and each conversion takes about 8 minutes, in serial it would take 64 minutes, but when I run it on the cluster with GNU parallel it takes only ~8 minutes.
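The conversion described above can be sketched like this (file names and the exact samtools invocation are illustrative):

```shell
# Convert every SAM file in the directory to BAM, one samtools job
# per file; {.} strips the .sam extension, so x.sam becomes x.bam.
ls *.sam | parallel 'samtools view -bS {} > {.}.bam'
```

With 8 files and at least 8 free cores, all conversions run concurrently, which is where the 64 min to ~8 min speedup comes from.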
#11
Senior Member
Location: Denmark | Join Date: Apr 2009
Posts: 153
Over at Biostars there is this tool description.
#12
Senior Member
Location: Oregon | Join Date: Apr 2011
Posts: 205
Quote:
#13
Member
Location: Berkeley, CA | Join Date: May 2010
Posts: 50
Another nice thing about parallel is that it makes it easy to generate filenames intelligently. Say you want to convert a bunch of BAM files to SAM files; you can easily do:
Code:
parallel 'samtools view -h -o {.}.sam {}' ::: *.bam
which does exactly what you want, instead of potentially ending up with .bam.sam files or the like. That's just a trivial example (and possibly not correct, I never remember the syntax exactly), and there's a lot more you can do with it.
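The filename mangling here comes from parallel's replacement strings; --dry-run prints the commands it would run instead of executing them, which is a handy way to check the syntax (the path is illustrative):

```shell
# {}   full argument     -> data/sample.bam
# {.}  extension removed -> data/sample
# {/}  basename          -> sample.bam
# {//} dirname           -> data
parallel --dry-run 'echo {} {.} {/} {//}' ::: data/sample.bam
```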
#14
Senior Member
Location: Oregon | Join Date: Apr 2011
Posts: 205
I tried to run a conversion between two assembly formats using parallel and amos2ace, but got an error:
Code:
$ cat /home/yaximik/AssRefMap/SC/Ray/RayOutput/AMOS.afg | parallel --block 100M -k --pipe --recstart '{' --recend '}' amos2ace > /home/yaximik/AssRefMap/SC/Ray/RayOutput/AMOS.ace
substr outside of string at /usr/bin/parallel line 333.
#15
Senior Member
Location: Denmark | Join Date: Apr 2009
Posts: 153
@yaximik
New questions go in new threads. Do your homework first: read this: http://www.ploscompbiol.org/article/...l.pcbi.1002202 and then man parallel. Notice the section: Quote: