SEQanswers

04-16-2012, 05:58 PM   #1
simonzmmmmm
Junior Member
Location: Melbourne, Australia
Join Date: Mar 2012
Posts: 7
Bpipe: a new tool for running analysis pipelines

Hello all,

I would like to let everyone know about Bpipe, a new tool we have created to help run bioinformatics pipelines.

Many people will be familiar with tools like Galaxy and Taverna, which help you run pipelines and provide a graphical view of the pipeline, its inputs and outputs, and many other features that make analysis pipelines more robust and manageable. Bpipe is similar in many ways but is aimed at users who are command line oriented. It lets you write your pipelines almost the way you would write a shell script, but it automatically adds features such as:
  • Transactional management of tasks - commands that fail get outputs cleaned up, log files saved and the pipeline cleanly aborted.
  • Automatic connection of pipeline stages - Bpipe manages the file names for input and output of each stage in a systematic way so that you don't need to think about it.
  • Easy stopping or restarting - when a job fails it is easy to cleanly restart from the point of failure.
  • Audit trail - Bpipe keeps a journal of exactly which commands executed and what their inputs and outputs were.
  • Modularity - it's easy to make a library of pipeline stages (or commands) that you frequently use and mix and match them in different pipelines.
  • Parallelism - easily run many samples/files at the same time, or split one sample and run analysis on many parts of it in parallel.
  • Integration with cluster resource managers - Bpipe supports PBS/Torque and more systems can be added easily.
  • Notifications - Bpipe can send you alerts by email or instant message to tell you when your pipeline finishes or even as each stage completes.
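The transactional behaviour in the first bullet can be pictured with a small shell function. This is only a sketch of the idea and not how Bpipe implements it; `run_stage` and the `PIPELINE_LOG` variable are made-up names for illustration.

```shell
# run_stage OUTPUT COMMAND [ARGS...]
# Runs COMMAND with stdout redirected to OUTPUT. If COMMAND fails, the
# partial OUTPUT is removed, the failure is logged, and the command's
# exit status is returned so the caller can abort the pipeline.
run_stage() {
    output=$1
    shift
    log=${PIPELINE_LOG:-pipeline.log}
    if "$@" > "$output"; then
        echo "ok: $output" >> "$log"
    else
        rc=$?
        rm -f "$output"
        echo "failed (exit $rc): $*" >> "$log"
        return "$rc"
    fi
}
```

A call such as `run_stage sorted.txt sort reads.txt || exit 1` then either leaves a complete `sorted.txt` or leaves nothing behind, which is the property that makes a later restart safe.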
Bpipe is BSD licensed and available, along with documentation and examples, at http://bpipe.org. We also have a publication accepted in Bioinformatics which may be of interest as well.

Bpipe is very young and I hope to make many improvements, so I would love to have feedback from anybody here about it.

Thanks!

Simon

04-16-2012, 07:30 PM   #2
adaptivegenome
Super Moderator
Location: US
Join Date: Nov 2009
Posts: 437

I think this is a great tool for building pipelines. Thanks for sharing. I think it would be awesome if we could eventually adopt a similar approach for an analysis framework where, if tools are written as plugins, a BAM file could be opened once, operated on by several plugins, and written once.

So much time is wasted on writing a BAM file over and over.

04-17-2012, 03:35 AM   #3
maubp
Peter (Biopython etc)
Location: Dundee, Scotland, UK
Join Date: Jul 2009
Posts: 1,543

Quote:
Originally Posted by genericforms View Post
So much time is wasted on writing a BAM file over and over.
That's why you should use Unix pipes where possible (ideally without compressing the intermediate BAM files, use -u in samtools). Does Bpipe support this? Perhaps it could using named pipes?
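As a rough sketch of the streaming approach, the function below filters and sorts in a single pass, with uncompressed BAM (`-u`) travelling over the pipe so nothing is compressed only to be decompressed by the next tool. The function name, the default MAPQ cutoff of 20 and the file names are placeholders, and the `sort -o FILE -` form assumes samtools 1.3 or later (older releases took an output prefix instead).

```shell
# filter_and_sort IN_BAM OUT_BAM [MIN_MAPQ]
# Streams IN_BAM through a quality filter straight into the sorter.
# Only the final, sorted BAM is written to disk.
filter_and_sort() {
    in_bam=$1
    out_bam=$2
    min_mapq=${3:-20}
    # -u: uncompressed BAM on stdout; "-" tells sort to read stdin.
    samtools view -u -q "$min_mapq" "$in_bam" | samtools sort -o "$out_bam" -
}
```

Usage would be something like `filter_and_sort aligned.bam aligned.sorted.bam 30`.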

04-17-2012, 05:34 AM   #4
brentp
Member
Location: salt lake city, UT
Join Date: Apr 2010
Posts: 72

Looks pretty useful, could you explain more about:
http://code.google.com/p/bpipe/wiki/...pelineTutorial

where, for example you have:

Code:
@Transform("bai")
index = {
        exec "samtools index $input"
        return input
}
How does it know which $input to use? Does each step use the $output from the previous one?
What if a given step needs the $output of several previous steps?
I guess it's not clear to me what's going on with the $input and $output names.

04-17-2012, 05:39 AM   #5
adaptivegenome
Super Moderator
Location: US
Join Date: Nov 2009
Posts: 437

Quote:
Originally Posted by maubp View Post
That's why you should use Unix pipes where possible (ideally without compressing the intermediate BAM files, use -u in samtools). Does Bpipe support this? Perhaps it could using named pipes?
So, thinking about the process from streaming the output SAM from the mapper all the way to the final step, in which a recalibrated and realigned BAM is ready for mutation calling, is it simply sufficient to pipe all the intermediary steps?

If so, then this would negate the need for plug-ins.

04-17-2012, 05:44 AM   #6
brentp
Member
Location: salt lake city, UT
Join Date: Apr 2010
Posts: 72

Quote:
Originally Posted by genericforms View Post
If so, then this would negate the need for plug-ins.
only if you don't care about keeping around intermediate files, in case the pipeline adds a step in there or you change some parameters in a later step and don't want to re-run the entire pipeline.

04-17-2012, 06:14 AM   #7
adaptivegenome
Super Moderator
Location: US
Join Date: Nov 2009
Posts: 437

Quote:
Originally Posted by brentp View Post
only if you don't care about keeping around intermediate files, in case the pipeline adds a step in there or you change some parameters in a later step and don't want to re-run the entire pipeline.
Yes I agree. I guess I was wondering if there was a rationale for a more complex plugin framework. Seems like for the most part there is not one.

04-17-2012, 04:08 PM   #8
simonzmmmmm
Junior Member
Location: Melbourne, Australia
Join Date: Mar 2012
Posts: 7

Hi maubp,
Quote:
Originally Posted by maubp View Post
That's why you should use Unix pipes where possible (ideally without compressing the intermediate BAM files, use -u in samtools).

Does Bpipe support this? Perhaps it could using named pipes?
Bpipe is file oriented, so it does expect to see a file at the output of each stage. In my usage, a single pipeline "stage" will often be several parts of the process piped together, and the output of that arrives as a BAM file that is sort of like a "checkpoint". That lets you restart or rerun parts of the analysis from there. So you're not storing a BAM file for every single part of the process, but having them at several points in between is useful nonetheless. There is an open issue sort of related to this.
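To illustrate the checkpoint idea in plain shell (a sketch of the concept only, with made-up names, not Bpipe's own logic): a stage is skipped when its output file is already there, and the output only appears under its final name once the whole piped command has succeeded.

```shell
# checkpoint OUTPUT COMMAND [ARGS...]
# Skips the stage if OUTPUT already exists and is non-empty; otherwise
# runs COMMAND, writing to a temporary name first so an interrupted run
# never leaves a half-written file that looks like a finished checkpoint.
checkpoint() {
    out=$1
    shift
    if [ -s "$out" ]; then
        echo "skip: $out already exists"
        return 0
    fi
    if "$@" > "$out.tmp"; then
        mv "$out.tmp" "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi
}

# Hypothetical stage: several commands piped together, one file at the end.
# printf and sort stand in for an aligner piped into samtools.
align_and_sort() {
    printf 'b\na\nc\n' | sort
}
```

Running `checkpoint sorted.txt align_and_sort` twice does the work once; the second call just reports the skip. Note that this only checks that the file exists, not whether its inputs have changed since.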

Named pipes are a really interesting idea. I think at the moment it would be problematic, because Bpipe expects the process for a pipeline stage to terminate before it will initiate the next stage. But with some tweaks that could be relaxed to allow this mode of operation.
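A small shell sketch shows why the "terminate before the next stage starts" rule gets in the way of named pipes: the writer blocks until a reader opens the pipe, so both stages have to be running at the same time. Here printf and sort are stand-ins for real tools, and all names are invented for the example.

```shell
workdir=$(mktemp -d)
fifo="$workdir/stage1.out"
mkfifo "$fifo"

# "Stage 1" writes into the named pipe in the background. It cannot
# finish until something reads from the other end.
printf 'chr2\nchr1\nchr3\n' > "$fifo" &

# "Stage 2" reads from the pipe while stage 1 is still running.
sort "$fifo" > "$workdir/sorted.txt"
wait

result=$(cat "$workdir/sorted.txt")
rm -rf "$workdir"
echo "$result"
```

If stage 2 were only started after stage 1 had exited, stage 1 would simply hang on the open of the pipe.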

I'll definitely put more thought into this - thanks for the discussion / ideas!

04-17-2012, 04:25 PM   #9
simonzmmmmm
Junior Member
Location: Melbourne, Australia
Join Date: Mar 2012
Posts: 7

Hi brentp,
Quote:
Originally Posted by brentp View Post
Looks pretty useful, could you explain more about:
http://code.google.com/p/bpipe/wiki/...pelineTutorial

where, for example you have:

Code:
@Transform("bai")
index = {
        exec "samtools index $input"
        return input
}
How does it know which $input to use? Does each step use the $output from the previous one?
This is the default; if you do nothing else, the output from a previous stage becomes the $input variable for the next stage.
Quote:
What if a given step needs the $output of several previous steps?
Bpipe gives you a sort of "query language" to easily get back to any of the previous outputs. You can think of it as querying the tree of outputs in reverse using a very simple syntax, (but it is so simple that this is more of a mental model than a reality). So suppose you need the VCF file from a previous stage and a BAM file, and they are not already the default input. You can get at them like this:
Code:
exec "somecommand $input.vcf $input.bam"
Which will expand to:
Code:
exec "somecommand most_recent_vcf.vcf most_recent_bam.bam"
If you want all the BAMs from the most recent pipeline stage that produced a BAM file:
Code:
exec "somecommand $inputs.bam"
(Notice the "input" has become "inputs"). The above will search backward through pipeline stages until it finds a stage that produced one or more BAM files. Then it will expand to:
Code:
exec "somecommand file1.bam file2.bam file3.bam ..."
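The backward search can be modelled in a few lines of shell. This is only the mental model written out, with invented names and file lists; it makes no claim about how Bpipe does the lookup internally.

```shell
# One line per pipeline stage, oldest first (hypothetical file names).
stage_outputs='sample1.fastq sample2.fastq
sample1.bam sample2.bam
calls.vcf'

# resolve_inputs EXT
# Walks the stages from newest to oldest and prints every file ending in
# .EXT from the first stage that produced at least one such file.
resolve_inputs() {
    ext=$1
    printf '%s\n' "$stage_outputs" |
    awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }' |
    while read -r files; do
        matches=""
        for f in $files; do
            case $f in
                *."$ext") matches="$matches $f" ;;
            esac
        done
        if [ -n "$matches" ]; then
            # Unquoted on purpose: drops the leading space.
            echo $matches
            break
        fi
    done
}
```

With the stage list above, `resolve_inputs bam` skips the newest stage (which only produced a VCF) and prints both BAM files from the stage before it, while `resolve_inputs vcf` stops at the newest stage.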
You can use all the normal BASH constructs inside your commands too - so if you want to index every bam file:
Code:
exec "for i in $inputs.bam; do samtools index $i; done"
(You'd probably want to do this a bit smarter and run them in parallel, but just for the sake of example).
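For what it's worth, one plain-shell way to run the indexing in parallel is xargs with `-P` (supported by GNU and BSD xargs, though not required by POSIX). This is a sketch outside Bpipe; `index_all` and the `INDEX_CMD` override are made-up names, and file names containing whitespace would need extra care.

```shell
# index_all MAX_JOBS FILE...
# Runs the indexing command once per FILE, at most MAX_JOBS at a time.
# INDEX_CMD is a hypothetical hook: set it to "echo" for a dry run.
index_all() {
    max_jobs=$1
    shift
    # ${INDEX_CMD:-...} is unquoted on purpose so that "samtools index"
    # splits into a command and its first argument.
    printf '%s\n' "$@" | xargs -n 1 -P "$max_jobs" ${INDEX_CMD:-samtools index}
}
```

For example, `index_all 4 *.bam` would keep up to four `samtools index` processes busy until every file has been indexed.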

Cheers,

Simon

04-19-2012, 02:45 AM   #10
linusvanpelt
Junior Member
Location: La Jolla
Join Date: Apr 2011
Posts: 5

Hi Simon,

I am trying to do some parallelization with bpipe and hope you can help me out on a problem. Like in this example from the Wiki

Code:
Bpipe.run {
  chr(1..5) * [ hello ]
}
I would like to use the concept more generally to do a parallelization task like this:

Code:
@Transform("sam")
align_stampy = {
        exec """
           python $STAMPY_HOME/stampy.py  
           --bwaoptions="-q10 $REFERENCE" 
           -g $STAMPY_GENOME_INDEX
           -h $STAMPY_HASH_FILE
           -M $input1,$input2
           -o $output 
           --readgroup=ID:$rg_id,LB:$rg_lb,PL:$rg_pl,PU:$rg_pu,SM:$rg_sm
           --processpart=$part
           """
}

Bpipe.run {
    part("1/3", "2/3", "3/3") * [align_stampy]
}
Knowing that this does not work at the moment, I was wondering if it could be implemented or resolved somehow?

Thanks,
Tobias

04-20-2012, 06:50 AM   #11
simonzmmmmm
Junior Member
Location: Melbourne, Australia
Join Date: Mar 2012
Posts: 7

Hi Tobias,
Quote:
Originally Posted by linusvanpelt View Post
Hi Simon,

I am trying to do some parallelization with bpipe and hope you can help me out on a problem. Like in this example from the Wiki

Code:
Bpipe.run {
  chr(1..5) * [ hello ]
}
I would like to use the concept more general to do a parallelization task like this:
...

Knowing that this is not working I was wondering if this could be implemented or resolved somehow?
I was thinking exactly the same thought while I was implementing this feature - the ways of splitting things up to parallelize are quite arbitrary (by gene, by exon, by any arbitrary genomic coordinates, by anything at all ...). In the interest of expediency I made the first implementation specific to chromosome just to try out the idea, but I will definitely pursue a more generalized form of it. I've added an enhancement issue to track this so that you can get notified when progress is made on it:

http://code.google.com/p/bpipe/issue...&ts=1334929751
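As a sketch of what an arbitrary "N parts" split looks like outside Bpipe, the shell loop below launches one background job per part and waits for all of them. `run_part` is a stand-in that just writes a log line; in Tobias's case it would be the stampy command with `--processpart="$1"`. All names here are invented for the example.

```shell
# Stand-in for the real work on one part, e.g. "1/3".
run_part() {
    part=$1
    # ${part%/*} strips the "/3" suffix, giving one log file per part.
    echo "processed part $part" > "part_${part%/*}.log"
}

# run_parts N: process parts 1/N .. N/N in parallel.
run_parts() {
    n_parts=$1
    i=1
    while [ "$i" -le "$n_parts" ]; do
        run_part "$i/$n_parts" &   # each part runs as a background job
        i=$((i + 1))
    done
    wait                           # block until every part has finished
}
```

Calling `run_parts 3` covers the "1/3", "2/3", "3/3" case from the question; the per-part outputs would still need to be merged afterwards.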

Thanks for the feedback!

Simon

04-20-2012, 07:24 AM   #12
linusvanpelt
Junior Member
Location: La Jolla
Join Date: Apr 2011
Posts: 5

Hi Simon,

Thanks for adding this issue to your development queue. I will follow up on this.

Tobias

04-20-2012, 08:21 AM   #13
Alex Renwick
Member
Location: Houston, Texas
Join Date: Jul 2011
Posts: 44

I do a lot of shell scripts and make files, and this Bpipe looks like it will make my life much easier.

I have a problem using the torque queue. The process does not return after successfully executing one statement. For example, if I give it...

Code:
echo = {
  exec "echo this"
  exec "echo that"
}
Bpipe.run { echo }
...it would print "this" and then wait forever.

This only happens when using torque. Incidentally, in order to get the queue to run I had to export the QUEUE shell variable, set to the name of the pbs queue.

Any ideas about how to get it to work?

Edit:

I figured out how to get it to work. I configured the queue to keep completed jobs for a minute. It had been removing jobs from the queue immediately after completion, so the job status would never be shown as "completed". Now it's fixed.
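For anyone hitting the same hang, the two settings look roughly like this on a Torque server. The queue name `batch` and the 60 second value are examples only; `keep_completed` is shown here as a server-wide setting, and changing it needs Torque manager privileges.

```shell
# Name of the PBS queue to submit to, as described above
# ("batch" is an example name).
export QUEUE=batch

# Keep finished jobs visible to qstat for 60 seconds, so that their
# completed state can still be queried after they exit.
qmgr -c "set server keep_completed = 60"
```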

Last edited by Alex Renwick; 04-20-2012 at 08:58 AM.

05-04-2012, 01:54 AM   #14
colindaven
Senior Member
Location: Germany
Join Date: Oct 2008
Posts: 415

Nice job Simon, we like this tool very much. The wiki is great.