SEQanswers

Old 02-04-2009, 07:45 AM   #1
jperin
Member
 
Location: Philadelphia

Join Date: Feb 2009
Posts: 10
Parallel Processing for Sequence Analysis

Hello,

I'm fairly new here and have been trying to get our systems configured properly for NGS analysis. I'm primarily concerned with ABI CS data, but will also be quite heavily involved with Solexa. Corona has its own built-in tools for configuring its applications to run on top of Torque PBS for processing on a cluster; this seems to work quite well. I've been searching for other options and am not finding very much. Solexa's GAPipeline appears to have some basic tools for parallelization, but we're not big fans of ELAND and would prefer to use MAQ or Bowtie for alignments. I have not found much information on batch job submission for these two tools.

I'm hoping to get some feedback from anyone with more experience on ways to either parallelize MAQ, Bowtie, etc., or at least break up the jobs so that they can be submitted in a naively parallel fashion. Thanks in advance!
Old 02-04-2009, 08:08 AM   #2
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236

I'm probably the wrong person to attempt to answer your question, but as far as I know, we just run each lane through maq one at a time, then use mapmerge to assemble libraries back together. Thus, we often have eight maq jobs running at a time on the cluster, for each machine in operation. Again, I'm not the person who submits the jobs, so other people can probably provide more information than I can.

Sequence alignment theoretically belongs to the class of algorithms known as embarrassingly parallel: each sequence could theoretically be aligned by a separate computer and the results then recombined. The question should just be what the optimal number of reads to align per instance is... and that I don't know. (-:
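As a minimal sketch of that idea (not taken from any poster's pipeline), the Python below groups FASTQ reads into fixed-size chunks that could each be handed to a separate aligner instance. The function names and the `reads_per_chunk` parameter are hypothetical, and the code assumes plain four-line FASTQ records without wrapped sequence lines.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def chunk_reads(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[List[str]]]:
    """Group FASTQ records into chunks of at most reads_per_chunk reads."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    records = fastq_records(lines)
    while True:
        chunk = list(islice(records, reads_per_chunk))
        if not chunk:
            return
        yield chunk
```

Each chunk can be written to its own file and aligned independently. The best chunk size depends on the cluster and the aligner, so it is left as a parameter here.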
__________________
The more you know, the more you know you don't know. —Aristotle
Old 02-04-2009, 08:14 AM   #3
jperin
Member
 
Location: Philadelphia

Join Date: Feb 2009
Posts: 10

Hm. The idea of separating lanes is good. I am familiar with most embarrassingly parallel methods for sequence analysis, but was hoping there might be some established methods specifically for NGS that have been developed. I am particularly interested in setting up a few processing pipelines that can be triggered (relatively automatically) and then run across our cluster system, then packaged up for post processing and results delivery.

Tools like the corona pipeline are ideal because they are pre-configured to do so off the bat. MAQ would require some initial configuration and some scripts here and there to accomplish this. I guess a generic tool for parallelizing things may be too much to ask for, but aside from splitting up lanes, or splitting up each individual alignment task, I'm wondering what else might be able to work? Bowtie can split work across multiple cores with the '-p' option, and I would hope that this can somehow be extended across multiple systems as well. But that's where I start to get lost and find myself trying to figure out the code at a much lower level, which is going to take me a very long time to solve...
Old 02-04-2009, 02:39 PM   #4
Ben Langmead
Senior Member
 
Location: Baltimore, MD

Join Date: Sep 2008
Posts: 200

Hi jperin,

With respect to Bowtie, the -p option allows you to parallelize Bowtie in the sense of using multiple threads (which are hopefully mapped to multiple processor cores) on a single machine. For parallelizing across machines, I do not really have a pre-fab set of scripts for that. As an aside, I'm currently doing some work on getting Bowtie to work in a Cloud Computing framework, specifically using Hadoop. This would allow Bowtie to be parallelized across any cluster that has Hadoop installed, including Amazon's EC2 service. That's not ready for prime time yet, though.

Thanks,
Ben
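One hedged way to combine the two levels of parallelism discussed above (threads within a machine via -p, separate jobs across machines) is to submit one Torque/PBS job per chunk of reads, with each job running Bowtie with -p set to the cores requested on that node. The Python sketch below only builds the job script text. The function name, job name, and the index and file names in the usage note are hypothetical, and the positional argument order (index, reads, output) is an assumption based on Bowtie's usual command-line usage that should be checked against the installed version.

```python
import shlex


def bowtie_pbs_script(index: str, reads: str, hits: str,
                      threads: int = 4, job_name: str = "bowtie_chunk") -> str:
    """Return the text of a PBS job script that runs one Bowtie chunk.

    Assumption: Bowtie is invoked as `bowtie -p N <index> <reads> <hits>`.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    command = ["bowtie", "-p", str(threads), index, reads, hits]
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",
        # Request one node with as many cores as Bowtie threads.
        f"#PBS -l nodes=1:ppn={threads}",
        # Torque sets PBS_O_WORKDIR to the directory qsub was run from.
        'cd "$PBS_O_WORKDIR"',
        " ".join(shlex.quote(arg) for arg in command),
    ]
    return "\n".join(lines) + "\n"
```

For example, `bowtie_pbs_script("hg18", "chunk_1.fq", "chunk_1.map", threads=8)` yields a script that can be saved to a file and submitted with `qsub`, one script per chunk.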
Old 02-04-2009, 04:21 PM   #5
vruotti
Member
 
Location: US

Join Date: Feb 2008
Posts: 13
MAQ on cluster

A few comments here.
Here is a nice trick posted by Quang.


Hi Victor,
We use "maq fastq2bfq -n 1000000 ..." to split the reads.
....

Q

More here.
http://groups.google.com/group/sge-l...f3a6f6b501240c
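The split-then-merge workflow described in this thread (split reads with fastq2bfq -n, map each piece, then recombine with mapmerge) can be sketched as a list of commands to submit. This is only an illustration: the function and file names are hypothetical, the chunk .bfq files are assumed to already exist from the fastq2bfq step, and the argument order for `maq map` (output, reference, reads) and `maq mapmerge` (output, then inputs) is written from memory of the MAQ usage text and should be verified before use.

```python
from typing import List, Tuple


def maq_commands(reference_bfa: str, chunk_bfqs: List[str],
                 merged_map: str) -> Tuple[List[List[str]], List[str]]:
    """Build one `maq map` command per read chunk plus a final `maq mapmerge`.

    The map commands are independent and can run as separate cluster jobs;
    the merge command must run only after all of them have finished.
    """
    if not chunk_bfqs:
        raise ValueError("need at least one chunk of reads")
    map_commands = []
    chunk_maps = []
    for bfq in chunk_bfqs:
        out_map = bfq + ".map"  # hypothetical naming scheme for per-chunk output
        chunk_maps.append(out_map)
        map_commands.append(["maq", "map", out_map, reference_bfa, bfq])
    merge_command = ["maq", "mapmerge", merged_map] + chunk_maps
    return map_commands, merge_command
```

The commands are returned as argument lists rather than executed, so they can be wrapped in whatever job scripts the local scheduler expects.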
Old 02-05-2009, 06:48 AM   #6
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Quote:
Originally Posted by jperin View Post
Tools like the corona pipeline are ideal because they are pre-configured to do so off the bat. MAQ would require some initial configuration and some scripts here and there to accomplish this. I guess a generic tool for parallelizing things may be too much to ask for, but aside from splitting up lanes, or splitting up each individual alignment task, I'm wondering what else might be able to work?
As far as I know, the Corona pipeline does not do anything fancy. All it does is split up the alignment task by chromosome, with one CPU per 'chromosome' (note that a 'chromosome' could be a single contig/BAC/etc., depending on your organism). If you have a single chromosome, then Corona will only use one CPU.

I could be running Corona lite improperly in which case let me know! But my experience is that Corona does not employ anything more than the same-old-same-old embarrassingly parallel methods.
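To illustrate the per-chromosome splitting described above, here is a small Python sketch that divides a multi-sequence FASTA reference into one unit of work per sequence. It is not Corona's actual implementation; the function name is hypothetical, and it simply shows why a reference with a single sequence yields only one task.

```python
from typing import Dict, Iterable, List


def split_fasta_by_sequence(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA sequence name to its sequence, one entry per 'chromosome'."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a name")
            name = fields[0]  # use the first word of the header as the name
            if name in parts:
                raise ValueError(f"duplicate sequence name: {name}")
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            parts[name].append(line)
    return {seq_name: "".join(chunks) for seq_name, chunks in parts.items()}
```

The number of keys in the returned dictionary is the number of independent alignment tasks available under this scheme, so splitting the reads rather than the reference is needed to use more CPUs on a single-chromosome genome.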