SEQanswers

Old 10-25-2010, 04:10 PM   #1
Bustard
Junior Member
 
Location: San Francisco Bay Area

Join Date: Oct 2010
Posts: 3
Default Parallelizing GERALD in Illumina CASAVA 1.7

Hi, I am interested in how others are accelerating their reassemblies with GERALD when processing HiSeq 2000 data.

Specifically, parallel make works perfectly fine on a multicore server. For instance, we typically process 8 lanes of PE GAIIx data in 24 hours on an 8-core Xeon server.

HiSeq data presents new challenges: processing 8 lanes takes nearly a week on the same 8-core server.

We are looking at purchasing a 32- or 64-core SMP-like server, but we are also interested in whether folks are taking advantage of Beowulf clusters and dividing the reassembly across nodes. We have considered running one lane per node as an alternative, but this adds some overhead for collating the data once the runs are complete.

Some folks have mentioned using Sun Grid Engine and qmake to divide up the problem. We use PBS Pro, so we face some issues porting this approach.

Can anyone comment on how they accelerated their implementation of the reassembly pipeline?

Thanks
Bustard is offline   Reply With Quote
Old 10-25-2010, 10:17 PM   #2
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258
Default

On a modern SMP machine you can safely run 2n + 1 concurrent jobs (where n is the number of cores) and still have a working machine. In your case you can issue

Code:
$ make -j 17
The only problem is the I/O. Indeed you will likely find many "D" processes (uninterruptible sleep) because they are stuck on read/write.
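For example, a quick way to spot those stuck processes (a sketch using standard ps output, nothing GERALD-specific):

Code:
$ ps -eo state,pid,comm | awk '$1 ~ /^D/'   # list processes in uninterruptible sleep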
Although I can work with SGE, I don't process my Illumina data there, because the underlying network file system is too slow for me (it's a small NFS-based cluster).
Oh, I should add that I don't own a HiSeq, so I don't know how the disk system it ships with performs.
HTH
d
dawe is offline   Reply With Quote
Old 10-26-2010, 05:22 AM   #3
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323
Default

The easiest way is to use Sun Grid Engine on your cluster, where qmake can spread the make targets across nodes. One way or another, your turnaround times for the analysis are going to go up for the HiSeqs compared to the GAIIs.

Also, as dawe mentions, pay special attention to your storage system. If you don't purchase the proper hardware and set it up correctly, you can end up increasing the running times.

Another alternative (the one I'd suggest) is to switch to bwa for your alignments (it is extremely I/O-friendly and very accurate). Still generate the GApipeline stats but skip the alignments. At the end of the pipeline, fire up bwa and then compute any stats you want from the BAM. If you don't want to code something up, Picard now comes with a bunch of command-line tools for extracting different stats from BAM files.
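For reference, a minimal sketch of that kind of post-pipeline workflow (file names are placeholders, and this assumes a bwa-indexed reference, bwa's aln/sampe commands, and the standalone Picard jars):

Code:
# paired-end alignment with bwa (reference indexed beforehand with 'bwa index ref.fa')
$ bwa aln -t 8 ref.fa lane1_1.fastq > lane1_1.sai
$ bwa aln -t 8 ref.fa lane1_2.fastq > lane1_2.sai
$ bwa sampe ref.fa lane1_1.sai lane1_2.sai lane1_1.fastq lane1_2.fastq > lane1.sam
# SAM -> sorted BAM, then stats via Picard
$ samtools view -bS lane1.sam | samtools sort - lane1.sorted
$ java -jar CollectAlignmentSummaryMetrics.jar INPUT=lane1.sorted.bam OUTPUT=lane1.metrics.txt REFERENCE_SEQUENCE=ref.fa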
__________________
-drd
drio is offline   Reply With Quote
Old 10-26-2010, 05:27 AM   #4
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258
Default

Quote:
Originally Posted by drio View Post
Another alternative (the one I'd suggest) is to switch to bwa for your alignments (it is extremely I/O-friendly and very accurate). Still generate the GApipeline stats but skip the alignments. At the end of the pipeline, fire up bwa and then compute any stats you want from the BAM. If you don't want to code something up, Picard now comes with a bunch of command-line tools for extracting different stats from BAM files.
I definitely agree! We run ELAND only when somebody asks specifically for eland_export files... other aligners perform much better in terms of running time and, most importantly, precision.
d
dawe is offline   Reply With Quote
Old 10-26-2010, 06:48 AM   #5
SillyPoint
Member
 
Location: Frederick MD, USA

Join Date: May 2008
Posts: 39
Default Paging?

In the situation you describe, I'd be a little suspicious about memory usage and paging. Because the tiles are 10 times larger, the Illumina pipeline uses a lot more memory for a HiSeq run than for a GAII run (in general -- not sure about GERALD specifically). If you have more data than RAM, the operating system will happily spend its time thrashing data between RAM and swap space.

Have a look at swap space usage with top. Also look at CPU usage: alignment should be pretty CPU-bound, so if there's a lot of I/O wait happening, it's probably paging I/O.
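For example, with standard Linux tools (nothing pipeline-specific):

Code:
$ free -m     # how much swap is actually in use
$ vmstat 5    # si/so columns show pages swapped in/out per interval
$ top         # watch %wa (I/O wait) and the RES column of the ELAND/GERALD processes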

--TS
SillyPoint is offline   Reply With Quote
Old 10-26-2010, 09:01 AM   #6
Bustard
Junior Member
 
Location: San Francisco Bay Area

Join Date: Oct 2010
Posts: 3
Default

Thanks all for the replies.

So I am certainly pursuing 2n jobs per SMP node using Hyper-Threading; our Nehalem processors are up to the task, and that is one avenue we are exploring. The +1 is interesting. Any rationale for why that extra process helps? Is it a parent housekeeping process?

Also, we have 500TB of Isilon IQ36K storage connected to the cluster via 10GbE. It is NFS-mounted, but we have good bandwidth (albeit with TCP latency). There are 21 nodes in the storage cluster, and we see throughput of around 200MB/s, so no worries there (I presume).

Our nodes have 192GB of RAM as well. With 2n jobs on a node, that's probably 11GB per process after subtracting OS overhead. Any thoughts on whether that is sufficient?

I like the idea of exploring BWA for alignments. I would just need to be confident our results are on par with ELAND/GERALD. But that's a great idea.

Has anyone successfully spread these alignment jobs across separate cluster nodes?

Thanks again for all the replies.
Bustard is offline   Reply With Quote
Old 10-26-2010, 09:55 AM   #7
SillyPoint
Member
 
Location: Frederick MD, USA

Join Date: May 2008
Posts: 39
Default

To paraphrase Mr. Gates, "192GB oughta be enough for anybody." I can't believe you're paging with that much RAM.

--TS
SillyPoint is offline   Reply With Quote
Old 10-26-2010, 10:46 AM   #8
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258
Default

Quote:
Originally Posted by Bustard View Post
Thanks all for the replies.

So I am certainly pursuing 2n jobs per SMP node using Hyper-Threading; our Nehalem processors are up to the task, and that is one avenue we are exploring. The +1 is interesting. Any rationale for why that extra process helps? Is it a parent housekeeping process?
Mmm... I guess it's something I inherited from when I used Gentoo Linux. In principle you can have two threads per core plus one "spinning around" to keep the queue moving :-) BTW, you can also add

Code:
-l FLOAT
to the make arguments to cap the machine's load average: make will not start new jobs while the load is above that value (a load of 2.5 should be enough).
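Putting the two flags together, the invocation would look something like this (the numbers are just illustrative; tune them to your machine):

Code:
$ make -j 17 -l 2.5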

Quote:
Originally Posted by Bustard View Post
Our nodes have 192GB of RAM as well. With 2n jobs on a node, that's probably 11GB per process after subtracting OS overhead. Any thoughts on whether that is sufficient?
Lucky you! I still don't have HiSeq data, but I suspect 11 GB/process is far from optimal. You can probably tune the ELAND_SET_SIZE parameter in your GERALD config file. From the CASAVA manual (1.6):
"CASAVA requires a minimum of 2GB RAM per core for a 50G run. The parameter ELAND_SET_SIZE in the GERALD config.txt specifies the maximum number of tiles aligned by each ELAND process. The default value is 40 which should keep the peak memory consumption below 2GB for a 50G run."

and

"The default value is 40 to ensure that the memory usage stays below 2 GB for a full 50G run
(450,000 clusters/mm2, 2 x 100 paired-end run). Only available for ANALYSIS eland_extended, ANALYSIS eland_pair, and ANALYSIS eland_rna."
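As a sketch, that parameter sits alongside the other analysis settings in the GERALD config.txt; something like the following (the path and the value of 20 are placeholders, not a recommendation):

Code:
ANALYSIS eland_pair
ELAND_GENOME /path/to/squashed_genome
ELAND_SET_SIZE 20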
dawe is offline   Reply With Quote
Old 10-27-2010, 05:49 AM   #9
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323
Default

Quote:
Originally Posted by Bustard View Post
So I am certainly pursuing 2n jobs per SMP node using Hyper-Threading; our Nehalem processors are up to the task, and that is one avenue we are exploring. The +1 is interesting. Any rationale for why that extra process helps? Is it a parent housekeeping process?
It all depends on your storage system and whether it can keep up with your processes. Try different -j values and plot something like this: http://shell2.reverse.net/~drio/bfast/bf.top/ When the pipeline is performing the alignment you should see your CPUs at 0% idle. The data comes from vmstat.
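If it helps, a minimal way to capture that data while a run is going (plain vmstat; kill it when the run finishes):

Code:
$ vmstat -n 10 > vmstat_j17.log &   # one sample every 10s; the "id" and "wa" columns show idle and I/O wait
$ make -j 17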
Quote:
Originally Posted by Bustard View Post
Also, we have 500TB of Isilon IQ36K storage connected to the cluster via 10GbE. It is NFS-mounted, but we have good bandwidth (albeit with TCP latency). There are 21 nodes in the storage cluster, and we see throughput of around 200MB/s, so no worries there (I presume).
200MB/s at what stage of the pipeline's execution? Are all the nodes computing?
Quote:
Originally Posted by Bustard View Post
Our nodes have 192GB of RAM as well. With 2n jobs on a node, that's probably 11GB per process after subtracting OS overhead. Any thoughts on whether that is sufficient?
So I assume you are running one analysis per lane on one node, correct? What is the running time of that? Why that much RAM?

Quote:
Originally Posted by Bustard View Post
I like the idea of exploring BWA for alignments. I would just need to be confident our results are on par with ELAND/GERALD. But that's a great idea.
You'll get better alignments, plus SAM output. SAM generation in the GApipeline (at least as of three months ago) was a little bit messed up.

Quote:
Originally Posted by Bustard View Post
Has anyone successfully spread these alignment jobs across separate cluster nodes?
Have you explored the possibility of installing SGE on your cluster? With that you'll be able to just run qmake, and SGE will parallelize the execution with the maximum granularity possible.
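For reference, a qmake invocation would look roughly like this (a sketch only: the parallel environment name "make" and the slot counts are site-specific assumptions, and the GERALD directory path is a placeholder):

Code:
$ cd /path/to/run_folder/Data/Intensities/BaseCalls/GERALD_dir
$ qmake -cwd -v PATH -pe make 16 -- -j 16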

If that is not an option, you'll have to carefully study the targets in the GApipeline Makefile and build a script that sets up the dependencies among targets, as well as the resources needed at each step. That can be time-consuming (and certainly boring). Also, I don't recommend it because Illumina can change the targets/Makefile (and they will), and then your pipeline will break.
__________________
-drd
drio is offline   Reply With Quote
Old 10-28-2010, 10:16 AM   #10
Bustard
Junior Member
 
Location: San Francisco Bay Area

Join Date: Oct 2010
Posts: 3
Default

Thanks for the tips, drio; here are some responses to your questions:

Quote:
Originally Posted by drio View Post
200MB/s at what stage of the pipeline's execution? Are all the nodes computing?
The 200MB/s was recorded via a synthetic throughput test (IOzone), so it is not an actual measurement during processing, just an upper bound on performance.
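For reference, the synthetic test was along these lines (a sketch; the record size, file size, and mount path shown here are illustrative rather than the exact parameters we used):

Code:
$ iozone -i 0 -i 1 -r 1m -s 8g -f /mnt/isilon/iozone.tmp   # sequential write then read, 1MB records, 8GB file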

Quote:
Originally Posted by drio View Post
So I assume you are running one analysis per lane on one node, correct? What is the running time of that?
No, we have typically run all lanes (with GAIIx data) on a single node. I am exploring splitting the work as you suggest, one lane per node.

Quote:
Originally Posted by drio View Post
Why that much RAM?
This is a general-use cluster, so lots of folks run R jobs for post-processing. We didn't get that much RAM just for the sequencing pipeline. As you can imagine it doesn't get used much, but there are peaks in folks' work that do approach 128GB+.

Quote:
Originally Posted by drio View Post
Have you explored the possibility of installing SGE on your cluster?
No, not to date, but again this is a general-use cluster, so we don't want to disrupt it by yanking PBS Pro. I am exploring the idea of using SGE on a separate cluster devoted to the sequencing pipeline.

Quote:
Originally Posted by drio View Post
Also, I don't recommend it because Illumina can change the targets/Makefile (and they will), and then your pipeline will break.
I agree. Not a great workaround and fragile at that.

What we are exploring is the use of SMP-like servers such as the HP DL580 G6 with 32 cores and 512GB of RAM. This may very well be our sweet spot, without too much extra effort spent dividing the work or changing the pipeline.

Thanks again for the input.
Bustard is offline   Reply With Quote

Tags
gerald cluster hiseq
