SEQanswers - Bioinformatics - Extreme parallelization for NGS analysis

jtietjen 08-11-2012 05:52 PM

Extreme parallelization for NGS analysis
I'd like to start an open discussion on the topic of parallelization for NGS data. I noticed that Galaxy recently came out with a cloud-based interface using Amazon EC2. I've been trying to learn more about how these NGS analysis algorithms (for alignment, assembly, etc.) are actually implemented in a parallel fashion, but I have had trouble finding specific documentation and resources describing how they work and how they are implemented. Any direction/resources that people can provide would be much appreciated. :)

Also, I have seen some papers describing parallelization of various specific algorithms, especially recently (such as PASQUAL from Georgia Tech), but they all seem to be operating on relatively "small" networks of distributed computing resources. Does anyone have any idea about how far the parallelization and speeding up of these analyses can be pushed? How difficult would it be to implement something that runs on a distributed network of, say, 100,000 computers, or even more... say a million? Is there a bottleneck somewhere that would prevent that from being feasible for NGS analysis? Or would that make the analyses amazingly fast compared to what's available now? I'm thinking of a system like what the SETI project has set up for their distributed computing user base, and wondering what the limits are and how one could implement such a system if the user base is already in place.
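Editor's note: the kind of distribution the question describes is usually a scatter/gather pattern, which works because reads are independent of one another. A minimal sketch, using Python's `multiprocessing` as a stand-in for a real grid (a cluster scheduler or a SETI-style volunteer network), and a hypothetical `align_chunk` placeholder instead of a real aligner:

```python
# Scatter/gather sketch for data-parallel NGS alignment: split the reads
# into chunks, process each chunk on a separate worker, merge the results.
# multiprocessing.Pool stands in for a distributed grid of machines.
from multiprocessing import Pool

def align_chunk(reads):
    """Placeholder for an aligner call (in practice, shelling out to a
    real tool); here it just reports each read's length."""
    return [(read_id, len(seq)) for read_id, seq in reads]

def scatter(reads, n_workers):
    """Deal the read list out round-robin into one chunk per worker."""
    return [reads[i::n_workers] for i in range(n_workers)]

if __name__ == "__main__":
    reads = [(f"read{i}", "ACGT" * (i + 1)) for i in range(100)]
    chunks = scatter(reads, n_workers=4)
    with Pool(4) as pool:
        results = pool.map(align_chunk, chunks)   # scatter step
    merged = [hit for chunk in results for hit in chunk]  # gather step
    print(len(merged))  # prints 100: every read accounted for
```

The pattern itself scales to arbitrarily many workers; as the replies below discuss, the practical limit is moving the data to them, not the compute.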

jtietjen 08-11-2012 05:53 PM

I realized after posting that people might point out that other threads already exist on parallelizing specific NGS analysis algorithms, but I decided to leave this thread very open-ended because, in the end, the system I have in mind should work for any and all current analysis/data-processing methods.

xied75 08-13-2012 03:25 AM

NGS analysis is mostly text processing (it doesn't matter whether the data is binary or compressed), so I/O is the bottleneck, whether in house or out over the Internet.

SETI (or maybe Folding@Home) is different: there, a small data file will keep a CPU happy for a long while, so shipping the data out to volunteers pays off.

Cloud (Amazon or whatever) is a business model: buy a large number of white-box servers and rent them out in one-hour units. It does not use fancy hardware, and it does not upgrade until the previous investment has paid itself back.

So today's situation is like this:
1. For a 4 TB hard drive, you can only get about 100 MB/s sequential read out of it.
2. You might have a PB-sized array in house, but only a 1 Gb/s Internet connection to the world.
3. This won't change for some years.
4. The LHC's infrastructure is the extreme/limit for now; anything they can't do or afford, no one can.
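Editor's note: the points above are easy to sanity-check with back-of-envelope arithmetic, using only the figures given in the post (100 MB/s disk, 1 Gb/s uplink):

```python
# Back-of-envelope check of the I/O bottleneck, using the post's figures.
def hours_to_read(size_tb, mb_per_s):
    """Hours to read size_tb terabytes at mb_per_s megabytes/second."""
    return size_tb * 1e6 / mb_per_s / 3600

def hours_to_transfer(size_tb, gbit_per_s):
    """Hours to ship size_tb terabytes over a gbit_per_s link."""
    return size_tb * 8e3 / gbit_per_s / 3600  # TB -> gigabits first

# Reading a full 4 TB drive at 100 MB/s sequential:
print(round(hours_to_read(4, 100), 1))    # 11.1 hours
# Shipping 1 TB to the cloud over a saturated 1 Gb/s link:
print(round(hours_to_transfer(1, 1), 1))  # 2.2 hours
```

So even before any computation starts, a million idle volunteer CPUs don't help if the data can't leave the building faster than this.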

ymc 08-13-2012 09:50 AM

1. This can change now if you have $$$ ;)
2. With eight SSDs in RAID 0, you can get 2500 MB/s sequential read.
3. InfiniBand gives you a 300 Gb/s network.
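Editor's note: the numbers quoted here are internally plausible, which a quick check shows. 2500 MB/s across eight striped SSDs only requires ~312 MB/s per drive, and 300 Gb/s of InfiniBand is 37.5 GB/s of raw bandwidth:

```python
# Sanity-check the throughput figures quoted in the post.
per_ssd_mb_s = 2500 / 8   # MB/s each drive must sustain in 8-way RAID 0
ib_gbytes_s = 300 / 8     # 300 Gb/s expressed in gigabytes per second
print(per_ssd_mb_s, ib_gbytes_s)  # prints: 312.5 37.5
```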

xied75 08-13-2012 01:24 PM


Originally Posted by ymc (Post 81292)
2. For eight SSDs in RAID0, you can get 2500MB/s sequential read

No no, that's not my point. I would rather say you can get 2500 MB/s random read (maybe; I don't have these to play with).


Originally Posted by ymc (Post 81292)
3. InfiniBand for 300Gbps network

No no again: I was talking about the Internet connection. The thread is asking about the cloud (unless private cloud is also included in the discussion).

kevyin 09-12-2012 07:23 PM

There are links here on deploying Galaxy in a cluster (and other things).

We have this deployed on our cluster, and jobs are basically distributed to the cluster nodes by Sun Grid Engine.

It's up to the tools themselves to do MPI/threading etc.

In a cloud setting, NGS data can get quite large, so storage may be an issue.
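Editor's note: the division of labor described here (the scheduler places jobs, the tool handles its own threading) can be sketched by building one SGE `qsub` submission per sample. The aligner command, the reference path `ref.fa`, and the `smp` parallel-environment name are assumptions about a local setup, not part of the original post:

```python
# Sketch: one Sun Grid Engine job per sample. SGE only places the job on
# a node; the tool's own -t flag does the threading, as the post notes.
def build_qsub(sample, fastq, threads=8):
    # Hypothetical aligner invocation; tool and reference are stand-ins.
    tool_cmd = f"bwa mem -t {threads} ref.fa {fastq}"
    # SGE flags: -N names the job, -pe smp N requests N slots on one
    # node, -cwd runs in the submit directory, -b y submits a command
    # line directly rather than a job script.
    return f"qsub -N align_{sample} -pe smp {threads} -cwd -b y {tool_cmd}"

print(build_qsub("sampleA", "sampleA.fastq"))
```

In a real deployment Galaxy generates and submits these jobs itself; the sketch just shows what lands on the scheduler.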

