SEQanswers

Old 08-11-2012, 04:52 PM   #1
jtietjen
Junior Member
 
Location: San Diego, California

Join Date: Aug 2012
Posts: 2
Extreme parallelization for NGS analysis

I'd like to start an open discussion on the topic of parallelization for NGS data. I noticed that Galaxy recently came out with a cloud-based interface using Amazon EC2. I've been trying to learn more about how these NGS analysis algorithms (for alignment, assembly, etc.) are actually run in parallel, but I have had trouble finding specific documentation and resources describing how it works and how it is implemented. Any direction/resources that people can provide would be much appreciated.

Also, I have seen some recent papers describing parallelization of various specific algorithms (such as PASQUAL from Georgia Tech), but they all seem to operate on relatively "small" networks of distributed computing resources. Does anyone have any idea how far the parallelization and speed-up of these analyses can be pushed? How difficult would it be to implement something that runs on a distributed network of, say, 100,000 computers, or even more... say a million? Is there a bottleneck somewhere that would prevent that from being feasible for NGS analysis? Or would that make the analyses amazingly fast compared to what's available now? I'm thinking of a system like the one the SETI@home project has set up for its distributed computing user base, and wondering what the limits are and how one could implement such a system if the user base is already in place.
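One common way to reason about the "how far can it be pushed" question is Amdahl's law: if some fraction of a job cannot be parallelized, that fraction caps the speed-up no matter how many machines are added. The sketch below is only an illustration; the 99% parallel fraction is an assumed number, not a measurement of any NGS tool.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speed-up under Amdahl's law for a given number of workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Assumed for illustration: 99% of the work parallelizes perfectly.
ASSUMED_PARALLEL_FRACTION = 0.99

speedup_100k = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, 100_000)
speedup_1m = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, 1_000_000)

print(f"100,000 workers:   {speedup_100k:.2f}x")
print(f"1,000,000 workers: {speedup_1m:.2f}x")
```

Under that assumption both figures stay just below 100x, so going from 100,000 to a million machines buys almost nothing; the serial part (which in practice often includes moving the data) dominates.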
Old 08-11-2012, 04:53 PM   #2
jtietjen
Junior Member
 
Location: San Diego, California

Join Date: Aug 2012
Posts: 2

I realized after posting that people might point out that other threads already cover parallelization of specific NGS analysis algorithms, but I decided to leave my thread very open-ended because, in the end, the system I have in mind should work for any and all current analysis/data processing methods.
Old 08-13-2012, 02:25 AM   #3
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129

NGS analysis is mostly text processing (whether the files are binary or compressed doesn't matter), so I/O is the bottleneck, both in house and over the Internet.

With SETI@home (or maybe Folding@Home), a small data file keeps the CPU busy for a while; NGS data is not like that.

Cloud (Amazon or whatever) is a business model: buy a large number of white-box servers and rent them out in 1-hour units. It does not use fancy hardware, and it does not upgrade until the previous investment has paid for itself.

So today's situation is like this:
1. From a 4TB hard drive, you can only get 100MB/s sequential read.
2. You might have a PB-sized array in house, but you only have a 1Gb/s Internet connection to the world.
3. This won't change for some years.
4. LHC's infrastructure is the extreme/limit for now; anything they can't do or afford, no one can.
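To put rough numbers on points 1 and 2, here is a small back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), a fully saturated link, and no protocol overhead, so real transfers would be slower; the function name is made up for this example.

```python
TB = 10**12  # bytes, decimal units assumed
PB = 10**15
MB = 10**6

def transfer_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600

# Point 1: read a full 4TB drive at 100MB/s sequential.
disk_hours = transfer_hours(4 * TB, 100 * MB)

# Point 2: push 1PB through a 1Gb/s link (1 gigabit/s = 125 MB/s).
link_rate = 1e9 / 8
pb_days = transfer_hours(1 * PB, link_rate) / 24

print(f"4TB at 100MB/s: {disk_hours:.1f} hours")
print(f"1PB over 1Gb/s: {pb_days:.1f} days")
```

That works out to roughly 11 hours just to read one drive end to end, and about three months to ship a petabyte over the Internet link, before any computation happens.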
Old 08-13-2012, 08:50 AM   #4
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498

1. This can change now if you have $$$
2. For eight SSDs in RAID0, you can get 2500MB/s sequential read
3. InfiniBand for 300Gbps network
Old 08-13-2012, 12:24 PM   #5
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129

Quote:
Originally Posted by ymc
2. For eight SSDs in RAID0, you can get 2500MB/s sequential read
No, that's not my point. I would rather say you can get 2500MB/s random read (maybe; I don't have these to play with).

Quote:
Originally Posted by ymc
3. InfiniBand for 300Gbps network
No again: I was talking about the Internet connection. The thread is asking about the cloud (unless private cloud is also included in the discussion).
Old 09-12-2012, 06:23 PM   #6
kevyin
Junior Member
 
Location: Sydney

Join Date: Jul 2011
Posts: 2

There are links here on deploying Galaxy on a cluster (and other things):

http://wiki.g2.bx.psu.edu/Admin/Conf...%2FPerformance

We have this deployed on our cluster, and jobs are basically distributed to cluster nodes by the Sun Grid Engine.

It's up to the tools themselves to do MPI, threading, etc.

In a cloud setting, NGS data can get quite large, so storage may be an issue.
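For anyone wondering what "distributing jobs" looks like in the simplest case, below is a minimal scatter/gather sketch: split the reads into chunks, process each chunk independently, then combine the partial results. This is not how Galaxy or Sun Grid Engine is implemented; a local thread pool stands in for cluster nodes, and the per-chunk work (counting G/C bases) and all function names are placeholders chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(reads, chunk_size):
    """Scatter step: cut the read list into independent chunks."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]

def count_gc(chunk):
    """Placeholder per-chunk job: count G and C bases in a chunk of reads."""
    return sum(read.count("G") + read.count("C") for read in chunk)

def scatter_gather(reads, chunk_size, workers):
    """Run one job per chunk, then gather (sum) the partial results."""
    chunks = split_into_chunks(reads, chunk_size)
    # The thread pool is a stand-in for a scheduler handing jobs to nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)

reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
total_gc = scatter_gather(reads, chunk_size=2, workers=2)
print(total_gc)
```

This only works so cleanly because each read can be handled without looking at the others. Every chunk still has to reach its worker first, which is where the I/O and storage concerns raised in this thread come back in.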

Tags
cloud computing, distributed computing, ngs data analysis, parallelization

All times are GMT -8.