SEQanswers




Old 06-28-2013, 10:45 AM   #1
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Opinions needed: Phi vs GPU in bioinformatics

I would like to hear expert opinions on the pros and cons of Nvidia GPU-based versus Xeon Phi coprocessor-based architectures for bioinformatics applications. I realize that not all programs can take advantage of parallelization without significant redesign and programming effort, but if I have the choice of acquiring a dedicated server built on either of these platforms, which would be the better investment in terms of computing efficiency and future prospects?
Old 06-28-2013, 11:20 AM   #2
Lee Sam
Member
 
Location: Ann Arbor, MI

Join Date: Oct 2008
Posts: 57

Quote:
Originally Posted by yaximik
I would like to hear expert opinions on the pros and cons of Nvidia GPU-based versus Xeon Phi coprocessor-based architectures for bioinformatics applications. I realize that not all programs can take advantage of parallelization without significant redesign and programming effort, but if I have the choice of acquiring a dedicated server built on either of these platforms, which would be the better investment in terms of computing efficiency and future prospects?
We discussed this a little while back, and I think it really comes down to which application you're trying to accelerate, and whether anyone has invested the effort to port that task to GPU or Phi. One of the issues is that quite a few GPU-accelerated projects haven't been particularly well maintained. Admittedly the Phi can (supposedly) run x86 code without modification, but the performance boost is an unknown for us.
Old 06-29-2013, 04:59 AM   #3
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205

I posted the question in general terms to get broad opinions, although I realize the answer depends greatly on the particular applications and needs. For example, right now I am running blastx from the BLAST+ package on my dataset. On a grid utilizing on average 400-500 threads, this has been running nonstop for two months and has so far processed about half of the dataset. So it is obviously one candidate for more parallelization. Old-fashioned de novo assembly is another, as the available assemblers that use de Bruijn graphs have so far produced dismal results, although I cannot claim I have explored all options.
But my question was meant generically: can the advantages and disadvantages of the two platforms be compared? I found some generic comparisons elsewhere, but without the specifics characteristic of bioinformatics tasks, so I thought it might be more productive to seek answers here.
Old 06-29-2013, 05:27 AM   #4
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372

Quote:
Originally Posted by yaximik
For example, right now I am running blastx from the BLAST+ package on my dataset. On a grid utilizing on average 400-500 threads, this has been running nonstop for two months and has so far processed about half of the dataset.
I'm curious: what are you blasting, and against what? Two months seems an awfully long time to blast something. Also, why blastx? Wouldn't it be a lot faster to first predict proteins with an algorithm of your choice (I like FragGeneScan) and then blastp against a protein db? That would also make more sense biologically, since afaik you can't use multiple genetic codes with blastx at once. Have you parallelized your blast properly? The num_threads option alone is a very poor solution. As a benchmark, blastp of some 2.5 million proteins against nr took me about 2 days on our cluster (I think 18 nodes with 16 Xeon cores and 512 GB RAM each, plus 2 nodes with 32 Xeon cores and 768 GB RAM each), though I wasn't the only one using it. I parallelized the blasts by splitting the input sequences and then launching an array of blast jobs in SGE with 8 threads per blastp instance (at peak I think I had maybe 300 simultaneous threads going).

Last edited by rhinoceros; 06-29-2013 at 06:07 AM.
Old 06-29-2013, 10:35 AM   #5
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Perhaps this blastx is for a metagenomics project? In that case, have you tried assembling the reads / finding long ORFs and removing redundancy from the proteins, or using established analysis methods/pipelines?

I also wonder why you consider the de novo assemblies "dismal" and how you think using GPU/Phi might improve the current situation.
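[For readers following along: the "find long ORFs" filtering step mentioned above can be sketched roughly as follows. This is a minimal illustration, not any published tool; the function names and the 90-nt cutoff are made up for the example.]

```python
# Minimal sketch of naive ORF finding in all six frames, so that only
# reads with some coding potential are sent on to the expensive BLAST
# step. Uses standard-code stop codons only.

STOPS = {"TAA", "TAG", "TGA"}  # standard-code stop codons

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_frame(seq, frame):
    """ATG-to-stop ORFs (stop codon included) in one reading frame."""
    orfs, start = [], None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            orfs.append(seq[start:i + 3])
            start = None
    return orfs

def long_orfs(seq, min_nt=90):
    """All ORFs of at least min_nt nucleotides on both strands."""
    found = []
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            found += [o for o in orfs_in_frame(strand, frame)
                      if len(o) >= min_nt]
    return found
```

Reads for which `long_orfs()` returns nothing could be dropped, or routed to nucleotide-level searches only, before blastx/blastp.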

Last edited by lh3; 06-29-2013 at 10:38 AM.
Old 06-29-2013, 12:46 PM   #6
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205

I now have some 200 million MiSeq reads in a dozen or so files, each blastx'ed individually in 6 frames against nr. I split each file into 500 chunks and run an SGE array job using 8-12 threads per chunk; on average it takes 4-6 days to complete one array job of 500 chunks. On average, I can get 300-500 threads allocated on the grid for each array job. But this is just one iteration, so it is going to be a very long haul.
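[For readers following along: the chunking step described above can be sketched like this. A hedged illustration with made-up function names, not the actual scripts used in the thread; dedicated splitters exist and real pipelines often use them instead.]

```python
# Illustrative sketch: split FASTA records into N roughly equal chunks
# so each chunk can become one task of an SGE array job.

def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_records(records, n_chunks):
    """Deal records round-robin into n_chunks blocks of FASTA text."""
    chunks = [[] for _ in range(n_chunks)]
    for i, (header, seq) in enumerate(records):
        chunks[i % n_chunks].append(header + "\n" + seq)
    return ["\n".join(c) for c in chunks]
```

Each chunk would then be written to its own file (say, chunk_0001.fa ... chunk_0500.fa) and submitted as one array, e.g. `qsub -t 1-500` with the task script selecting its chunk via `$SGE_TASK_ID`; the exact submission flags depend on the local SGE setup.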

I did not know about the FragGeneScan option, so I just use blastx. Is it better? The major issue is that I cannot use any reference. I tried the human genome, but about 80% of the dataset was filtered out for lack of a significant match. Since it is an archaeological specimen, a lot of sequences are expected to be bacterial/fungal contamination, but that is manageable.

I tried de novo assembly with a few tools like Ray and got a longest contig of about 40 kb plus a lot of shorter contigs, yet blastn'ing or blastx'ing them did not really work, as the program crashed after about a week. That is too long to wait for such a result, and splitting datasets with long contigs is much more problematic. So I resorted to analyzing individual reads, with the idea of first assessing the metagenomic content of each individual run in the dataset. Then I can remove obviously contaminating (bacterial/fungal) sequences and see what I can do with the rest.
Old 06-29-2013, 01:28 PM   #7
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372

With blastx, you select a genetic code (default = 1, I think), so for example UGA will signal termination of translation. However, in many genetic codes UGA = Trp. So especially in metagenomic studies (and anything related to mitochondria), you should always predict proteins first with an algorithm that takes these kinds of things into account, and only then do blasts..
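[For readers following along, a tiny illustration of the point above. Only the codons needed for the example are listed; real predictors use the full NCBI translation tables.]

```python
# TGA is a stop codon in the standard code (NCBI table 1) but codes
# for tryptophan in e.g. the mold/protozoan mitochondrial code and in
# Mycoplasma (NCBI table 4).
STANDARD = {"TGA": "*", "TGG": "W"}   # tiny excerpt of table 1
TABLE_4  = {"TGA": "W", "TGG": "W"}   # tiny excerpt of table 4

def translate(dna, table):
    """Translate codon by codon; 'X' for codons not in the excerpt."""
    return "".join(table.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

fragment = "TGATGG"
# Under the standard code the fragment looks prematurely terminated;
# under table 4 it is an ordinary two-residue peptide.
```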

Did you dereplicate your reads prior to blasting? This might/probably would reduce their number significantly.
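[For readers following along: the dereplication step might look something like this. A sketch for exact duplicates only; dedicated dereplication tools also collapse prefix duplicates and reverse complements.]

```python
# Collapse identical reads, keeping the first header and a copy count,
# so each unique sequence is blasted only once.

def dereplicate(records):
    """records: iterable of (header, seq).
    Returns [(header, seq, count)] with one entry per unique sequence,
    in first-seen order."""
    seen = {}    # seq -> [first header, count]
    order = []   # unique sequences in first-seen order
    for header, seq in records:
        if seq not in seen:
            seen[seq] = [header, 0]
            order.append(seq)
        seen[seq][1] += 1
    return [(seen[s][0], s, seen[s][1]) for s in order]
```

The counts can be carried along (e.g. in the FASTA headers) so that per-read abundances can be restored after the BLAST results come back.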

Last edited by rhinoceros; 06-29-2013 at 04:36 PM.