
**maubp** · 12-30-2012, 12:50 PM

Did you use the -num_threads option? The speedup isn't linear, but it works pretty well.

The other good strategy to explore is to split your 16,000 queries into batches - say 16 sets of 1000 queries each - and try running multiple jobs at once (limited by the RAM and the size of the database).
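
For anyone who wants to try the batch idea, here is a minimal Python sketch of splitting a FASTA file into fixed-size batches. It is only an illustration: the function names, the `batch_000.fasta` naming scheme, and the batch size are assumptions made for the example, not anything BLAST+ provides.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_batches(records, batch_size):
    """Group records into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


def write_batches(fasta_path, batch_size, prefix="batch"):
    """Write one FASTA file per batch; file names are hypothetical."""
    paths = []
    with open(fasta_path) as handle:
        batches = split_into_batches(read_fasta(handle), batch_size)
        for index, batch in enumerate(batches):
            path = f"{prefix}_{index:03d}.fasta"
            with open(path, "w") as out:
                for header, seq in batch:
                    out.write(f"{header}\n{seq}\n")
            paths.append(path)
    return paths
```

Calling `write_batches("queries.fasta", 1000)` on a 16,000-sequence file would give 16 files that can each be handed to a separate blastx job. How many jobs can run at once still depends on the RAM and the database size, as noted above.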

**yaximik** · 12-31-2012, 08:05 AM

Well, the problem is that these 16,000 are already a chunk of a bigger data set of about 300,000 sequences, with 200 nt being the minimum length cutoff. I realize that it is a formidable task to process them by blastx, perhaps I need to resort to cloud resources. I just tried a simpler task to see if I get any hits.

I thought that BLAST+ had to be configured at build time to use multithreading, and I am not sure whether the precompiled binaries were built with this support. Were they?

**maubp** · 12-31-2012, 11:19 AM

If you are using the NCBI-provided BLAST+ binaries, they should support multi-threading, but they do not use it automatically - you must request it via the command line argument -num_threads.
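
As a rough illustration of requesting threads explicitly, here is a small Python sketch that assembles a blastx command line. The flags `-query`, `-db`, `-out`, `-num_threads` and `-outfmt` are standard BLAST+ options; the helper names and the file names in the usage note are assumptions for the example, and the sketch assumes `blastx` is on your PATH.

```python
import subprocess


def build_blastx_command(query, db, out, num_threads=1):
    """Return the blastx argument list with an explicit thread count."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-num_threads", str(num_threads),  # not enabled unless requested
        "-outfmt", "6",                    # plain tabular output
    ]


def run_blastx(query, db, out, num_threads=1):
    """Run blastx and raise CalledProcessError if it exits with an error."""
    command = build_blastx_command(query, db, out, num_threads)
    subprocess.run(command, check=True)
```

For example, `run_blastx("batch_000.fasta", "refseq_protein", "batch_000.tsv", num_threads=16)` would use up to 16 threads for one batch. Tabular output (`-outfmt 6`) is chosen here only because it is simple to parse afterwards.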

**severin** · 01-01-2013, 04:46 AM

GPU BLAST

If you have a GPU card, you could see if it is faster than your 16 cores. I am getting about 1500 sequences processed per minute against the SEED database of 1.7 million sequences with an Nvidia M2075. I am still experimenting with the best parameters to optimize performance.

**mjp** · 01-01-2013, 05:40 AM

A bit old but always valid - taken from http://1.usa.gov/10IIuCr:
"(...)The BLAST+ applications have a number of improvements that allow faster searches (...) These improvements include: splitting of longer queries so as to reduce the memory usage and to take advantage of modern CPU architectures; use of a database index to dramatically speed up the search; the ability to save a “search strategy” that can be used later to start a new search; and greater flexibility in the formatting of tabular results (...)."

Have you used the built-in features along with multi-threading?
What I would do is: index the database, let BLAST+ split long queries internally, and format the output to be as simple as possible.

Hope that helps.

**yaximik** · 01-01-2013, 09:24 AM

Thanks for the input; it is overwhelming how much I need to learn. Before I got a MiSeq, I was using online BLAST with its pretty interface. Once NCBI kicked me out because of the sharply increased data volume, I had to make a local install. Now it is time to learn the nuts and bolts of the CLI. Thanks for the advice, maubp and mjp; it is time to digest the manual.

Originally posted by severin:

> if you have a gpu card you could see if it is faster than your 16 cores. I am getting about 1500 sequences processed a minute against the SEED database of 1.7 million sequences with an Nvidia M2075. I am still experimenting with the best parameters to optimize performance.

Wow, that looks very promising! Could you enlighten me a bit more, as this is yet another set of nuts and bolts I have to learn on the fly? GPU means graphics processing unit, correct? Perhaps I could find and install one, but how do you plug it in to work with blast+? I presume it is used instead of the motherboard CPU?

**GenoMax** · 01-02-2013, 05:40 AM

Originally posted by yaximik:

> Wow, that looks very promising! Could you enlighten me a bit more, as this is yet another set of nuts and bolts I have to learn on the fly. GPU - graphic processor unit, correct? Perhaps I could find and install one, but how do you plug it in to work with blast+, I presume instead of motherboard CPU?

You are right about GPU. This is probably only going to work if you have your own machine (and are using some version of *unix).

Nvidia has a primer on GPU computing available here: http://www.nvidia.com/object/what-is-gpu-computing.html

A guide to using blast+ with GPUs is available: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

If you use a compute cluster for analyses, the admins may not take kindly to adding GPUs to compute nodes (if your cluster uses blades then that would not even be an option).

There are dedicated nodes (that can be part of a cluster) with just GPUs. An example is described here: http://en.community.dell.com/techcen...wiki/2428.aspx

**yaximik** · 01-02-2013, 06:19 AM

Thanks for the links, GenoMax. Yes, I do have a T610 (tower), which I administer myself, and preliminary consultations with Dell tech support suggest that the latest K20x could fit in there. On the other hand, the Nvidia web site indicates that both the M2050/2090 and K20 series can work with RHEL54, and the guys from LinuxQuestions think it should work with RHEL58 too. I need to talk to Nvidia experts to find out whether there may be issues with PCI compatibility, unless this is standard and the main requirements are drivers. Making this work would be awesome.

**gsgs** · 01-02-2013, 09:44 AM

I don't know blastx.
nt = nucleotides?
But refseq_protein is amino acids?!
How big is refseq_protein? Can you post a sample so I can try?
10 out of the 16,000 or so, then I can estimate the time.

**yaximik** · 01-02-2013, 09:55 AM

Originally posted by gsgs:

> I don't know blastx.
> nt = nucleotides?
> But refseq_protein is amino acids?!
> How big is refseq_protein? Can you post a sample so I can try?
> 10 out of the 16,000 or so, then I can estimate the time.

blastx takes nucleotide sequences, translates them in 6 frames and compares them against a protein sequence database, in this case the BLAST-preformatted NCBI refseq_protein. 10 sequences are processed in no time on my box...
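
To make the "translates in 6 frames" step concrete, here is a simplified Python sketch using the standard genetic code. It only illustrates the idea and is not how blastx is implemented internally; the function names and the +1..+3 / -1..-3 frame labels are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame label to translated protein string."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Each query therefore yields six protein sequences (stop codons appear as `*`), and every one of them is searched against the protein database, which is part of why blastx is so much more expensive than a plain nucleotide search.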

**gsgs** · 01-02-2013, 10:04 AM

These ones?

ftp://ftp.ncbi.nlm.nih.gov/blast/db/

File: refseq_protein.00.tar.gz 828068 KB 31.12.2012 02:30:00
File: refseq_protein.01.tar.gz 713984 KB 31.12.2012 02:31:00
File: refseq_protein.02.tar.gz 696391 KB 31.12.2012 02:31:00
File: refseq_protein.03.tar.gz 673329 KB 31.12.2012 02:32:00
File: refseq_protein.04.tar.gz 679706 KB 31.12.2012 02:32:00
File: refseq_protein.05.tar.gz 590931 KB 31.12.2012 02:33:00

-------------------------------

So it translates your nt into amino acids and searches for the
best matches in their > 10 GB of proteins, for each of the 16,000 × 6 translations?

----------------------------------------
... reading : http://en.wikipedia.org/wiki/BLAST
(sounds complicated)
------------------------------------

**yaximik** · 01-02-2013, 10:12 AM

Yep, this one. That is the task, which needs to be done for each sequence of ~300,000 seqs data set, totalling about 111 MB only. Fun, huh? Is there a better way to do this, as I do not have a nucleotide reference?

**gsgs** · 01-02-2013, 10:47 AM

According to the Wikipedia BLAST article, it seems to depend on the scoring matrix (BLOSUM62?):

http://en.wikipedia.org/wiki/BLAST

-----------------------------------

Accelerated versions
CLC bio and SciEngines GmbH collaborate on an FPGA accelerator they claim will give 188x acceleration of BLAST.
TimeLogic offers another FPGA-accelerated implementation of the BLAST algorithm called Tera-BLAST.
The Mitrion-C Open Bio Project is an ongoing effort to port BLAST to run on Mitrion FPGAs.
The CUDA-BLASTP is a version of BLASTP that is GPU-accelerated and is claimed to run up to 10x faster than NCBI BLAST.

-------------------------------------------------
An extremely fast but considerably less sensitive alternative to BLAST is BLAT (Blast Like Alignment Tool). While BLAST does a linear search, BLAT relies on k-mer indexing the database, and can thus often find seeds faster.

Recent advances in sequencing technology have made searching for very similar nucleotide matches an important problem. New alignment programs tailored for this use typically use BWT-indexing of the target database (typically a genome). Input sequences can then be mapped very quickly, and output is typically in the form of a BAM file. Example alignment programs are BWA, SOAP, and Bowtie.

For protein identification, searching for known domains (for instance from Pfam) by matching with Hidden Markov Models is a popular alternative, such as Hmmer.

------------------------------------------------------

**pallevillesen** · 01-03-2013, 05:28 AM

So have you simply tested the

-num_threads <Integer, >=1>
Number of threads (CPUs) to use in the BLAST search
Default = `1'

option as suggested? And how did it go?

Anyway, for blasting 300k sequences I would find a cluster somewhere and parallelize the job across as many nodes as possible.

how to speed up blast+

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News