  • how to speed up blast+

    This may sound naive, but is there a way to speed up blast? I have a local instance of the latest blast+ and all the needed preformatted libraries on a 64-bit RHEL58 box, which has 16 processors. I was a bit surprised, though, that a relatively simple job involving blastx of 16,000 sequences of 200 nt each against refseq_protein took over 2 days. When I examined the 'top' output while the job was running, each processor was occupied at about 6% (totaling 100%), so, as this was the only job running, the processors did not seem to be used very efficiently. Is there a way to speed up the job by using more processor load? Maybe I need to recompile blast+ from source (I just followed the instructions for installing precompiled binaries) to enable multithreading or something? Thanks for any input.

  • #2
    Did you use the -num_threads option? The speed up isn't linear but does work pretty well.

    Another good strategy to explore is to split your 16,000 queries into batches, say 16 sets of 1,000 queries each, and try running multiple jobs at once (limited by the RAM and the size of the database).
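
A minimal sketch of the batch-splitting idea, assuming the queries sit in one plain FASTA file. The function names, the `batch_NN.fasta` naming scheme, and the round-robin distribution are illustrative choices, not part of blast+.

```python
from pathlib import Path
from typing import Iterator, List


def read_fasta_records(path: Path) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    with open(path) as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            if line.strip():
                # Normalise line endings so records concatenate cleanly.
                record.append(line.rstrip() + "\n")
    if record:
        yield "".join(record)


def split_fasta(path: Path, n_batches: int, out_dir: Path) -> List[Path]:
    """Distribute records round-robin over n_batches files.

    All records are held in memory, which is fine for a query set of
    around a hundred megabytes but not for very large inputs.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    for i, record in enumerate(read_fasta_records(path)):
        batches[i % n_batches].append(record)

    written: List[Path] = []
    for i, batch in enumerate(batches):
        if not batch:
            continue  # fewer records than batches
        out_path = out_dir / f"batch_{i:02d}.fasta"  # hypothetical naming scheme
        out_path.write_text("".join(batch))
        written.append(out_path)
    return written
```

Each resulting file can then be given to its own blast job; how many jobs fit at once depends on the RAM and database size, as noted above.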



    • #3
      Well, the problem is that these 16,000 are already a chunk of a bigger data set of about 300,000 sequences, with 200 nt being the minimum length cutoff. I realize that it is a formidable task to process them by blastx, perhaps I need to resort to cloud resources. I just tried a simpler task to see if I get any hits.

      I thought that to use multithreading blast+ should be configured for that, but I am not sure if precompiled binaries were configured with this support. Were they?
      Last edited by yaximik; 12-31-2012, 08:07 AM.



      • #4
        If you are using the NCBI provided BLAST+ binaries they should support multi-threading, but they do not do this automatically - you must request it via the command line argument -num_threads in order to use this.
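
As a rough sketch of what requesting threads looks like in practice, the helper below assembles a blastx command line with `-num_threads`. The file names are placeholders and the helper functions themselves are hypothetical; `-query`, `-db`, `-out` and `-num_threads` are the standard blast+ options.

```python
import subprocess
from typing import List


def build_blastx_command(query: str, db: str, out: str,
                         num_threads: int = 1) -> List[str]:
    """Assemble a blastx argument list; nothing is executed here."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-num_threads", str(num_threads),  # multithreading must be requested explicitly
    ]


def run_blastx(cmd: List[str]) -> None:
    """Run the command; assumes the blastx binary is on PATH."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = build_blastx_command("queries.fasta", "refseq_protein",
                               "hits.txt", num_threads=16)
    print(" ".join(cmd))
```

Printing the command first makes it easy to check it before committing to a long run.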



        • #5
          gpu blast

          If you have a GPU card, you could see whether it is faster than your 16 cores. I am getting about 1500 sequences processed a minute against the SEED database of 1.7 million sequences with an Nvidia M2075. I am still experimenting with the best parameters to optimize performance.



          • #6
            A bit old but always valid - taken from http://1.usa.gov/10IIuCr:
            "(...)The BLAST+ applications have a number of improvements that allow faster searches (...) These improvements include: splitting of longer queries so as to reduce the memory usage and to take advantage of modern CPU architectures; use of a database index to dramatically speed up the search; the ability to save a “search strategy” that can be used later to start a new search; and greater flexibility in the formatting of tabular results (...)."

            Have you used the built-in features along with multi-threading?
            What I would do is: Index db, split long queries internally, format output to be as simple as possible.

            Hope that helps.
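
One way to read "as simple as possible" is the tabular output (`-outfmt 6`), which is easy to post-process. The sketch below assumes the default twelve tabular columns and keeps the highest-scoring hit per query; the function name and the best-hit criterion are illustrative choices.

```python
import csv
from typing import Dict, Iterable

# Default column order of blast+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines: Iterable[str]) -> Dict[str, dict]:
    """Return the hit with the highest bit score for each query ID."""
    best: Dict[str, dict] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

The function accepts any iterable of lines, so an open file handle on the blast output works directly.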



            • #7
              Thanks for the input; it is overwhelming how much I need to learn. Before I got the MiSeq, I was using online BLAST with its pretty interface. Once NCBI kicked me out because of the sharply increased data volume, I had to make a local install. Now it is time to learn the nuts and bolts of the CLI. Thanks for the advice, maupb and mjp, it is time to digest the manual.

              if you have a gpu card you could see if it is faster than your 16 cores. I am getting about 1500 sequences processed a minute against the SEED database of 1.7 million sequences with an Nvidia M2075. I am still experimenting with the best parameters to optimize performance.
              Wow, that looks very promising! Could you enlighten me a bit more, as this is yet another set of nuts and bolts I have to learn on the fly. GPU - graphic processor unit, correct? Perhaps I could find and install one, but how do you plug it in to work with blast+, I presume instead of motherboard CPU?



              • #8
                Originally posted by yaximik

                Wow, that looks very promising! Could you enlighten me a bit more, as this is yet another set of nuts and bolts I have to learn on the fly. GPU - graphic processor unit, correct? Perhaps I could find and install one, but how do you plug it in to work with blast+, I presume instead of motherboard CPU?
                You are right about GPU. This is probably only going to work if you have your own machine (and are using some version of *unix).

                Nvidia has a primer on GPU computing available here: http://www.nvidia.com/object/what-is-gpu-computing.html

                A guide to using blast+ with GPUs is available: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

                If you use a compute cluster for analyses, the admins may not take kindly to adding GPUs to compute nodes (if your cluster uses blades, then that would not even be an option).

                There are dedicated nodes (that can be part of a cluster) with just GPUs. An example is described here: http://en.community.dell.com/techcen...wiki/2428.aspx



                • #9
                  Thanks, GenoMax, for the links. Yes, I do have a T610 (tower), which I administer myself, and preliminary consultations with Dell Tech support suggest that the latest K20x could fit in there. On the other hand, the nVidia web site indicates that both the M2050/2090 and K20 series can work with RHEL54, and the guys from LinuxQuestions think it should work with RHEL58 too. I need to talk to nVidia experts to find out whether there may be issues with PCI compatibility, unless this is standard and the main requirements are drivers. Making this work would be awesome.



                  • #10
                    I don't know blastx .
                    nt = nucleotides ?
                    but refseq_protein is amino acids ?!
                    how big is refseq_protein , can you post a sample so I try ?
                    10 out of 16000 or such then I estimate the time



                    • #11
                      I don't know blastx .
                      nt = nucleotides ?
                      but refseq_protein is amino acids ?!
                      how big is refseq_protein , can you post a sample so I try ?
                      10 out of 16000 or such then I estimate the time
                      blastx takes nucleotide sequences, translates them in 6 frames and compares them against a protein sequence database, in this case the blast-preformatted NCBI refseq_protein. 10 sequences are processed in a zip on my box...
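
To make the six-frame translation concrete, here is a small sketch using the standard genetic code. It only illustrates the idea; it is not how blastx is implemented, and the frame labels (+1 to +3 on the given strand, -1 to -3 on the reverse complement) are a simple convention chosen for this example.

```python
from itertools import product
from typing import Dict

# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(seq: str) -> Dict[str, str]:
    """Translate a nucleotide sequence in three forward and three reverse frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames: Dict[str, str] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Each of the six protein strings then has to be searched against the database, which is part of why blastx is so much heavier than a plain protein search.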



                      • #12
                        These ones?

                        ftp://ftp.ncbi.nlm.nih.gov/blast/db/

                        File: refseq_protein.00.tar.gz 828068 KB 31.12.2012 02:30:00
                        File: refseq_protein.01.tar.gz 713984 KB 31.12.2012 02:31:00
                        File: refseq_protein.02.tar.gz 696391 KB 31.12.2012 02:31:00
                        File: refseq_protein.03.tar.gz 673329 KB 31.12.2012 02:32:00
                        File: refseq_protein.04.tar.gz 679706 KB 31.12.2012 02:32:00
                        File: refseq_protein.05.tar.gz 590931 KB 31.12.2012 02:33:00

                        -------------------------------

                        so it translates your nt into amino acids and searches for
                        best matches with their > 10GB of proteins for each of the 16000 *6 ?

                        ----------------------------------------
                        ... reading : http://en.wikipedia.org/wiki/BLAST
                        (sounds complicated)
                        ------------------------------------
                        Last edited by gsgs; 01-02-2013, 04:24 PM.



                        • #13
                          Yep, this one. That is the task, which needs to be done for each sequence of ~300,000 seqs data set, totalling about 111 MB only. Fun, huh? Is there a better way to do this, as I do not have a nucleotide reference?
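
A back-of-the-envelope estimate of the full job, using only the figures from this thread (16,000 sequences took over 2 days; the full set is about 300,000 sequences). The `speedup` parameter is a hypothetical knob for an assumed gain from threads or parallel jobs; as noted earlier in the thread, real speed-up is not linear, so treat any value you plug in as optimistic.

```python
def estimate_days(n_queries: int,
                  observed_queries: int = 16_000,
                  observed_days: float = 2.0,
                  speedup: float = 1.0) -> float:
    """Scale the observed run time linearly with the number of queries."""
    return n_queries / observed_queries * observed_days / speedup


if __name__ == "__main__":
    # Single-threaded rate as observed; a lower bound, since the run took "over 2 days".
    print(estimate_days(300_000))
    # Assuming an ideal 16x speed-up, which is unlikely to be reached in practice.
    print(estimate_days(300_000, speedup=16))
```

At the observed rate the whole data set comes to roughly 37.5 days, which is why threading, batching, or a cluster matters here.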



                          • #14
                            seems to depend on the scoring matrix (BLOSUM62 ?)


                            -----------------------------------

                            Accelerated versions
                            CLC bio and SciEngines GmbH collaborate on an FPGA accelerator they claim will give 188x acceleration of BLAST.
                            TimeLogic offers another FPGA-accelerated implementation of the BLAST algorithm called Tera-BLAST.
                            The Mitrion-C Open Bio Project is an ongoing effort to port BLAST to run on Mitrion FPGAs.
                            The CUDA-BLASTP is a version of BLASTP that is GPU-accelerated and is claimed to run up to 10x faster than NCBI BLAST.

                            -------------------------------------------------
                            An extremely fast but considerably less sensitive alternative to BLAST is BLAT (Blast Like Alignment Tool). While BLAST does a linear search, BLAT relies on k-mer indexing the database, and can thus often find seeds faster.

                            Recent advances in sequencing technology have made searching for very similar nucleotide matches an important problem. New alignment programs tailored for this use typically use BWT-indexing of the target database (typically a genome). Input sequences can then be mapped very quickly, and output is typically in the form of a BAM file. Example alignment programs are BWA, SOAP, and Bowtie.

                            For protein identification, searching for known domains (for instance from Pfam) by matching with Hidden Markov Models is a popular alternative, such as Hmmer.
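
The k-mer indexing mentioned for BLAT above can be illustrated with a toy example: index every k-mer of the database once, then look up the query's k-mers to get seed positions without scanning the database. This is only a sketch of the general idea with made-up function names, not BLAT's actual implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(database: Dict[str, str],
                     k: int) -> Dict[str, List[Tuple[str, int]]]:
    """Map every k-mer to the (sequence name, position) pairs where it occurs."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def find_seeds(query: str,
               index: Dict[str, List[Tuple[str, int]]],
               k: int) -> List[Tuple[int, str, int]]:
    """Return (query position, sequence name, database position) for exact k-mer matches."""
    seeds: List[Tuple[int, str, int]] = []
    for q_pos in range(len(query) - k + 1):
        # .get avoids creating empty entries in the defaultdict.
        for name, db_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, name, db_pos))
    return seeds
```

The index costs memory and build time up front, but each query then needs only dictionary lookups to find its seeds.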

                            ------------------------------------------------------
                            Last edited by gsgs; 01-02-2013, 11:25 AM.



                            • #15
                              So have you simply tested the

                              -num_threads <Integer, >=1>
                              Number of threads (CPUs) to use in the BLAST search
                              Default = `1'

                              option as suggested? And how did it go?

                              Anyway, for blasting 300k sequences I would find a cluster somewhere and parallelize it on as many nodes as possible.
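
On a real cluster the scheduler would distribute the jobs, but the same pattern can be sketched locally: run several independent commands at once with a bounded number of workers. The function names are hypothetical, and the thread pool is only a stand-in for a cluster scheduler.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def run_in_parallel(commands: Sequence[Sequence[str]],
                    max_workers: int,
                    runner: Callable[[Sequence[str]], int] = run_command) -> List[int]:
    """Run commands concurrently, at most max_workers at a time.

    Results come back in the same order as the input commands.
    Threads are sufficient here because each worker just waits
    on an external process.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

With one blastx command per query batch, `max_workers` caps how many searches run simultaneously, which is where the RAM and database size limits come in.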
