  • Running Blast+ on multiple nodes on a cluster -- what is the best way to do that?

    Hi, I have been recently granted access to the HPC cluster of my university. I am going to run several blastx searches (Blast+ version, not legacy blast) there to identify potential virulence factors and toxins in Illumina metagenomic datasets.

    The cluster I will be using has the following characteristics:
    Nodes: 112 Dell R410 (quad-core Xeons, 8 threads) with 24 GB RAM each. I can use up to 6 nodes at once.
    OS: RHEL v 5
    Queuing system: Torque (PBS)

    The problem is, I am a molecular biologist with no formal bioinformatics training and absolutely no previous experience with HPC clusters. I am also the first one to use this cluster for biology-related computations, and, as it has been used only by physicists and mathematicians so far, IT guys are unable to help me with my questions.

    So I would like to ask people with more knowledge on that topic, what would be the best way to run my blast searches? As far as I understood from reading other posts (http://seqanswers.com/forums/showthread.php?t=29760 and http://seqanswers.com/forums/showthread.php?t=40048) and blast+ documentation, blast+ does support multithreading, but has no built-in means to parallelise runs on different CPUs/PCs/nodes. Should I split my fasta files, run 6 independent 8-threaded instances of blast search on 6 nodes, and combine blast outputs in the end?

    On a side note, I would be very grateful if someone could recommend a short introduction to HPC computing for biologists, so I don't have to bother busy people with newbie questions any longer.

  • #2
    If you had an input query FASTA file of (for example) 1000 query sequences, then I would split this into several separate FASTA files (e.g. ten files of 100 sequences each), and submit them to the cluster as ten jobs, and then combine the BLAST output.

    Each BLAST job could be set to ask for a single machine with 8 threads. There is some flexibility here - while BLAST does get faster with more threads, the scaling is not perfect - so it might be faster overall to use four threads per BLAST job (meaning that on your cluster two BLAST jobs could run on a node at the same time - fine if you have enough RAM).
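    For illustration only (the file names below are placeholders, not from the thread), the splitting and the final merge could be done from the shell along these lines, assuming standard awk and cat are available:

    Code:
    # split queries.fa into chunks of 100 sequences each: chunk_0.fa, chunk_1.fa, ...
    awk 'BEGIN {n = -1}
         /^>/  {if (c % 100 == 0) {n++; out = "chunk_" n ".fa"}; c++}
               {print > out}' queries.fa

    # tabular (-outfmt 6) BLAST output has no headers, so once all jobs have
    # finished, their per-chunk results can simply be concatenated
    # (assuming each job wrote its output to results_chunk_N.tsv):
    cat results_chunk_*.tsv > results_all.tsv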



    • #3
      maubp, thank you for your suggestion, I will try it tomorrow.

      Just wanted to clarify: when you mentioned two BLAST jobs running at the same time, did you mean that they would be running on the same node simultaneously? So, if I have 6 nodes with 8 threads each and I submit 12 BLAST jobs with 4 threads each, there would be 12 independent BLAST instances (jobs) running in parallel, assuming that memory is not a problem?



      • #4
        Yes - assuming your limit is really six machines at once. I'm not familiar enough with Torque/PBS to really guess, but it could be you are limited to six active jobs at once?



        • #5
          Thank you for the clarification; you are probably right about the job limit, but I am not really sure about that. I was told that I can use 48 threads per job at most, so all the rest are just my guesses, and I might be completely wrong.



          • #6
            With 24 GB of RAM per node you should not start too many threads. The size of the database you are searching against will determine the memory footprint here, and blastx searches are compute-intensive as it is.

            I would recommend that you run one or two exploratory jobs (start with 4 and 8 threads, keeping all threads of a job on the same node) and allocate the maximum RAM you are allowed to use (with 24 GB of physical RAM you can probably use 20-22 GB at most for the job, provided nothing else is running on the node), then check how much RAM is actually used by the job in the log. Depending on the results you can then decide on the number of threads to use per node.
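            A minimal sketch of how that check could be done (database, query, and job-id values are placeholders; GNU time and Torque's qstat are assumed to be available, and the exact fields reported can differ between installations):

            Code:
            # inside the exploratory job script: record the peak memory of the blastx call
            /usr/bin/time -v -o time.log blastx -db your_db -query test_chunk.fa -num_threads 4 -out test.out
            grep "Maximum resident set size" time.log

            # alternatively, ask Torque what the job consumed (while it runs or after it finishes)
            qstat -f <jobid> | grep resources_used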



            • #7
              Thank you for your suggestions as well, GenoMax. I made a script (virulence.sh) for Torque which submits 6 blastx jobs, one per node, with 4 threads each (24 threads in total):

              Code:
              #!/bin/bash
              
              
              #Setting Torque parameters
              #PBS -N vir_blast
              #PBS -j oe
              #PBS -m abe
              #PBS -M [email protected]
              #PBS -q main_queue
              #PBS -l mem=22000mb
              #PBS -l nodes=1:ppn=4
              #PBS -t 0-5
              
              #Loading modules
              module add shared
              module add torque
              module add blastx
              
              
              #Executing commands
              
              cd $PBS_O_WORKDIR
              
              #Each blastx instance consists of one main thread and 'k' working threads, whose number is specified by the '-num_threads' parameter
              #Thus, to use 4 CPU threads per node '-num_threads' is set to 3 here (1 main and 3 worker blastx threads will be created)

              blastx -db virDB -query ./meta_chunk_${PBS_ARRAYID}.fa -evalue 1e-5 -num_threads 3 -outfmt 6 -out ./results_chunk_${PBS_ARRAYID}.fm6
              The script will be submitted with the 'qsub virulence.sh' command.

              Could someone take a look at the script and tell me if it looks fine?

              I am still trying to comprehend how queueing actually works. Let's assume I have an idle cluster with 6 nodes and 8 cores each, so 48 cores in total. If I request 12 independent jobs with 4 cores per job (#PBS -l nodes=1:ppn=4, #PBS -t 0-11), what would happen? Will all my jobs run simultaneously on the cluster, with 2 jobs running on each node, or will only 6 jobs be started, with the other 6 waiting in the queue?
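              A small sketch of how the placement could be observed after submission, assuming the qstat flags below are supported by the local Torque version:

              Code:
              qsub virulence.sh        # submit the array job
              qstat -t -u $USER        # list each array task and its state (Q = queued, R = running)
              qstat -n -t -u $USER     # also show which node each running task was assigned to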



              • #8
                Originally posted by TauOvermind View Post
                 I am still trying to comprehend how queueing actually works. Let's assume I have an idle cluster with 6 nodes and 8 cores each, so 48 cores in total. If I request 12 independent jobs with 4 cores per job (#PBS -l nodes=1:ppn=4, #PBS -t 0-11), what would happen? Will all my jobs run simultaneously on the cluster, with 2 jobs running on each node, or will only 6 jobs be started, with the other 6 waiting in the queue?
                 If you only consider job slots then technically your 12 independent jobs will start at the same time, if all 48 cores are idle. But in that case we are ignoring the other requests you are making, e.g. if you request 22 GB of memory for each job then only one of the jobs can run on a node at a given time, considering your nodes have 24 GB of RAM. So a job scheduler takes into account the combination of resources you request, matches it with what you have access to / are allowed to use based on the local "fair share" policy, and ultimately with the current load status of the cluster (how busy the nodes are, whether all job slots are full, etc.).
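                 To make that concrete (the numbers below are only an illustration, not a recommendation): two 4-thread jobs can only be packed onto one 24 GB node if each of them asks for less than half the node's memory, e.g.:

                 Code:
                 #PBS -l nodes=1:ppn=4
                 #PBS -l mem=11gb
                 # two such jobs fit on an 8-core / 24 GB node;
                 # a job requesting mem=22gb effectively takes the whole node for itself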

                Here is an example page of how PBS is used: http://arc.research.umich.edu/flux-a...ux/pbs-basics/
                Last edited by GenoMax; 01-11-2015, 05:28 PM.



                • #9
                  BTW: I don't know why you are referring to a main thread and working threads in your script. Here is an example of a PBS script that starts with "n" CPUs and an equal number of threads: http://swes.cals.arizona.edu/maier_l...ome/docs/blast



                  • #10
                    Thank you very much for both your explanations and the links you provided, GenoMax. I have spent a lot of time today googling for a good example of a BLAST+ script for a cluster with Torque/PBS, but hadn't seen the second page. I read about the main and working threads of BLAST+ here, but now I am confused.


                    I will try to run a test job tomorrow.



                    • #11
                      I don't use PBS but it does appear that the information at the link you provided indicates that a core is needed for the "main" job. Try with n = CPU = threads first and see what happens.
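                      Sketched out, and assuming the rest of virulence.sh stays unchanged, that suggestion amounts to matching -num_threads to the number of requested cores:

                      Code:
                      #PBS -l nodes=1:ppn=4

                      # one blastx thread per requested core, rather than ppn minus one
                      blastx -db virDB -query ./meta_chunk_${PBS_ARRAYID}.fa -evalue 1e-5 -num_threads 4 -outfmt 6 -out ./results_chunk_${PBS_ARRAYID}.fm6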

