  • SGE and ncbi-blast-2.2.28+

    Hello,

    I've predicted genes from metagenomic assemblies with FragGeneScan. The next step is to query the predicted peptides against NCBI's nr database. My cluster consists of sixteen Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, which adds up to 256 threads altogether. One option would be to use the '-num_threads' flag in blast. However, in my experience, this doesn't parallelize the task entirely.

    So what I'm going to do is run blast with SGE using the script below (after adapting it for blast-2.2.28+). Here's more info.

    Code:
    #!/bin/bash
    #
    #$ -cwd
    #$ -S /bin/bash
    #$ -j y
    
    export BLASTDB=/share/bio/ncbi/db/
    export BLASTMAT=/opt/Bio/ncbi/data/
    
    export PATH=$PATH:/opt/Bio/ncbi/bin
    
    blastall -d patnt -p blastn -i $HOME/test.txt -o $HOME/result.txt

    I have no previous experience with SGE (all I know is that it's set up on the cluster I'm using). So my question is: should I omit the '-num_threads' flag from my query entirely?
    savetherhino.org
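
For anyone adapting the legacy script above: in BLAST+ the program name replaces `blastall -p`, and the `-d`, `-i` and `-o` options become `-db`, `-query` and `-out`. The following is only a sketch of that mapping. The helper name `blastplus_cmd` and the thread-count argument are illustrative assumptions, not part of the original script, and the function only prints the command rather than running it.

```shell
#!/bin/bash
# Sketch: print the BLAST+ equivalent of a legacy call such as
#   blastall -d patnt -p blastn -i $HOME/test.txt -o $HOME/result.txt
# blastplus_cmd is a hypothetical helper; it does not run BLAST.
blastplus_cmd() {
    local program=$1   # e.g. blastn or blastp (was: blastall -p)
    local db=$2        # database name        (was: -d)
    local query=$3     # input FASTA          (was: -i)
    local out=$4       # output file          (was: -o)
    local threads=${5:-1}   # optional; defaults to a single thread
    printf '%s -db %s -query %s -out %s -num_threads %s\n' \
        "$program" "$db" "$query" "$out" "$threads"
}
```

Calling `blastplus_cmd blastn patnt "$HOME/test.txt" "$HOME/result.txt"` prints the command line so it can be checked before it goes into a job script.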

  • #2
    Have a look at this as another option: http://www3.imperial.ac.uk/bioinfsup...ing_array_jobs

    Even though you have 16 CPUs, how much memory do you have available for each? You may need about 10G per job if you are going to search against "nr".


    • #3
      Originally posted by GenoMax:
      Have a look at this as another option: http://www3.imperial.ac.uk/bioinfsup...ing_array_jobs

      Even though you have 16 CPUs, how much memory do you have available for each? You may need about 10G per job if you are going to search against "nr".

      Thanks, the link is very helpful. The cluster has 256G RAM. So, I suppose a good solution would be to run 16 independent tasks with 16 threads in each.

      • #4
        Originally posted by rhinoceros:
        Thanks, the link is very helpful. The cluster has 256G RAM. So, I suppose a good solution would be to run 16 independent tasks with 16 threads in each.

        What kind of a cluster is this?

        Most commodity clusters have nodes with a certain amount of local RAM (e.g. a cluster I access has blades with dual quad-core Xeon CPUs accessing 72GB of local RAM), and then there are clusters with "shared" memory access (e.g. NUMA). I have not seen clusters of the latter kind in common use of late.

        Is your cluster the latter type when you say that you have 256G RAM? Or do you actually have 256G RAM on each node (not completely unlikely nowadays)?

        Unless you are the only person using this cluster, you may not be able to spawn that many jobs simultaneously. Throughput will also depend on the type and speed of the storage.
        Last edited by GenoMax; 04-12-2013, 09:21 AM.


        • #5
          Originally posted by GenoMax:
          What kind of a cluster is this?

          Most commodity clusters have nodes with a certain amount of local RAM (e.g. a cluster I access has blades with dual quad-core Xeon CPUs accessing 72GB of local RAM), and then there are clusters with "shared" memory access (e.g. NUMA).

          Is your cluster the latter type when you say that you have 256G RAM? Or do you actually have 256G RAM on each node (not completely unlikely nowadays)?

          I'm not 100% sure, but I think the cluster consists of 16 Dell R620s, i.e. 16 GB RAM in each node.

          Code:
          cat /proc/meminfo
          MemTotal:       264635596 kB
          ..

          • #6
            Originally posted by rhinoceros:
            I'm not 100% sure, but I think the cluster consists of 16 Dell R620s, i.e. 16 GB RAM in each node.

            Code:
            cat /proc/meminfo
            MemTotal:       264635596 kB
            ..

            So you do have a cluster of the first type, and the cluster head node does seem to have 256GB RAM (assuming that is where you ran the cat command).

            Not sure if your sysadmins allow you to run jobs on the head node.

            If the worker nodes have only 16GB RAM each, then you will probably not be able to run more than one job per node (you could, but then the jobs will spill into swap/tmp and everything will be slow). I suggest experimenting with test jobs that request different amounts of memory to see if you can squeeze in two jobs per node.
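
The arithmetic behind that advice can be sketched as follows, using the figures quoted in this thread (16 GB per worker node, roughly 10G per job against "nr"). The function names and the job script name `blast_test.sh` are hypothetical, the estimate ignores memory used by the operating system, and the second function only prints candidate `qsub` lines rather than submitting anything.

```shell
#!/bin/bash
# Sketch: how many jobs fit on a node if each needs per_job_gb of RAM.
# Integer division; ignores OS overhead, so treat the result as an upper bound.
jobs_per_node() {
    local node_ram_gb=$1 per_job_gb=$2
    echo $(( node_ram_gb / per_job_gb ))
}

# Sketch: print (not run) one test submission per memory request to try.
# "$1" is a job script name (hypothetical here); remaining args are GB values.
print_test_submissions() {
    local script=$1
    shift
    local gb
    for gb in "$@"; do
        echo "qsub -l h_vmem=${gb}G ${script}"
    done
}
```

With these numbers, `jobs_per_node 16 10` gives 1 and `jobs_per_node 16 8` gives 2, so two jobs per node only become possible if a test run shows each job stays at or below about 8G. `print_test_submissions blast_test.sh 6 8 10` lists three submissions to compare.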


            • #7
              Hello again,

              Will the following result in 16 parallel instances of blast with each instance running 16 threads? Original input.fasta has been divided into 16 files named input.1 - input.16.

              qsub -t 1-16:1 blastp-sge.sh

              Code:
              #!/bin/bash
              #$ -N blastp
              #$ -j y
              #$ -cwd
              #$ -l h_vmem=2G -pe smp 8
              #$ -R y
              /path/to/ncbi-blast/2.2.28+/bin/blastp -query input.${SGE_TASK_ID} -db /path/to/db/nr -seg yes -soft_masking true -use_sw_tback -evalue 1e-5 -outfmt "6 qseqid sseqid sgi staxids pident length mismatch gapopen qstart qend sstart send evalue bitscore" -num_threads 16 -out ${SGE_TASK_ID}.tsv

              Output would be 1.tsv - 16.tsv, which could be merged easily. I'm having a particularly hard time understanding the '#$ -l h_vmem=2G -pe smp 8' line.
              Last edited by rhinoceros; 04-13-2013, 08:25 AM.

              • #8
                Originally posted by rhinoceros:
                I'm having a particularly hard time understanding the '#$ -l h_vmem=2G -pe smp 8' line.

                The h_vmem parameter sets a hard limit on the virtual memory allocated to the job. This page has info about this parameter: http://www.biostat.jhsph.edu/bit/clu...e.html#MemSpec

                The "pe" part refers to a parallel environment (if there is one set up on your cluster) and the number of slots requested from it. This relates to the "num_threads" part of your blast jobs, as described here: http://www3.imperial.ac.uk/bioinfsup..._parallel_jobs

                You may want to confer with your local SGE admin about the right parameters to set for the queues you have access to.
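
One way to keep the slot request and the thread count in step is to read the `NSLOTS` variable, which SGE sets inside a job to the number of slots granted through `-pe`. The sketch below assumes a parallel environment named `smp` (as in the script quoted above; the name on your cluster may differ) and assumes that `h_vmem` is accounted per slot, which is common but depends on how the admin configured it. The function names are hypothetical.

```shell
#!/bin/bash
#$ -N blastp
#$ -j y
#$ -cwd
#$ -pe smp 8        # PE name is site-specific; "smp" is taken from this thread
#$ -l h_vmem=2G     # assumed to be counted per slot (check with your admin)

# Number of threads to hand to blast: the slots SGE granted,
# or 1 when the script is run outside the scheduler.
threads_for_job() {
    echo "${NSLOTS:-1}"
}

# Total memory limit for the job under the per-slot assumption above.
total_vmem_gb() {
    local per_slot_gb=$1 slots=$2
    echo $(( per_slot_gb * slots ))
}

# The blast call would then use, for example:
#   blastp ... -num_threads "$(threads_for_job)"
```

Under that assumption, `total_vmem_gb 2 8` gives 16, i.e. `-l h_vmem=2G -pe smp 8` would allow the whole job about 16G, and asking blast for 16 threads while requesting 8 slots would oversubscribe the slots.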


                • #9
                  Everything is working now. My script blastp.sh is as follows:

                  Code:
                  #!/bin/bash
                  #$ -V
                  #$ -N blastp
                  #$ -j y
                  #$ -cwd
                  #$ -pe orte 16
                  /path/to/ncbi-blast/2.2.28+/bin/blastp -query input.${SGE_TASK_ID} -db /path/to/db/nr -lotsOfFlags -outfmt 6 -num_threads 16 -out ${SGE_TASK_ID}.tsv

                  The input is a fasta file that I have split into 20 parts with fastasplitn (input.1, input.2, .., input.20). I call the script from the same dir as follows: qsub -t 1-20:1 blastp.sh

                  So in this case I'm running 20 parallel blasts with 16 threads in each (though actually some of them are waiting in the queue). Output is 1.tsv, 2.tsv, .., 20.tsv, which I'll merge with

                  cat {1..20}.tsv > blast_result.tsv

                  And that's that. I hope others might find this useful.
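
For readers without fastasplitn, the split and merge steps can be approximated in plain bash and awk. This is a swapped-in technique, not what was used above: it deals whole FASTA records round-robin into `prefix.1` .. `prefix.N`, and it assumes the input starts with a `>` header line. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch: distribute FASTA records round-robin over N files named prefix.1 .. prefix.N.
# Assumes the first line of the input is a ">" header.
split_fasta() {
    local input=$1 parts=$2 prefix=$3
    awk -v n="$parts" -v p="$prefix" '
        /^>/ { idx = (count++ % n) + 1 }   # new record: pick the next output file
        { print > (p "." idx) }            # header and sequence lines go together
    ' "$input"
}

# Sketch: concatenate 1.tsv .. N.tsv in numeric order into one file.
merge_results() {
    local parts=$1 out=$2
    local i
    : > "$out"                             # start from an empty output file
    for (( i = 1; i <= parts; i++ )); do
        cat "${i}.tsv" >> "$out"
    done
}
```

Usage would be along the lines of `split_fasta input.fasta 20 input` before `qsub -t 1-20:1 blastp.sh`, and `merge_results 20 blast_result.tsv` once all tasks have finished.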