  • TBLASTX with nt database and PBS PRO job scheduler

    Hi, I'm wondering if someone might be able to help.

    I'm attempting to do a TBLASTX search with 5 query sequences (each ~1 kb in size) against the nt database from NCBI (28990570 sequences).

    The nt database (renamed nt_fa in this example) has been split up into 22 x 1GB segments and has a corresponding .nal file.

    I'm using the PBS PRO job scheduler with a cluster at our university. I'm submitting the job as an array (which splits it up into 5 separate jobs).

    The qsub command:

    qsub -l select=1:ncpus=2:mem=8GB:NodeType=any -l walltime=72:00:00 -A sf-UQ -q workq -N BLAST /work1/xxzw/perl_blast_plus/output/job_submit.pbs

    The PBS script (job_submit.pbs) is as follows:

    #!/bin/bash -l
    #PBS -S /bin/bash
    #PBS -J 0-4
    /work1/xxzw/perl_blast_plus/temp/${PBS_ARRAY_INDEX}.sh

    The BLAST searches are executed from five different bash scripts (0.sh to 4.sh, one for each of the five query sequences test_0.fa to test_4.fa), which have the general form:

    #!/bin/bash
    /work1/xxzw/perl_blast_plus/ncbi-blast-2.2.30+/bin/tblastx -query /work1/xxzw/perl_blast_plus/temp/test_0.fa -num_descriptions 1 -num_alignments 1 -evalue 0.01 -db /work1/xxzw/perl_blast_plus/database/nt_fa -out /work1/xxzw/perl_blast_plus/output/test_0_fa_tblastx_nt_fa.blast -word_size 3 -num_threads 8
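One possible simplification, sketched here rather than taken from the thread: a single script can use PBS_ARRAY_INDEX to pick the query file instead of keeping five near-identical scripts. It also makes it easier to keep -num_threads in step with the cores requested (the qsub line above asks for ncpus=2 while the script passes -num_threads 8). The function name build_tblastx_cmd is hypothetical, and the NCPUS variable is assumed to be set by the scheduler; the sketch only prints the command so it can be checked before anything is run.

```shell
#!/bin/bash
# Sketch only: build the tblastx command for one array task.
# Assumption: NCPUS is exported by the scheduler; otherwise fall back to 2,
# the ncpus value requested in the qsub line.
build_tblastx_cmd() {
    local idx="$1"
    local threads="${2:-${NCPUS:-2}}"
    local base=/work1/xxzw/perl_blast_plus
    # Print every word of the command followed by a space, then a newline.
    printf '%s ' \
        "$base/ncbi-blast-2.2.30+/bin/tblastx" \
        -query "$base/temp/test_${idx}.fa" \
        -db "$base/database/nt_fa" \
        -out "$base/output/test_${idx}_fa_tblastx_nt_fa.blast" \
        -evalue 0.01 -word_size 3 \
        -num_descriptions 1 -num_alignments 1 \
        -num_threads "$threads"
    printf '\n'
}

# With "#PBS -J 0-4", PBS_ARRAY_INDEX takes the values 0 to 4.
# This prints the command; pipe the output to bash to actually run it.
build_tblastx_cmd "${PBS_ARRAY_INDEX:-0}"
```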

    Everything works OK if I use a smaller database (~5 MB split up into five 1 MB segments for testing purposes), but when I try to use the nt database I get five BLAST output files that contain only the line:

    TBLASTX 2.2.30+

    ...and nothing else happens after many hours!

    When I terminate the job, no errors or clues are reported in the STDOUT or STDERR files for each job.

    I've checked the .nal files for the nt_fa database and everything is fine. I've also remade the nt_fa database from the FASTA files. Same thing.

    It seems the issue is to do with the size of the nt_fa database. I've tried increasing the number of processors to 8 and the memory to 22GB in the qsub statement with no effect.

    Any ideas what could be the problem?

    I'm still quite new to using PBS PRO. This is basically a small scale test for future blasting of several thousand query sequences.

    Any help would be appreciated.

  • #2
    When you are in the investigative phase, dialing back the number of threads is prudent. Start with a couple of threads and at least 24G of RAM, and see if that job finishes. You want to keep all the threads on one physical server, since having them spread across different physical servers is not a good thing (I am not sure if you are already asking PBS Pro to do that; I am an LSF/SGE user).

    With TBLASTX I am not sure if 24G is going to be enough. Why not try a BLASTX against a protein database? What are you trying to map (metagenomic data)?
    Last edited by GenoMax; 03-31-2015, 05:27 PM.

    • #3
      Ah... I let it run for a long time and eventually an insufficient memory error came up, so that was it.

      It wasn't clear to me what the memory requirements for blast+ were (and whether having the database in segments would help reduce them). Now I understand a (tiny) bit better.

      Yes, it's a metagenomics-based application, and yes, I think I'm going to have to adopt some alternative approaches to cut down the computing requirements.

      Thanks for your insights GenoMax...much appreciated..

      • #4
        Was able to get a tblastx search to work against nt with 18G of RAM (and 4 threads) with a 2 kb query. That may be the best combination to start. Hopefully that will work with your cluster and PBS pro as well.
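As a rough sketch of how that combination might be requested under PBS Pro: the helper below prints a qsub line in which the core count and the BLAST thread count come from the same value, so they cannot drift apart. A single chunk (select=1) keeps the requested cores on one host. The helper name qsub_line and the BLAST_THREADS variable are hypothetical (the job script would have to read that variable itself), and the site-specific options from the original qsub command (-A, -q, -N, NodeType) are left out and would need adding back.

```shell
#!/bin/bash
# Sketch: keep the cores requested from PBS Pro and the threads given to BLAST in step.
# BLAST_THREADS is a hypothetical variable name passed to the job with qsub -v.
qsub_line() {
    local threads="$1" mem_gb="$2" script="$3"
    printf 'qsub -l select=1:ncpus=%s:mem=%sGB -l walltime=72:00:00 -v BLAST_THREADS=%s %s\n' \
        "$threads" "$mem_gb" "$threads" "$script"
}

# 4 threads and 18 GB, the combination reported to work above.
qsub_line 4 18 /work1/xxzw/perl_blast_plus/output/job_submit.pbs
```

Inside the job script, tblastx would then be called with -num_threads "$BLAST_THREADS".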

        This thread has some discussion that you should consider while planning your searches: http://seqanswers.com/forums/showthread.php?t=49464

        • #5
          Thanks very much for the info GenoMax, very helpful!

          The queues on our cluster for machines with >24GB physical memory (22GB available) are exceptionally long, so 18GB sounds good.

          I'll give it a go this weekend and report back.

          • #6
            quokka, we do not know your cluster's file system or network setup, but in general you want the data and the search to be as close together as possible, and you want to reuse the data on hand as much as possible.

            Note that your five 1 kb query sequences are insignificant compared with the 22 x 1GB BLAST database segments,
            so blasting all 5 sequences against whichever shard of nt_fa is in memory is preferable.

            Blasting your sequences against each of the 22 shards can happen concurrently.

            Being able to access the nt_fa database from a node is not the same as that data being local to the node. This may mean you will see a speed-up by first copying the BLAST database to a disk on the node, perhaps a "scratch" disk (see your cluster documentation or ask your sysadmin about this).

            In summary:

            launch 22 jobs, each of which pulls a 1GB database shard
            (into local storage if possible, to avoid refetching it for the next query)
            run a search for each of your query sequences against the local shard
            combine the results
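The steps above could be sketched as a PBS array job with one task per shard (for example "#PBS -J 0-21"). This is a dry run that only prints the commands it would execute. Several details are assumptions rather than facts from this thread: the volume names (nt_fa.00 to nt_fa.21), the use of $TMPDIR as node-local scratch, the tabular -outfmt 6 output, and the function names.

```shell
#!/bin/bash
# Sketch: one array task per database shard; prints commands instead of running them.
# Assumptions: volumes are named nt_fa.00 .. nt_fa.21 and $TMPDIR is node-local scratch.
BASE=/work1/xxzw/perl_blast_plus

# Map an array index (0-21) to a zero-padded volume name.
shard_name() {
    printf 'nt_fa.%02d\n' "$1"
}

plan_shard_task() {
    local shard scratch q
    shard=$(shard_name "$1")
    scratch="${TMPDIR:-/tmp}/blastdb"
    # Step 1: pull the shard's files (.nhr/.nin/.nsq) onto local storage once.
    echo "mkdir -p $scratch"
    echo "cp $BASE/database/$shard.n* $scratch/"
    # Step 2: run every query against the local shard.
    for q in 0 1 2 3 4; do
        echo "$BASE/ncbi-blast-2.2.30+/bin/tblastx -query $BASE/temp/test_$q.fa -db $scratch/$shard -evalue 0.01 -outfmt 6 -out $BASE/output/test_${q}_$shard.tsv"
    done
}

plan_shard_task "${PBS_ARRAY_INDEX:-0}"
```

Combining the results would then be a separate step after all 22 tasks finish, for instance concatenating the per-shard files for each query and sorting by bit score.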
            Last edited by tomc; 04-05-2015, 01:56 PM.

            • #7
              OK, it took me a long time before I was able to get back to this.

              I essentially tried what tomc suggested. The problem I found was that the e-values were a bit different from what I was expecting, which complicates things when looking for a best BLAST hit.

              I gather this discrepancy results from splitting up the database, which changes the database size used in calculating the e-value.
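To illustrate the effect: the e-value scales roughly linearly with the size of the search space, so a hit found against one of 22 equal shards is reported with an e-value about 22 times smaller than the same hit against the whole database, while the bit score does not depend on database size. A minimal sketch of rescaling after the fact is below. It assumes tabular output (-outfmt 6, where the e-value is column 11) and ignores the edge-effect corrections BLAST applies to effective lengths, so the result is approximate. The function name rescale_evalues and the sizes passed to it are placeholders.

```shell
#!/bin/bash
# Sketch: rescale shard e-values (column 11 of -outfmt 6) to the full database size.
# Usage: rescale_evalues FULL_SIZE SHARD_SIZE < shard_hits.tsv
# Sizes are total database lengths in the same units; only their ratio matters.
rescale_evalues() {
    awk -v full="$1" -v shard="$2" 'BEGIN { FS = OFS = "\t" }
        { $11 = sprintf("%.2e", $11 * full / shard); print }'
}

# Example: one hit from a shard that is 1/22 of the full database.
printf 'q1\ts1\t90\t100\t0\t0\t1\t100\t1\t100\t1e-10\t200\n' | rescale_evalues 22 1
```

blast+ programs also have a -dbsize option for setting the effective database length at search time, which may be the simpler route; check tblastx -help for your version. Ranking candidate best hits by bit score rather than e-value is another way to compare hits across shards.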

              mpiBLAST is meant to overcome this problem, but I ran into a different problem with that ( http://seqanswers.com/forums/showthread.php?t=49325 ).

              ...more tweaking is required!
