SEQanswers


quokka 03-31-2015 04:54 PM

TBLASTX with nt database and PBS PRO job scheduler
 
Hi, I'm wondering if someone might be able to help.

I'm attempting to do a TBLASTX search with 5 query sequences (each ~1 kb in size) against the nt database from NCBI (28,990,570 sequences).

The nt database (renamed nt_fa in this example) has been split up into 22 x 1GB segments and has a corresponding .nal file.
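
(For reference, a .nal alias file is just a small text file that ties the segments together; it generally looks something like the following, with one entry per segment on the DBLIST line -- the segment names here are only illustrative:)

TITLE nt_fa
DBLIST nt_fa.00 nt_fa.01 nt_fa.02 ... nt_fa.21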

I'm using the PBS PRO job scheduler with a cluster at our university. I'm submitting the job as an array (which splits it up into 5 separate jobs).

The qsub command:

qsub -l select=1:ncpus=2:mem=8GB:NodeType=any -l walltime=72:00:00 -A sf-UQ -q workq -N BLAST /work1/xxzw/perl_blast_plus/output/job_submit.pbs

The PBS script (job_submit.pbs) is as follows:

#!/bin/bash -l
#PBS -S /bin/bash
#PBS -J 0-4
/work1/xxzw/perl_blast_plus/temp/${PBS_ARRAY_INDEX}.sh

The blast searches are executed from five different bash scripts (0.sh to 4.sh, for the five query sequences test_0.fa to test_4.fa), which have the general form:

#!/bin/bash
/work1/xxzw/perl_blast_plus/ncbi-blast-2.2.30+/bin/tblastx -query /work1/xxzw/perl_blast_plus/temp/test_0.fa -num_descriptions 1 -num_alignments 1 -evalue 0.01 -db /work1/xxzw/perl_blast_plus/database/nt_fa -out /work1/xxzw/perl_blast_plus/output/test_0_fa_tblastx_nt_fa.blast -word_size 3 -num_threads 8

Everything works OK if I use a smaller database (~5 MB, split up into five 1 MB segments for testing purposes), but when I try to use the nt database I get five blast files that only contain the line:

TBLASTX 2.2.30+

...and nothing else happens after many hours!

When I terminate the job....no errors or clues are reported in the STDOUT/STDERR files for each job.

I've checked the .nal file for the nt_fa database.....everything is fine. I've remade the nt_fa database from the fasta files. Same thing.
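
(For anyone following along: rebuilding a nucleotide database from fasta is done with makeblastdb from the BLAST+ package; a typical per-segment command looks something like the one below -- the input filename is just a placeholder:)

makeblastdb -in nt_fa_00.fasta -dbtype nucl -out nt_fa.00 -title "nt_fa segment 00"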

It seems the issue has to do with the size of the nt_fa database. I've tried increasing the number of processors to 8 and the memory to 22GB in the qsub statement, with no effect.

Any ideas what could be the problem?

I'm still quite new to using PBS PRO. This is basically a small-scale test for future blasting of several thousand query sequences.

Any help would be appreciated.

GenoMax 03-31-2015 05:23 PM

When you are in the investigative phase, dialing back the number of threads is prudent. Start with a couple of threads and at least 24G of RAM, and see if that job finishes. You would want to keep all the threads on a single physical server, since having them spread across different physical servers is not a good thing (I am not sure if you are already asking PBS Pro to do that; I am an LSF/SGE user).
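
Something along these lines, adapting your own qsub command (only a sketch, since I can't test PBS Pro myself; select=1 should already keep all the cpus on one physical node):

qsub -l select=1:ncpus=2:mem=24GB:NodeType=any -l walltime=72:00:00 -A sf-UQ -q workq -N BLAST /work1/xxzw/perl_blast_plus/output/job_submit.pbs

...and drop -num_threads in the 0.sh to 4.sh scripts from 8 down to 2, so the blast threads match what PBS has actually allocated.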

With TBLASTX I am not sure if 24G is going to be enough. Why not try a BLASTX against a protein database instead? What are you trying to map (metagenomic data)?

quokka 04-01-2015 12:25 AM

Ah.....I let it run for a long time and eventually an insufficient-memory error came up.....so that was it....

It wasn't clear to me what the memory requirements for blast+ were (or whether having the database in segments would help reduce them).....now I understand a (tiny) bit better....

Yes...it's a metagenomics-based application....and yes...I think I'm going to have to adopt some alternative approaches to cut down the computing requirements....

Thanks for your insights, GenoMax...much appreciated.

GenoMax 04-01-2015 06:10 AM

I was able to get a tblastx search to work against nt with 18G of RAM (and 4 threads) using a 2 kb query. That may be the best combination to start with. Hopefully that will work with your cluster and PBS Pro as well.
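
In your setup that would be something along the lines of the following (again only a sketch; adjust to your cluster):

qsub -l select=1:ncpus=4:mem=18GB:NodeType=any -l walltime=72:00:00 -A sf-UQ -q workq -N BLAST /work1/xxzw/perl_blast_plus/output/job_submit.pbs

...with -num_threads 4 in the per-query scripts to match ncpus.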

This thread has some discussion that you should consider while planning your searches: http://seqanswers.com/forums/showthread.php?t=49464

quokka 04-02-2015 07:35 PM

Thanks very much for the info, GenoMax....very helpful!

The queues on our cluster for machines with >24GB physical memory (22GB available) are exceptionally long.....so 18GB sounds good....

...I'll give it a go this weekend and report back.

tomc 04-05-2015 01:45 PM

quokka, we don't know your cluster's file system or network setup, but in general you want the data and the search to be as close together as possible, and you want to reuse the data you have on hand as much as possible.

Note that your five ~1 kb query sequences are insignificant compared with the 22 x 1GB blast database shards, so blasting all 5 sequences against whichever shard of nt_fa is already in memory is preferable.

Blasting your sequences against each of the 22 shards can happen concurrently.

Being able to access the nt_fa database from a node is not the same as that data being local to the node. This means you may see a speed-up by first copying the blast database to a disk on the node, perhaps a "scratch" disk (see your cluster documentation or ask your sysadmin about this).

In summary:

1. launch 22 jobs, each of which pulls one 1GB database shard
   (into local storage if possible, to avoid re-fetching it for the next query)
2. run a search for each of your query sequences against the local shard
3. combine the results
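
A rough sketch of what that could look like as a PBS array script (untested; the shard names nt_fa.00 ... nt_fa.21 and the node-local $TMPDIR scratch area are assumptions -- substitute whatever your cluster actually provides):

#!/bin/bash
#PBS -S /bin/bash
#PBS -J 0-21

# One sub-job per database shard; TMPDIR is assumed to point at node-local scratch.
SHARD=$(printf "nt_fa.%02d" ${PBS_ARRAY_INDEX})
DB_SRC=/work1/xxzw/perl_blast_plus/database
BLAST=/work1/xxzw/perl_blast_plus/ncbi-blast-2.2.30+/bin/tblastx
OUT=/work1/xxzw/perl_blast_plus/output

# Copy this shard's index files onto local scratch once.
cp ${DB_SRC}/${SHARD}.* ${TMPDIR}/

# Run all five queries against the local copy of the shard
# (match -num_threads to the ncpus requested in qsub).
for i in 0 1 2 3 4; do
    ${BLAST} -query /work1/xxzw/perl_blast_plus/temp/test_${i}.fa \
        -db ${TMPDIR}/${SHARD} \
        -num_descriptions 1 -num_alignments 1 -evalue 0.01 -num_threads 2 \
        -out ${OUT}/test_${i}_vs_${SHARD}.blast
done

The 22 per-shard outputs for each query then still have to be combined into a single report afterwards.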

quokka 12-01-2015 08:15 PM

OK...it took me a long time before I was able to get back to this......

I essentially tried what tomc suggested. The problem I found was that the e-values were a bit different from what I was expecting....which complicates things when looking for the best blast hit....

I gather this discrepancy results from splitting up the database...each search sees only the size of its own shard rather than that of the whole nt...and the database size goes into the calculation of the e-value...
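
One workaround I've come across (untested on my end) is the blast+ -dbsize option, which sets the effective database size by hand.....passing the size of the full nt database to every per-shard search should put the e-values back on a comparable footing. Something like the following, where the number is just a placeholder (the real total can be read off blastdbcmd -info for the full database):

/work1/xxzw/perl_blast_plus/ncbi-blast-2.2.30+/bin/tblastx -query test_0.fa -db nt_fa.00 -dbsize 60000000000 -evalue 0.01 -out test_0_vs_nt_fa.00.blast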

mpiblast is meant to overcome this problem....but I ran into a different problem with that ( http://seqanswers.com/forums/showthread.php?t=49325 )

......more tweaking is required!

