SEQanswers

03-31-2015, 05:54 PM   #1
quokka
Junior Member

Location: oz

Join Date: Apr 2010
Posts: 9
TBLASTX with nt database and PBS PRO job scheduler

Hi, I'm wondering if someone might be able to help.

I'm attempting to do a TBLASTX search with 5 query sequences (each about 1 kb in size) against the nt database from NCBI (28990570 sequences).

The nt database (renamed nt_fa in this example) has been split up into 22 x 1GB segments and has a corresponding .nal file.

I'm using the PBS PRO job scheduler with a cluster at our university. I'm submitting the job as an array (which splits it up into 5 separate jobs).

The qsub command:

qsub -l select=1:ncpus=2:mem=8GB:NodeType=any -l walltime=72:00:00 -A sf-UQ -q workq -N BLAST /work1/xxzw/perl_blast_plus/output/job_submit.pbs

The PBS script (job_submit.pbs) is as follows:

#!/bin/bash -l
#PBS -S /bin/bash
#PBS -J 0-4
/work1/xxzw/perl_blast_plus/temp/${PBS_ARRAY_INDEX}.sh

The BLAST searches are executed from five different bash scripts (0.sh to 4.sh, one for each of the query sequences test_0.fa to test_4.fa), which have the general form:

#!/bin/bash
/work1/xxzw/perl_blast_plus/ncbi-blast-2.2.30+/bin/tblastx -query /work1/xxzw/perl_blast_plus/temp/test_0.fa -num_descriptions 1 -num_alignments 1 -evalue 0.01 -db /work1/xxzw/perl_blast_plus/database/nt_fa -out /work1/xxzw/perl_blast_plus/output/test_0_fa_tblastx_nt_fa.blast -word_size 3 -num_threads 8

Everything works OK if I use a smaller database (~5MB split up into five 1MB segments for testing purposes), but when I try to use the nt database I get five blast files that only contain the line:

TBLASTX 2.2.30+

...and nothing else happens, even after many hours!

When I terminate the job, no errors or clues are reported in the STDOUT or STDERR files for each job.

I've checked the .nal file for the nt_fa database and everything is fine. I've also remade the nt_fa database from the fasta files, with the same result.

It seems the issue is to do with the size of the nt_fa database. I've tried increasing the number of processors to 8 and the memory to 22GB in the qsub statement with no effect.

Any ideas what could be the problem?

I'm still quite new to using PBS PRO. This is basically a small scale test for future blasting of several thousand query sequences.

Any help would be appreciated.

03-31-2015, 06:23 PM   #2
GenoMax
Senior Member

Location: East Coast USA

Join Date: Feb 2008
Posts: 6,584

When you are in the investigative phase, dialing back the number of threads is prudent. Start with a couple of threads and at least 24G of RAM, and see if that job finishes. You want to keep all the threads on a single physical server, since having them spread across different physical servers is not a good thing (I am not sure if you are already asking PBS Pro to do that; I am an LSF/SGE user).

With TBLASTX I am not sure if 24G is going to be enough. Why not try a BLASTX against a protein database? What are you trying to map (metagenomic data)?

Last edited by GenoMax; 03-31-2015 at 06:27 PM.

04-01-2015, 01:25 AM   #3
quokka
Junior Member

Location: oz

Join Date: Apr 2010
Posts: 9

Ah, I let it run for a long time and eventually an insufficient memory error came up, so that was it.

It wasn't clear to me what the memory requirements for blast+ were (and whether having the database in segments would help reduce them). Now I understand a (tiny) bit better.

Yes, it's a metagenomics-based application, and yes, I think I'm going to have to adopt some alternative approaches to cut down the computing requirements.

Thanks for your insights GenoMax...much appreciated..

04-01-2015, 07:10 AM   #4
GenoMax
Senior Member

Location: East Coast USA

Join Date: Feb 2008
Posts: 6,584

I was able to get a tblastx search against nt to work with 18G of RAM and 4 threads, using a 2 kb query. That may be the best combination to start with. Hopefully that will work with your cluster and PBS Pro as well.
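
A minimal sketch of a job script that requests that combination (18 GB, 4 CPUs) and keeps -num_threads in step with what the scheduler grants. It assumes PBS Pro exports NCPUS inside the job (worth confirming on your cluster) and reuses the paths quoted earlier in the thread; the function name run_tblastx and the RUN_BLAST switch are made up for illustration. By default it only prints the command, so it can be checked before submitting.

```shell
#!/bin/bash
#PBS -l select=1:ncpus=4:mem=18gb
#PBS -l walltime=72:00:00
#PBS -J 0-4

# Paths reuse the layout quoted earlier in the thread.
BASE="${BASE:-/work1/xxzw/perl_blast_plus}"

# Build the tblastx command for query number $1 with $2 threads.
# Prints it (dry run) unless RUN_BLAST=1, in which case it is executed.
run_tblastx() {
    idx="$1"
    threads="$2"
    set -- "$BASE/ncbi-blast-2.2.30+/bin/tblastx" \
        -query "$BASE/temp/test_${idx}.fa" \
        -db "$BASE/database/nt_fa" \
        -evalue 0.01 -word_size 3 \
        -num_descriptions 1 -num_alignments 1 \
        -num_threads "$threads" \
        -out "$BASE/output/test_${idx}_fa_tblastx_nt_fa.blast"
    if [ "${RUN_BLAST:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# NCPUS is assumed to be set by PBS Pro; fall back to 4 outside a job.
run_tblastx "${PBS_ARRAY_INDEX:-0}" "${NCPUS:-4}"
```

Requesting the resources inside the script rather than on the qsub line keeps the thread count and the CPU request in one place, so they are less likely to drift apart.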

This thread has some discussion that you should consider while planning your searches: http://seqanswers.com/forums/showthread.php?t=49464

04-02-2015, 08:35 PM   #5
quokka
Junior Member

Location: oz

Join Date: Apr 2010
Posts: 9

Thanks very much for the info GenoMax, very helpful!

The queues on our cluster for machines with more than 24GB of physical memory (22GB available) are exceptionally long, so 18GB sounds good.

I'll give it a go this weekend and report back.

04-05-2015, 02:45 PM   #6
tomc
Member

Location: Oregon

Join Date: Feb 2011
Posts: 29

quokka, we do not know your cluster's file system or network setup, but in general you want the data and the search to be as close together as possible, and you want to reuse the data on hand as much as possible.

Note that your five 1 kb query sequences are insignificant compared with the 22 x 1GB BLAST database shards, so blasting all 5 sequences against whichever shard of nt_fa is in memory is preferable.

Blasting your sequences against each of the 22 shards can happen concurrently.

Being able to access the nt_fa database from a node is not the same as that data being local to the node. You may see a speedup by first copying the BLAST database to a disk on the node, perhaps a "scratch" disk (see your cluster documentation or ask your sysadmin about this).

In summary:

launch 22 jobs, each of which pulls one 1GB database shard (into local storage if possible, to avoid refetching it for the next query)
run a search for each of your query sequences against the local shard
combine the results
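
The steps above could be sketched as a 22-way PBS array job along these lines. Several things here are assumptions rather than facts from the thread: that the shard volumes are named nt_fa.00 to nt_fa.21 (the DBLIST line in the .nal file shows the real names), that TMPDIR points at node-local scratch, and that the poster's original 2 CPU / 8 GB request is a reasonable starting guess for a single shard. The helper names run, shard_name and search_shard are hypothetical. It prints the commands unless RUN_BLAST=1 is set.

```shell
#!/bin/bash
#PBS -l select=1:ncpus=2:mem=8gb
#PBS -l walltime=72:00:00
#PBS -J 0-21

# Paths reuse the layout quoted earlier in the thread.
BASE="${BASE:-/work1/xxzw/perl_blast_plus}"
# Node-local scratch directory; the right location is site-specific (assumption).
SCRATCH="${SCRATCH:-${TMPDIR:-/tmp}}"

# Print a command instead of running it unless RUN_BLAST=1 (dry run by default).
run() {
    if [ "${RUN_BLAST:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Volume name for shard number $1, assuming names nt_fa.00 ... nt_fa.21.
shard_name() {
    printf 'nt_fa.%02d' "$1"
}

# Copy one shard to local scratch, then search all five queries against it.
# Tabular output (-outfmt 6) is used so per-shard results are easy to merge.
search_shard() {
    shard=$(shard_name "$1")
    run cp "$BASE/database/$shard".n* "$SCRATCH/"
    for q in 0 1 2 3 4; do
        run "$BASE/ncbi-blast-2.2.30+/bin/tblastx" \
            -query "$BASE/temp/test_$q.fa" \
            -db "$SCRATCH/$shard" \
            -evalue 0.01 -word_size 3 -outfmt 6 -num_threads 2 \
            -out "$BASE/output/test_${q}_vs_${shard}.tsv"
    done
}

search_shard "${PBS_ARRAY_INDEX:-0}"
```

Each array element touches only one shard, so the copy to scratch happens once per job and is reused for all five queries.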

Last edited by tomc; 04-05-2015 at 02:56 PM.

12-01-2015, 09:15 PM   #7
quokka
Junior Member

Location: oz

Join Date: Apr 2010
Posts: 9

OK, it took me a long time before I was able to get back to this.

I essentially tried what tomc suggested. The problem I found was that the e-values were a bit different from what I was expecting, which complicates things when looking for a best blast hit.
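
One possible way around this when picking a best hit: the bit score, unlike the e-value, does not depend on the size of the database searched, so per-shard results can be ranked on it directly. A minimal sketch, assuming the per-shard searches were written in tabular format (-outfmt 6, where column 1 is the query ID and column 12 the bit score); the function name best_hits and the file names in the usage comment are hypothetical.

```shell
#!/bin/bash
# Read merged -outfmt 6 rows on stdin and print, for each query (column 1),
# the row with the highest bit score (column 12). Output is sorted by row text.
best_hits() {
    awk -F '\t' '
        !($1 in best) || $12 + 0 > best[$1] { best[$1] = $12 + 0; row[$1] = $0 }
        END { for (q in row) print row[q] }
    ' | sort
}

# Usage inside a real run (file names are hypothetical):
#   cat output/test_*_vs_nt_fa.*.tsv | best_hits > best_hits.tsv
```

Ties are resolved in favour of the first row seen, so the order in which shard files are concatenated decides between equal bit scores.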

I gather this discrepancy results from splitting up the database: each shard is smaller than the full database, and the database size is used in calculating the e-value.
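
For what it's worth, BLAST+ has a -dbsize option that sets the effective database length used in the statistics, so passing the length of the full nt database to every per-shard search should bring the e-values much closer to those of a single search (the total length is reported by blastdbcmd -db nt_fa -info; whether tblastx expects that exact figure is something to check). Results that already exist can be rescaled approximately, since the e-value is roughly proportional to database length. A sketch of that rescaling follows; the function name rescale_evalue is hypothetical and the lengths in the example call are placeholders, not measured values.

```shell
#!/bin/bash
# Rescale an e-value from a shard search to the full database, assuming the
# e-value grows linearly with database length (E is proportional to m * n).
# This ignores BLAST's effective-length corrections, so it is approximate.
#   $1 = e-value from the shard search
#   $2 = length of the shard
#   $3 = length of the full database
rescale_evalue() {
    awk -v e="$1" -v shard="$2" -v full="$3" \
        'BEGIN { printf "%.2e\n", e * full / shard }'
}

# Placeholder lengths: one shard out of 22 equally sized shards.
rescale_evalue 1e-5 1000000000 22000000000
```

Note that hits which only pass the -evalue cutoff against a small shard may fall above it after rescaling, so the cutoff should be re-applied to the rescaled values.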

mpiblast is meant to overcome this problem, but I ran into a different problem with that ( http://seqanswers.com/forums/showthread.php?t=49325 ).

More tweaking is required!
Tags
blast, cluster, nt database, pbs, tblastx
