SEQanswers

Old 04-12-2013, 05:49 AM   #1
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Question SGE and ncbi-blast-2.2.28+

Hello,

I've predicted genes from metagenomic assemblies with FragGeneScan. The next step is to query the predicted peptides against NCBI's nr database. My cluster consists of sixteen Intel(R) Xeon(R) E5-2670 CPUs @ 2.60GHz, which makes 256 hardware threads altogether. One option would be to use the '-num_threads' flag in blast. However, in my experience this doesn't parallelize the whole task; only part of the search runs multithreaded, so it won't keep all threads busy.

So what I'm going to do is run blast under SGE using the script below (after modifying it suitably for blast-2.2.28+). Here's more info.

Code:
#!/bin/bash
#
#$ -cwd
#$ -S /bin/bash
#$ -j y

export BLASTDB=/share/bio/ncbi/db/
export BLASTMAT=/opt/Bio/ncbi/data/

export PATH=$PATH:/opt/Bio/ncbi/bin

blastall -d patnt -p blastn -i $HOME/test.txt -o $HOME/result.txt
I have no previous experience with SGE (all I know is that it's set up on the cluster I'm using). So my question is: should I omit the '-num_threads' flag from my query entirely?
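For reference, a rough blast+ equivalent of the blastall script above might look like the following. This is only a sketch: the install path is a placeholder, and the mapping assumed here is that `blastall -p blastn` becomes the `blastn` binary, with `-d`/`-i`/`-o` becoming `-db`/`-query`/`-out`. BLASTMAT is no longer needed because blast+ ships with the scoring matrices built in.

```shell
#!/bin/bash
#
#$ -cwd
#$ -S /bin/bash
#$ -j y

# BLASTDB tells blast+ where to find the formatted databases
export BLASTDB=/share/bio/ncbi/db/

# placeholder install path for the blast+ binaries
export PATH=$PATH:/path/to/ncbi-blast-2.2.28+/bin

# blastall -d/-p/-i/-o maps to the per-program binary with -db/-query/-out
blastn -db patnt -query $HOME/test.txt -out $HOME/result.txt
```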
Old 04-12-2013, 06:04 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Have a look at this as another option: http://www3.imperial.ac.uk/bioinfsup...ing_array_jobs

Even though you have 16 CPUs, how much memory do you have available for each? You may need roughly 10G per job if you are going to search against "nr".
Old 04-12-2013, 10:02 AM   #3
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Quote:
Originally Posted by GenoMax View Post
Have a look at this as another option: http://www3.imperial.ac.uk/bioinfsup...ing_array_jobs

Even though you have 16 CPUs, how much memory do you have available for each? You may need roughly 10G per job if you are going to search against "nr".
Thanks, the link is very helpful. The cluster has 256G RAM. So, I suppose a good solution would be to run 16 independent tasks with 16 threads in each.
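The splitting step itself only takes a line of awk. A minimal sketch, assuming a standard multi-FASTA input and round-robin assignment of whole records to 16 chunks (the demo input below is a toy stand-in for the real predicted peptides):

```shell
# demo input: 32 tiny records standing in for the real peptide FASTA
for i in $(seq 1 32); do printf '>seq%s\nMKV\n' "$i"; done > input.fasta

# split into input.1 .. input.16, assigning whole FASTA records
# round-robin so the chunks end up roughly the same size
awk -v n=16 '/^>/{f=(c++ % n)+1} {print > ("input." f)}' input.fasta
```

Round-robin by record keeps each chunk balanced by sequence count; if sequence lengths vary a lot, splitting by residues would balance runtime better.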
Old 04-12-2013, 10:17 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Quote:
Originally Posted by rhinoceros View Post
Thanks, the link is very helpful. The cluster has 256G RAM. So, I suppose a good solution would be to run 16 independent tasks with 16 threads in each.
What kind of a cluster is this?

Most commodity clusters have nodes with a certain amount of local RAM (e.g. on a cluster I access there are blades with dual quad-core Xeon CPUs accessing 72GB of local RAM), and then there are clusters with "shared" memory access (e.g. NUMA). I have not seen clusters of the latter kind in common use of late.

Is your cluster the latter type when you say that you have 256G RAM? Or do you actually have 256G RAM on each node (not completely unlikely nowadays)?

Unless you are the only person using this cluster, you may not be able to spawn off that many jobs simultaneously. There will also be some dependence on the type/speed of storage.

Last edited by GenoMax; 04-12-2013 at 10:21 AM.
Old 04-12-2013, 10:37 AM   #5
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Quote:
Originally Posted by GenoMax View Post
What kind of a cluster is this?

Most commodity clusters have nodes with a certain amount of local RAM (e.g. on a cluster I access there are blades with dual quad-core Xeon CPUs accessing 72GB of local RAM), and then there are clusters with "shared" memory access (e.g. NUMA).

Is your cluster the latter type when you say that you have 256G RAM? Or do you actually have 256G RAM on each node (not completely unlikely nowadays)?
I'm not 100% sure, but I think the cluster consists of 16 Dell R620s, i.e. 16 GB of RAM in each node...

Code:
cat /proc/meminfo
MemTotal:       264635596 kB
..
Old 04-12-2013, 10:54 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Quote:
Originally Posted by rhinoceros View Post
I'm not 100% sure, but I think the cluster consists of 16 Dell R620's, i.e. 16 GB RAM in each node..

Code:
cat /proc/meminfo
MemTotal:       264635596 kB
..
So you do have a cluster of the first type, and the head-node does seem to have 256GB RAM (assuming that is where you ran the cat command).

Not sure if your sys admins allow you to run jobs on the head-node...

If the worker nodes have only 16GB RAM each, then you are probably not going to be able to run more than one job per node (you could, but then things will spill into swap/tmp and everything will be slow). I suggest experimenting with test jobs, allocating different amounts of memory, to see if you could squeeze two jobs onto a node.
Old 04-13-2013, 08:36 AM   #7
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Hello again,

Will the following result in 16 parallel instances of blast, with each instance running 16 threads? The original input.fasta has been divided into 16 files named input.1 - input.16.

qsub -t 1-16:1 blastp-sge.sh

Code:
#!/bin/bash
#$ -N blastp
#$ -j y
#$ -cwd
#$ -l h_vmem=2G -pe smp 8
#$ -R y
/path/to/ncbi-blast/2.2.28+/bin/blastp -query input.${SGE_TASK_ID} -db /path/to/db/nr -seg yes -soft_masking true -use_sw_tback -evalue 1e-5 -outfmt "6 qseqid sseqid sgi staxids pident length mismatch gapopen qstart qend sstart send evalue bitscore" -num_threads 16 -out ${SGE_TASK_ID}.tsv
Output would be 1.tsv - 16.tsv, which could be merged easily. I'm having a particularly hard time understanding the '#$ -l h_vmem=2G -pe smp 8' line.

Last edited by rhinoceros; 04-13-2013 at 09:25 AM.
Old 04-15-2013, 04:46 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Quote:
Originally Posted by rhinoceros View Post
I'm having a particularly hard time understanding the '#$ -l h_vmem=2G -pe smp 8' line.
The h_vmem parameter has to do with the memory allocation for the job. This page has info about this parameter: http://www.biostat.jhsph.edu/bit/clu...e.html#MemSpec

The "pe" part refers to a parallel environment (if one is set up on your cluster). This relates to the "num_threads" part of your blast jobs, as described here: http://www3.imperial.ac.uk/bioinfsup..._parallel_jobs

You may want to confer with your local SGE admin about the right parameters to set for the queues you have access to.
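One detail worth knowing about that line: on a typical Grid Engine configuration, resources requested with -l (including h_vmem) are granted per slot, so a parallel job's total ceiling is the per-slot value multiplied by the slot count from -pe. Under that assumption, '-l h_vmem=2G -pe smp 8' caps the whole job at 16G, not 2G. A quick sanity check of the arithmetic:

```shell
# per-slot limit (GB) times slots in the parallel environment
# gives the job-wide memory ceiling under a per-slot h_vmem policy
slots=8
h_vmem_gb=2
total_gb=$((slots * h_vmem_gb))
echo "job ceiling: ${total_gb}G"   # job ceiling: 16G
```

Note that the script also passes '-num_threads 16' while requesting only 8 slots, so blast would try to use twice the CPUs the scheduler reserved; keeping the two numbers equal avoids oversubscribing the node.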
Old 04-18-2013, 10:24 AM   #9
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Smile

Everything is working now. My script blastp.sh is as follows:

Code:
#!/bin/bash
#$ -V
#$ -N blastp
#$ -j y
#$ -cwd
#$ -pe orte 16
/path/to/ncbi-blast/2.2.28+/bin/blastp -query input.${SGE_TASK_ID} -db /path/to/db/nr -lotsOfFlags -outfmt 6 -num_threads 16 -out ${SGE_TASK_ID}.tsv
The input is a fasta file that I have split into 20 parts with fastasplitn (input.1, input.2, ..., input.20). I call the script from the same dir as follows: qsub -t 1-20:1 blastp.sh

So in this case I'm running 20 parallel blasts with 16 threads each (though some of them are actually still in the queue). Output is 1.tsv, 2.tsv, ..., 20.tsv, which I'll merge by

cat 1.tsv 2.tsv .. 20.tsv > blast_result.tsv

And that's that. I hope others might find this useful..
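One pitfall worth noting for the merge: a glob like 'cat *.tsv' concatenates in lexical order (1, 10, 11, ..., 2), so a loop over seq keeps the files in task order. A small sketch, with dummy files standing in for the real per-task BLAST output:

```shell
# dummy per-task outputs standing in for the real 1.tsv .. 20.tsv
for i in $(seq 1 20); do echo "hit_from_task_$i" > "$i.tsv"; done

# merge in task order; 'cat *.tsv' would interleave 1,10,11,...,2 instead
for i in $(seq 1 20); do cat "$i.tsv"; done > blast_result.tsv
```

Since tabular BLAST output is per-query anyway, the order only matters if downstream tools expect queries in the original input order.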