**yaximik** · 01-03-2013, 06:45 AM

Thanks a lot for the useful input. As I mentioned, the main challenge is that I have to work without a reference sequence, which leaves only a few options for de novo discovery. BLAT is indeed an attractive option; I have not tried it yet because I need a local install due to the data volume. This dataset was generated from only a few MiSeq runs, so as it grows a cluster or cloud will be unavoidable. Still, I need to design a smart heuristic de novo strategy first, otherwise any amount of capacity will eventually be saturated. That is what I am trying to do on the local machine. De novo discovery has a dedicated forum, but I could not find much discussion there. Any ideas are appreciated and will be taken into consideration.

**pallevillesen** · 01-04-2013, 12:30 AM

I'm not sure I completely understand what you are actually trying to do.

De novo discovery of what? (genes?) You're sequencing an organism with unknown genome sequence? RNA-seq? I can't see what kind of data it is you're blasting.

**yaximik** · 01-04-2013, 06:17 AM

I'm not sure I completely understand what you are actually trying to do.

This dataset and an earlier, much smaller one were obtained from ancient DNA extracted from a roughly 1000-year-old bone. I examined the smaller dataset with online tools, starting with blastn, and it was obviously heavily contaminated with bacterial DNA. However, even at that point, many reads and assembled contigs did not match known sequences perfectly, and many did not match anything at all. I used the online DeconSeq to filter known bacterial, archaeal, viral, human and mouse sequences out of the smaller dataset at a 94% identity threshold. Most of the remaining 80% did not match anything with blastn, even with very loose criteria. When I lowered the identity threshold to 70%, about 60% of the dataset still passed the filter.

I have not used this strategy on the bigger dataset yet because all the online servers kick me out, so it is likely still heavily contaminated with known sequences. What I eventually want to know is what the sequences that do not match anything are: whether they carry biologically meaningful information or are simply artefactual junk (in both cases, as far as I can judge). Since I know from the smaller dataset that blastn gives me no clue, I want to try blastx in the hope that it will be more sensitive in detecting biologically meaningful information. Does this make sense?
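To make the filtering step above concrete, here is a minimal sketch of keeping only the reads that have no hit at or above a chosen identity threshold. It is not how DeconSeq works internally (DeconSeq also considers coverage). It assumes hits are available in BLAST+ tabular form (`-outfmt 6`, where the third column is percent identity). The function name `reads_without_hits` and the example hits are made up for illustration.

```python
import csv
import io


def reads_without_hits(read_ids, blast_tabular, min_identity=94.0):
    """Return the read IDs that have no hit at or above min_identity percent."""
    matched = set()
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, percent_identity = row[0], float(row[2])
        if percent_identity >= min_identity:
            matched.add(query_id)
    return [read_id for read_id in read_ids if read_id not in matched]


# Made-up hits in the 12-column tabular layout (not real data).
EXAMPLE_HITS = (
    "read1\tbacterium_x\t98.5\t100\t1\t0\t1\t100\t5\t104\t1e-40\t170\n"
    "read2\tbacterium_y\t72.0\t100\t28\t0\t1\t100\t9\t108\t1e-10\t60\n"
)

all_reads = ["read1", "read2", "read3"]
strict = reads_without_hits(all_reads, io.StringIO(EXAMPLE_HITS), 94.0)
loose = reads_without_hits(all_reads, io.StringIO(EXAMPLE_HITS), 70.0)
print(strict)  # reads kept at the 94% threshold
print(loose)   # reads kept at the 70% threshold
```

Lowering the threshold can only shrink the set of surviving reads, which is the pattern described above: fewer reads pass at 70% than at 94%.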

**yaximik** · 01-04-2013, 06:35 AM

Originally posted by pallevillesen

So have you simply tested the

-num_threads <Integer, >=1>
Number of threads (CPUs) to use in the BLAST search
Default = `1'

option as suggested? And how did it go?

Anyway, for blasting 300k sequences I would find a cluster somewhere and parallelize the job across as many nodes as possible.

As a self-educated user negotiating steep learning curves, I am not sure I understand how multithreading is used. Does it apply to cores (the current box has 16) or to CPUs (it has 2)? What happens if I specify 16 when the box actually has only 2 CPUs?

I just found that the T610 cannot accept any GPUs, either because of inadequate PCI slots or because it lacks the needed power connectors. The two options are to use the vCORE GPU servers that remain in production, which can be connected to the T610, or to get an entirely separate GPU server. The first option is about to be discontinued, and the second is more expensive. Any advice on reliable GPU server vendors? Unfortunately, many of them are Windows-based, which is not a good option for bioinformatics.

**GenoMax** · 01-04-2013, 06:47 AM

Originally posted by yaximik

As a self-educated user negotiating steep learning curves, I am not sure I understand how multithreading is used. Does it apply to cores (the current box has 16) or to CPUs (it has 2)? What happens if I specify 16 when the box actually has only 2 CPUs?

I just found that the T610 cannot accept any GPUs, either because of inadequate PCI slots or because it lacks the needed power connectors. The two options are to use the vCORE GPU servers that remain in production, which can be connected to the T610, or to get an entirely separate GPU server. The first option is about to be discontinued, and the second is more expensive. Any advice on reliable GPU server vendors? Unfortunately, many of them are Windows-based, which is not a good option for bioinformatics.

Multi-threading applies to the cores on your CPUs. That said, you can't expect a 16-fold speedup in your BLAST searches even if you use all 16 cores. Other factors (mainly I/O limitations such as disk and bus bandwidth) quickly come into play and will limit how fast you can compute on the T610.
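As a rough illustration of the cores-versus-CPUs point, the sketch below asks the operating system how many logical cores it sees and builds a `blastx` command line with `-num_threads` capped at that number. The cap is a choice made for this example, not something BLAST+ requires. The file and database names are placeholders, and `blast_command` is a hypothetical helper.

```python
import os


def blast_command(query, db, out, threads=None):
    """Build a blastx command line, capping threads at the logical core count."""
    # os.cpu_count() reports logical cores (hyper-threads included),
    # summed over all CPU sockets; it can return None on some platforms.
    logical_cores = os.cpu_count() or 1
    if threads is None:
        threads = logical_cores
    threads = max(1, min(threads, logical_cores))
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",
        "-num_threads", str(threads),
    ]


# Placeholder names; nothing is executed here, the command is only printed.
print(" ".join(blast_command("reads.fasta", "nr", "hits.tsv", threads=16)))
```

On a box with 2 CPUs and 16 cores in total, the operating system reports at least 16 logical cores, so asking for 16 threads is a request for 16 cores, not 16 CPUs.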

If this is a one-time project, then splitting the workload across a cluster (as others have suggested previously) may be the most efficient route to follow.

If this is a personally owned machine then you could always look at upgrading components (motherboard/power supply) and pursue the GPU compute route.

**yaximik** · 01-04-2013, 06:58 AM

If this is a personally owned machine then you could always look at upgrading components (motherboard/power supply) and pursue the GPU compute route.

Thanks GenoMax, I did not think about that route.

**severin** · 01-09-2013, 12:56 PM

retraction

Originally posted by yaximik

Thanks for the input; it is overwhelming how much I need to learn. Before I got the MiSeq, I was using online BLAST with its pretty interface. Once NCBI kicked me out because of the sharply increased data volume, I had to make a local install. Now it is time to learn the nuts and bolts of the CLI. Thanks for the advice, maupb and mjp, it is time to digest the manual.

Wow, that looks very promising! Could you enlighten me a bit more, as this is yet another set of nuts and bolts I have to learn on the fly? GPU means graphics processing unit, correct? Perhaps I could find and install one, but how do you plug it in to work with blast+, I presume instead of the motherboard CPU?

I thought I had figured it all out too, but on closer inspection it appears that if I run the same set with multiple CPUs, there is no significant speedup. In fact, I found that 32 CPUs were faster than the GPU. I am looking into other ways of speeding up BLAST as well, or perhaps using a different program altogether.

**Kennels** · 01-09-2013, 09:33 PM

A while ago I contacted NCBI about this, and here is their reply:

####################
Not all phases of the algorithm are multi-threaded, which often means that even with "-num_threads" set >1, only 1 cpu is used. You might try formatting the large fasta file as a database and run the gene sequence against that, but that may not work either. It often takes a large input file against a large database to invoke multi-threading; it is not implemented for the traceback or formatting phases.
#####################
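One way to see why partially multi-threaded code scales poorly is Amdahl's law: if only a fraction p of the run time can use threads, the best possible speedup with n threads is 1 / ((1 - p) + p / n). The sketch below just evaluates that formula. The fractions are made-up illustrations, not measurements of BLAST.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Upper bound on speedup when only part of the work is multi-threaded."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Hypothetical fractions of run time spent in the threaded phases.
for fraction in (0.5, 0.9, 1.0):
    print(fraction, round(amdahl_speedup(fraction, 16), 2))
```

If, say, only half of the run time were threaded, 16 threads would give less than a 2-fold speedup, which matches the kind of disappointment reported in this thread.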

There are apparently other versions of BLAST that are faster (e.g. mpiBLAST, though this is probably overkill), so, as others have mentioned, it is probably fastest to split your input, give each piece 1 or 2 cores, and run the batches simultaneously.
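Splitting the input could look something like the sketch below, which deals FASTA records round-robin into a fixed number of chunks so that each chunk can be searched separately. `split_fasta` is a hypothetical helper working on lines already read into memory; a real run would write each chunk to its own file.

```python
def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of lines."""
    chunks = [[] for _ in range(n_chunks)]
    record_index = -1
    for line in lines:
        if line.startswith(">"):
            record_index += 1  # a header line starts a new record
        if record_index < 0:
            continue  # ignore anything before the first header
        chunks[record_index % n_chunks].append(line)
    return chunks


# Made-up records for illustration.
example = [">a\n", "ACGT\n", ">b\n", "GG\n", "TT\n", ">c\n", "A\n"]
for number, chunk in enumerate(split_fasta(example, 2)):
    print(number, "".join(chunk).replace("\n", " "))
```

Round-robin assignment keeps the chunks similar in record count, though not necessarily in total sequence length.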

**narain** · 01-10-2013, 01:46 PM

Try mpiBLAST.

**yaximik** · 01-18-2013, 07:23 AM

Try mpiBLAST.

My understanding is that the open-source version is pretty old (v.1.6 was released in 2010) and is based on legacy BLAST (2.2.20), not BLAST+. I guess the only option for now is to use its commercial spin-off, AbokiaBLAST. Does anyone have experience with it? Is it worth the cost?

**balaji** · 01-31-2013, 12:51 AM

How about using usearch? They compare usearch with blastx (sensitivity vs. speed). I have used usearch on short nucleotide sequences of 50 bp, searching 3 million against 3 million. You may consider it instead of blastx.

**yaximik** · 01-31-2013, 07:18 AM

How about using usearch?

Well, how much is licensing for 64-bit?

**balaji** · 02-01-2013, 01:00 AM

I am using the 32-bit version; the 64-bit version costs nearly $900.

**OpenHero** · 08-29-2013, 08:01 AM

Maybe you can try G-BLASTN, a GPU-accelerated blastn:

http://www.comp.hkbu.edu.hk/~chxw/software/G-BLASTN.html

**rhinoceros** · 08-29-2013, 12:45 PM

Originally posted by Kennels

There are apparently other versions of BLAST that are faster (e.g. mpiBLAST, though this is probably overkill), so, as others have mentioned, it is probably fastest to split your input, give each piece 1 or 2 cores, and run the batches simultaneously.

This. For example, running four 4-thread blasts in parallel is much faster than running a single 16-thread blast (it needs a lot more RAM too, and I/O that is not a bottleneck). I've been meaning to experiment with this using GNU Parallel, although I usually run my blasts on a cluster via SGE.
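A minimal sketch of the "several smaller jobs at once" idea, assuming the query has already been split into chunk files and that `blastx` is on the PATH. The helper names (`plan_jobs`, `run_batches`) and the chunk file names are hypothetical. The `runner` argument exists only so the sketch can be tried without BLAST installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def plan_jobs(total_cores, threads_per_job):
    """Number of jobs that fit at once, e.g. 16 cores / 4 threads = 4 jobs."""
    return max(1, total_cores // threads_per_job)


def run_batches(chunk_files, db, threads_per_job=4, total_cores=16,
                runner=subprocess.run):
    """Run one blastx process per chunk, a few at a time."""
    def run_one(chunk):
        cmd = [
            "blastx", "-query", chunk, "-db", db,
            "-out", chunk + ".tsv", "-outfmt", "6",
            "-num_threads", str(threads_per_job),
        ]
        runner(cmd, check=True)  # raises if the process fails
        return cmd

    # Threads here only wait on the external processes; the work is in blastx.
    workers = plan_jobs(total_cores, threads_per_job)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, chunk_files))


# Dry run with a stand-in runner that prints instead of executing.
run_batches(["chunk0.fasta", "chunk1.fasta"], "nr",
            runner=lambda cmd, check: print(" ".join(cmd)))
```

As noted above, each concurrent job loads its own copy of the database into memory, so the number of simultaneous jobs is limited by RAM and disk I/O as much as by core count.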
