  • What are today's computational bottlenecks?

    I have some questions about practical bioinformatics bottlenecks... which common or important bioinformatic tasks are the slowest or most limiting?
    My background is in text compression, databases, and parallel processing, but I also know the basic bioinformatic algorithms, since they're so closely related.


    But what I don't know is what, in practice, the bottleneck is for most bioinformatics users. I'm hoping to focus some new research (using cheap PC graphics cards) on bioinformatic algorithms, and I want to work on speeding up the tasks that are truly a problem. Many applications of this kind can be sped up 10-100 times, so it's worthwhile to try to get some of them working in practice. But which ones should be sped up first? Probably the tasks that are both common AND slow.

    I'd love it if you'd share your opinions, experiences, even HOPES for what kind of tools, speedups, or new abilities you'd like. Again, I have a good technical background, but no real sense of which tasks still make actual biologists grind their teeth in frustration.

    Some examples:
    • Is BLAST alignment speed an issue? Would you yell with joy if a new tool gave identical results and was twice as fast?
    • Are you happy with BLAST but just want to do much bigger alignments? Something like "here's my 1M-nucleotide sequence, give me the top 1000 local alignments" - and get an answer in 10 seconds?
    • Or maybe database searching? You want to say "here's 10,000 nucleotides, please search every genome in GenBank and give me the best hits from everything - in 2 seconds, like a Google search!"
    • Or de novo assembly? Is it a huge problem to take shotgun fragments and burn a zillion CPU hours assembling a genome?
    • Or maybe you always run a high-quality Smith-Waterman alignment as a double check, and it takes a week to cook, and that really becomes a big issue?


    Those are just examples off the top of my head, and I don't know whether those abilities are actually desperately desired. And of course there are likely tasks I haven't even heard of that are a limitation... please teach me.

    Again, what I'm really trying to understand is which computational tasks are common but TOO SLOW. Or too size-limited (maybe BLAST is fast for you, but only because you stick to short sequences, since the longer ones you'd rather use are too slow).

    I appreciate any suggestions, stories, or pleas... and links to other forums that might help me learn. Any feedback is welcome.
    I'll also be happy to discuss which algorithms modern hardware can help with. You may be surprised.

    Thanks!

  • #2

    Originally posted by GerryB View Post
    I'd love it if you'd share your opinions, experiences, even HOPES for what kind of tools, speedups, or new abilities you'd like.
    Sequence similarity searching is always something we wish were faster. Query sets and databases keep getting larger. The topic that is often neglected is PROTEIN-PROTEIN similarity search - its simpler DNA-DNA counterpart gets most of the attention.

    DNA-DNA is simpler, has more scope for indexing thanks to its smaller alphabet, and is more redundant/compressible. PROTEIN, on the other hand, has a larger alphabet and is close to incompressible, so many of the speed-up tricks don't work. The whole BLOSUM/PAM similarity-matrix business also means PROTEIN searches carry a larger constant factor in their time complexity.
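    To make that constant-factor point concrete, here is a toy Smith-Waterman sketch (illustrative only, not an optimized tool): DNA scoring can be a cheap match/mismatch test, while protein scoring pays for a substitution-matrix lookup in every cell. The matrix values below are made up, NOT real BLOSUM entries.

[CODE]
def smith_waterman(a, b, score, gap=-4):
    """O(len(a) * len(b)) local alignment; `score` is a per-residue-pair callable."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # match/substitute
                H[i - 1][j] + gap,                            # gap in b
                H[i][j - 1] + gap,                            # gap in a
            )
            best = max(best, H[i][j])
    return best

# DNA: a cheap equality test is enough.
dna_score = lambda x, y: 2 if x == y else -1

# Protein: every cell pays for a matrix lookup (toy values, NOT real BLOSUM).
toy_matrix = {("A", "A"): 4, ("A", "R"): -1, ("R", "A"): -1, ("R", "R"): 5}
prot_score = lambda x, y: toy_matrix.get((x, y), -2)

print(smith_waterman("ACGT", "ACGGT", dna_score))   # DNA-style scoring
print(smith_waterman("ARRA", "ARA", prot_score))    # protein-style scoring
[/CODE]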

    I'm guessing you are looking at GPU implementations. You should look at HMMER. The current alpha version is poised to support SMP (multiple threads), MPI (clusters), SIMD (vector registers), GPU (CUDA?), and CellSPU (Sony PS3 etc.). HMMER is a form of PROT-PROT searching using profile HMMs, and it can never be fast enough :-)

    Torst



    • #3
      Creating an auto-tuning Bowtie (or any other good genomics algorithm) may be a good idea, since many biologists only have cheap PCs to crunch their next-generation sequencing data. GPGPU is also a good idea - see MUMmerGPU and the "SOAP on GPU" project at UCSD (http://iacs5.ucsd.edu/~tpremchi/index.html). With advances in next-generation sequencing, or even the generation after that, we will soon be overwhelmed by the data deluge.



      • #4
        Originally posted by Torst View Post
        Sequence similarity searching is always something we wish were faster. Query sets and databases keep getting larger. The topic that is often neglected is PROTEIN-PROTEIN similarity search - its simpler DNA-DNA counterpart gets most of the attention.

        DNA-DNA is simpler, has more scope for indexing thanks to its smaller alphabet, and is more redundant/compressible. PROTEIN, on the other hand, has a larger alphabet and is close to incompressible, so many of the speed-up tricks don't work. The whole BLOSUM/PAM similarity-matrix business also means PROTEIN searches carry a larger constant factor in their time complexity.

        This is exactly the kind of information I'm hoping to learn.
        When you say that protein-protein searches are neglected and need to be sped up, do you mean BLASTP-like searches? Or HMM homology search? Your mention of HMMER is a great hint as well (especially since I've been reading their book these past two weeks!).

        What's the specific tool or search you'd like to see? Again, you can go a little wild and blue-sky dream... My background is huge terabyte databases with sub-second fuzzy searches, and HMM-style alignment is a very complex form of "fuzzy", but it's still related. Are you saying you'd just love an hmmsearch that ran 50 times faster, and you'd scream with joy and email me your firstborn child? Or do you mean that P-P alignment and motif finding need more research, better databases, and "more love", like the DNA boys get?

        What's interesting to me about protein (again, biased by my background) is that the larger alphabet and larger scoring matrices don't seem like they'd require a big algorithmic change. Even the DNA tools I've been experimenting with are generic and handle alphabets of up to 255 symbols.
        Now, I realize I'm still very ignorant of some of the practical aspects, but that's also why I'm really enjoying these answers - they're guiding my own experiments (and giving me even more papers to study).



        • #5
          Originally posted by xuying View Post
          Creating an auto-tuning Bowtie (or any other good genomics algorithm) may be a good idea, since many biologists only have cheap PCs to crunch their next-generation sequencing data. GPGPU is also a good idea - see MUMmerGPU and the "SOAP on GPU" project at UCSD (http://iacs5.ucsd.edu/~tpremchi/index.html). With advances in next-generation sequencing, or even the generation after that, we will soon be overwhelmed by the data deluge.
          Now this is especially interesting, since Bowtie's alignment mapping is extremely similar to some of the database-throughput problems I've dealt with before.

          Bowtie's throughput can exceed 30 million reads an hour with only 1 GB of RAM. Is this a big limitation? (Again, this is a serious question.)
          If you have a 454 machine sitting next to you spitting out sequences, how fast does it deliver them? It seems it can produce about 100,000 reads an hour.
          So my dumb question (again, laugh at my practical ignorance) is: why would you need Bowtie-like short-read alignment to be much faster?
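          Here is the back-of-envelope arithmetic behind that question, using only the figures quoted in this thread (rough, era-specific numbers, not benchmarks):

[CODE]
# Figures quoted in this thread; rough numbers, not benchmarks.
bowtie_reads_per_hour = 30_000_000   # claimed Bowtie throughput
sequencer_reads_per_hour = 100_000   # rough 454 output rate

headroom = bowtie_reads_per_hour / sequencer_reads_per_hour
print(f"the aligner can consume ~{headroom:.0f}x the sequencer's output rate")
[/CODE]

          On those numbers the aligner outruns the machine by a factor of ~300, which is exactly why I'd expect the sequencing itself to be the bottleneck.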

          Maybe you like running an alignment a few hundred different ways with different fitting criteria, to see whether your matches are sensitive to the settings?

          So my question is how much you use Bowtie/SOAP-like alignment programs, and how their speed affects your workflow. If your lab has 20 PCs running these alignments continuously, I'm really curious why! It seems to me that the sequencing itself would be the bottleneck, and a desktop PC could keep up with the data pretty easily.
          I'm sure I'm wrong about that (or you wouldn't have brought it up), so I'm hoping to learn where my misconceptions are.

          Thanks!



          • #6
            Yes, sometimes we want to shuffle reads many times and map them back to the genome to calculate the false-positive rate of mapping. I am also curious how fast different kinds of algorithms could be if you implemented them on different platforms (GPU or Cell). Mapping is just the first stage of the analysis pipeline. As reads get longer, we will need faster aligners that allow more mismatches. Maybe in the near future we will switch back to BLAST-like tools. But given the sheer number of reads, a new, faster, Bowtie-like, memory-efficient program might be more suitable for us.
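            A minimal sketch of that shuffle-and-remap control: permuting each read's bases keeps its length and composition but destroys the real signal, so any shuffled read that still maps is almost surely a spurious hit. The map_read function is a placeholder for a call to whatever aligner is in use.

[CODE]
import random

def shuffle_read(read, rng=random.Random(0)):
    """Permute bases: keeps length and composition, destroys real signal."""
    bases = list(read)
    rng.shuffle(bases)
    return "".join(bases)

def estimate_false_positive_rate(reads, map_read):
    """`map_read(seq) -> bool` stands in for the real aligner (e.g. a
    subprocess wrapper around bowtie). Shuffled reads that still map are
    (almost surely) spurious, so their fraction estimates the FP rate."""
    hits = sum(map_read(shuffle_read(r)) for r in reads)
    return hits / len(reads)
[/CODE]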



            • #7
              Originally posted by xuying View Post
              Yes, sometimes we want to shuffle reads many times and map them back to the genome to calculate the false-positive rate of mapping. I am also curious how fast different kinds of algorithms could be if you implemented them on different platforms (GPU or Cell). Mapping is just the first stage of the analysis pipeline. As reads get longer, we will need faster aligners that allow more mismatches. Maybe in the near future we will switch back to BLAST-like tools. But given the sheer number of reads, a new, faster, Bowtie-like, memory-efficient program might be more suitable for us.
              This is again great information for me - thanks!
              I'm really impressed by Bowtie... they have a smooth implementation and exactly the right clever algorithm for exact and near-exact matches.

              Again, I'm surprised that Bowtie wouldn't be fast enough for anyone! But maybe that goes back to your desire to detect false positives. Perhaps what you'd rather have is Bowtie-like matching with better statistics: confidence levels for which reads are properly matched, and/or how accurate your final assembled genome is.
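              For instance (just a sketch of the kind of statistic I mean, not how Bowtie actually works): compare the best alignment score against the runner-up and turn the gap into a Phred-style mapping quality. The probability model below is a toy assumption, not any real tool's formula.

[CODE]
import math

def mapping_quality(best_score, second_best_score, scale=1.0):
    """Phred-style confidence from the gap between the two best hits.
    The exponential model here is a toy assumption for illustration,
    not any specific tool's formula."""
    gap = best_score - second_best_score
    p_wrong = math.exp(-scale * gap) / (1.0 + math.exp(-scale * gap))
    return -10.0 * math.log10(max(p_wrong, 1e-30))

print(mapping_quality(60, 58))  # ambiguous placement -> low confidence
print(mapping_quality(60, 20))  # clearly unique hit  -> high confidence
[/CODE]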

              Just brainstorming here, but I could imagine doing sequence assembly "in real time" as the sequencer runs... you might even decide to run the sequencer for a shorter or longer time as the assembly reports the "realtime" quality of the final assembled sequence.

              With respect to mismatch support: are most errors in new-generation sequencers just single-letter errors? Do indels ever occur? Or are the ends of the reads always the problem, which is why they're often trimmed? (Maybe that's why you like running matches several times - to see whether your trimming affected the final assembly?)

              Bowtie already handles a few mismatches, but would it be useful to allow a much more generous error rate, like 10 or 20 errors per read? Or are you ideally hoping for a BLAST-like alignment with gaps and multiple errors - or is that overkill for these short reads? (I had the impression you never get gaps with the short-read techniques... am I wrong?)



              • #8
                You are right - statistical support in aligning short reads would be good, but we don't have many choices at the moment. Different sequencing technologies have different error models: some introduce indels, while others produce more single-nucleotide mismatches, depending on how they detect signals. People trim reads because of read quality, and also for particular applications (for smRNAs, the 3' adaptor should be trimmed). We deal with 36 bp reads without considering gaps (for Solexa data that's fine, I think). But even for longer reads, we won't go crazy with dozens of mismatches. Contiguous gaps may be more important - and that involves much more computation, right?
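                The 3' adaptor trimming mentioned above can be as simple as the sketch below: look for the adaptor, or a prefix of it hanging off the read's end. Exact matching only - real trimmers also tolerate sequencing errors in the adaptor - and the adaptor sequence shown is a placeholder, not a real kit's.

[CODE]
def trim_3prime_adaptor(read, adaptor, min_overlap=5):
    """Remove a 3' adaptor from a small-RNA read. Handles a full adaptor
    hit anywhere in the read, or a prefix of the adaptor at the 3' end
    (for reads that run only partway into it). Exact matching only."""
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    for k in range(len(adaptor) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read

# Placeholder adaptor sequence, not a real kit's.
print(trim_3prime_adaptor("ACGTACGTTCGTATGC", "TCGTATGC"))  # -> ACGTACGT
[/CODE]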



                • #9
                  Yes, gaps take more searching, but they're not too much harder than single errors.
                  What gets trickier is a higher error rate (like 5%)... there are great algorithms for handling that, but I don't think such error rates are common in sequencing, are they?
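                  To show what I mean by "more searching but not much harder": mismatch-only comparison is a single linear pass, while allowing gaps needs a dynamic-programming table - O(n*m) in general, though banding it to width ~2k+1 gets back near-linear time when at most k errors are allowed. A minimal sketch:

[CODE]
def hamming(a, b):
    """Mismatch-only: one linear pass; requires equal lengths (no gaps)."""
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Mismatches AND gaps: full DP over a rolling row, O(len(a) * len(b)).
    Banding to width ~2k+1 recovers near-linear time when at most k
    errors are allowed."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                       # gap in b
                d[j - 1] + 1,                   # gap in a
                prev + (a[i - 1] != b[j - 1]),  # match / mismatch
            )
    return d[-1]

print(hamming("ACGTACGT", "ACGTTCGT"))       # 1 mismatch
print(edit_distance("ACGTACGT", "ACGACGT"))  # 1 gap
[/CODE]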

                  About gaps: are short-sequence gaps always near each other? Do you see 20 bp of match, a gap of 13, then the next 40 bp - that kind of behavior? Or are they totally unrelated, because different DNA loops rejoined, so you might have a 20 bp match, a gap of 376271 all the way over to a different chromosome, then 40 more bp of match?

                  In the "wouldn't it be nice" design category, what you'd like to see is a high-throughput short sequence aligner. One that probably matches against a base framework of a known genome (but probably you'd also like the ability for raw assembly too, with no starting framework?). You'd like to allow errors.. what rates are reasonable? 1%? 5%? You'd like to have gap support. You'd like to use untrimmed sequences, but you want the aligner to "know" the ends may be bad and not weight them if they're not consistent. You'd like some statistical feedback over how good the final assembly is.. maybe also about how good any particular sequence really matched is.

                  What else? This is all so interesting to me...

                  If you did have such a tool, how would it change your daily science? Would it just be "nice to have", or would it be "wow, we can do so much more now - we need to buy more sequencers because it's so useful!"?

                  Stupid side question: how much do modern sequencers even cost, say if I wanted my own 454 machine? They spit out 100,000 reads an hour - does that mean I could sample my own genome at 10X coverage in about a week? (Hmm, obviously not, or the genomics X PRIZE would have been won long ago.)
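                  Working out my own arithmetic (the read length below is an assumption of mine, not a quoted spec, so treat the result as order-of-magnitude only):

[CODE]
# Reads-per-hour is the figure quoted above; read length is an assumed
# ballpark, so treat the result as order-of-magnitude only.
reads_per_hour = 100_000
read_length_bp = 250            # assumption, not a quoted spec
genome_size_bp = 3_000_000_000  # human genome, roughly
coverage = 10

hours = coverage * genome_size_bp / (reads_per_hour * read_length_bp)
print(f"~{hours:.0f} hours (~{hours / 24:.0f} days) of continuous sequencing")
# -> ~1200 hours: weeks-to-months, not one week, even before library prep,
#    run setup, and assembly.
[/CODE]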
