I'm working on a string-processing algorithm for an application unrelated to biology, but I think it might be useful for processing genome data. It's been surprisingly hard to find out! Hoping some bio-computing sophisticate can shed light on this.
Problem the algorithm solves now: searching huge strings to find matches to corrupted targets. You have a target sequence of anywhere from, say, half a kilobyte up to megabytes (e.g., a sequence of base pairs or any other character data). You want to find partial matches to the target in strings of any length (e.g., multiple gigabytes). Caveat: this algorithm does not deal well with short targets (say, under a couple of hundred characters), but it's fast at finding longer strings that are significantly corrupted or changed.
One reason it seems like it might be useful: a linear-time pre-processing step results in metadata that is tiny compared to the input. Depending on parameters, maybe 1/50 to 1/1000 of the original size. Once the pre-processing is done, you only deal with the metadata for searches, so you can search against hundreds of gigabytes of genomes in memory, even on a small platform.
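To make the memory claim concrete, here's a back-of-envelope check using the 1/50 and 1/1000 ratios quoted above (the 300 GB corpus size is just an illustrative assumption):

```python
# Rough index-size estimate from the stated metadata ratios.
corpus_bytes = 300 * 10**9            # assume ~300 GB of genome text
for ratio in (1/50, 1/1000):
    index_gb = corpus_bytes * ratio / 10**9
    print(f"metadata at {ratio:.4f} of input: ~{index_gb:.1f} GB")
# -> ~6 GB at 1/50 and ~0.3 GB at 1/1000, i.e. it fits comfortably
#    in RAM on an ordinary workstation.
```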
This is not an alignment algorithm per se; it only finds the locations where substrings that will give a good alignment exist (more precisely, it finds substrings whose edit distance from the target is less than some parameter you give it).
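For clarity, here is what I mean by that criterion, written as a brute-force reference. This is just the standard semi-global edit-distance dynamic program, far too slow for real genomes; it only defines the answer the fast search is supposed to return:

```python
def best_substring_edit_distance(text: str, target: str) -> tuple[int, int]:
    """Smallest edit distance between `target` and any substring of `text`,
    plus the (1-based) end position in `text` of the best-matching substring.
    O(len(text) * len(target)) -- a reference definition, not a fast search."""
    m = len(target)
    prev = list(range(m + 1))   # row 0: aligning target prefixes to an empty substring
    best_dist, best_end = m, 0
    for i, ch in enumerate(text, start=1):
        curr = [0]              # column 0 is free: a match may start anywhere in text
        for j in range(1, m + 1):
            cost = 0 if ch == target[j - 1] else 1
            curr.append(min(prev[j] + 1,          # text char left unmatched
                            curr[j - 1] + 1,      # target char left unmatched
                            prev[j - 1] + cost))  # match or substitution
        if curr[m] < best_dist:
            best_dist, best_end = curr[m], i
        prev = curr
    return best_dist, best_end

# e.g. best_substring_edit_distance("GGGACGTTAGGG", "ACGTTTAG") -> (1, 10):
# the substring "ACGTTAG", ending after the 10th text character, is one edit away.
```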
My question is several-fold:
- Am I correct that finding the location of the alignment cheaply is half the battle? (for some definition of "half".)
- If this is true, how FAST does finding the align-able substring have to be to be useful? I believe that most procedures for finding these substrings are significantly worse than linear--this might be a false assumption.
- A big-O answer would be useful. Answers in terms of wall-clock time might also help. E.g., do existing algorithms take seconds, minutes, or hours to find matches in a gigabyte of sequence?
So, in other words, my question is: would a function like the following have applications, and if so, how fast would it need to be in order to be useful? (There's a rough code sketch of this interface after the list.) I'm making no assumptions about the alphabet; it could be {G,A,T,C} or any other set of characters.
- You give it a target string, plus a parameter that indicates the minimum quality you require of matches.
- The match-quality parameter is actually a maximum edit distance from the target.
- You also have to give it the names of the genomes you want searched.
- You get back a list of indexes into the genomes where the matches occur.
- Of course this assumes you've already pre-processed the genomes in question. If not, naturally, you need to supply them the first time.
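Here's the kind of signature I have in mind, as promised above. All of the names (build_index, search, Hit) are placeholders for illustration, not an existing tool or library:

```python
from typing import NamedTuple

class Hit(NamedTuple):
    genome: str       # which pre-processed genome the hit is in
    position: int     # index into that genome where the matching region starts

def build_index(genome_name: str, sequence: str) -> None:
    """One-time, roughly linear pre-processing pass: store the compact
    metadata ("index") for this genome so later searches never need to
    touch the raw sequence again."""
    ...

def search(target: str, max_edit_distance: int,
           genome_names: list[str]) -> list[Hit]:
    """Return every location in the named (already-indexed) genomes where
    some substring is within max_edit_distance of the target."""
    ...

# Typical call, assuming the genomes were indexed earlier:
# hits = search(target_seq, max_edit_distance=250,
#               genome_names=["c_elegans", "GRCh38"])
```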
The algorithmic complexity is hard to state simply because there are several parameters, but in a nutshell: computing the metadata is more or less linear (you can compute it approximately as fast as you can read the input), and an actual search is linear in the size of the metadata, though it varies significantly with target size.
A crude demo on one processing thread searches the 100 Mbp C. elegans genome for 5 kbp sequences in about 3 seconds, returning the location of any sub-sequence within approximately a given edit distance of the target. Extrapolating, the same search against the human genome takes about 10 minutes. In practice, you can use as many hardware cores as you want, so an N-core server runs about N times as fast.
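As a rough sanity check on what that means for a bigger machine (using the single-thread human-genome figure quoted above; the 16-core count is just an example):

```python
# Back-of-envelope parallel scaling from the stated single-thread time.
single_thread_seconds = 10 * 60   # ~10 minutes against the human genome
cores = 16                        # illustrative assumption
print(f"~{single_thread_seconds / cores:.0f} s on {cores} cores")  # ~38 s
```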
If you got this far, you're patient indeed! Can you offer any insight, even if it's only to say "don't waste your time---it's a solved problem!"