SEQanswers





 
Old 04-18-2012, 12:21 PM   #1
coatespt1848
Junior Member
 
Location: new york

Join Date: Apr 2012
Posts: 7
Serious question for an algorithms person

I'm working on a string-processing algorithm for an application unrelated to biology, but I think it might be useful for processing genome data. It's been surprisingly hard to find out! I'm hoping some bio-computing sophisticate can shed light on this.

The problem the algorithm solves now: searching huge strings to find matches to corrupted targets. You have a target sequence anywhere from, say, half a kilobyte up to megabytes (e.g., a sequence of base pairs or any other character data). You want to find partial matches to the target in strings of any length (e.g., multiple gigabytes). Caveat: the algorithm does not deal well with short targets (say, under a couple of hundred characters), but it is fast at finding longer strings that are significantly corrupted or changed.

One reason it seems like it might be useful: a linear-time pre-processing step produces metadata that is tiny compared to the input. Depending on parameters, maybe 1/50 to 1/1000 of the original size. Once the pre-processing is done, searches only touch the metadata, so you can search against hundreds of gigabytes of genomes in memory, even on a small platform.

This is not an alignment algorithm per se--it only finds the locations where substrings that will give a good alignment exist. (More precisely, it finds substrings whose edit distance from the target is less than a parameter you supply.)

My question is several-fold:
  • Am I correct that finding the location of the alignment cheaply is half the battle? (for some definition of "half".)
  • If this is true, how FAST does finding the align-able substring have to be to be useful? I believe that most procedures for finding these substrings are significantly worse than linear--this might be a false assumption.
  • A big-Oh answer would be useful. Answers in terms of clock-time might also help. E.g., do existing algorithms take seconds, minutes, or hours to find matches in a gigabyte?

In other words, my question is: would a function like the following have applications, and if so, how fast would it need to be in order to be useful? I'm making no assumptions about the alphabet (it could be {G,A,T,C} or any other set of characters). There is a rough code sketch of such an interface after this list.
  • You give it a target string, plus a parameter that indicates the minimum quality you require of matches.
  • The minimum match quality is expressed as the edit distance from the target.
  • You also have to give it the names of the genomes you want searched.
  • You get back a list of indexes into the genomes where the matches occur.
  • Of course this assumes you've already pre-processed the genomes in question. If not, naturally, you need to supply them the first time.
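
To make the shape of this concrete, here is a minimal Java sketch (my prototype is crude Java anyway) of the kind of interface I mean. Every name in it is made up for illustration; it is not an existing library.

Code:
import java.util.List;

// Hypothetical interface for the search function described above.
// All names are illustrative only; this is not an existing library.
public interface ApproxSubstringSearch {

    // One hit: which genome, where the matching region starts,
    // and the estimated edit distance from the target.
    final class Match {
        public final String genomeName;
        public final long offset;
        public final int estimatedEditDistance;

        public Match(String genomeName, long offset, int estimatedEditDistance) {
            this.genomeName = genomeName;
            this.offset = offset;
            this.estimatedEditDistance = estimatedEditDistance;
        }
    }

    // One-time, roughly linear pre-processing step; afterwards only the
    // compact metadata is needed for searching.
    void indexGenome(String genomeName, CharSequence sequence);

    // Return locations in the named, already-indexed genomes where some
    // substring lies within maxEditDistance of the target.
    List<Match> search(CharSequence target, int maxEditDistance, List<String> genomeNames);
}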

The algorithmic complexity is hard to state simply because there are several parameters, but in a nutshell, computing the metadata is more or less linear (you can compute it about as fast as you can read the input). An actual search is linear in the size of the metadata but varies significantly with target size.

A crude demo on one processing thread searches the 100 Mbp C. elegans genome for five-kbp sequences in about 3 seconds, returning the location of any sub-sequence within approximately a given edit distance of the target. Extrapolating, the same search against the human genome takes about 10 minutes. In practice you can use as many hardware cores as you want, so an N-core server runs about N times as fast.

If you got this far, you're patient indeed! Can you offer any insight, even if it's only to say "don't waste your time---it's a solved problem!"
Old 04-18-2012, 02:31 PM   #2
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

One 5Kb sequence to the human genome in 10 minutes? Try "bwa-sw" or "bowtie2" as a benchmark.
Old 04-18-2012, 03:09 PM   #3
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

I don't think finding the general location is considered that much of a problem, especially for longer searches. BLAST, for example, just uses a hash where you can pick the word sizes, then applies some heuristics about how many hash hits fall within a certain distance of each other. There are also algorithms based on the Burrows-Wheeler transform; essentially reverse-engineering data-compression algorithms to do search is very popular these days (BWA, Bowtie, ...). I think what most people in bioinformatics are more interested in is accuracy, and if you could show that your algorithm did better at finding the right sequence in the presence of large amounts of non-homologous sequence or error, that would be interesting. For longer sequences, though, the dynamic-programming algorithms (or their approximations) used to compute the actual alignments usually dominate the complexity.
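
As a very rough illustration of the word-hashing idea, here is a toy Java sketch of seed-and-count (not BLAST's actual code, and with none of its heuristics): index every fixed-length word of the reference, then count how many of the query's words hit each region.

Code:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy seed-and-count word index (illustration only, not BLAST itself).
public class WordSeedIndex {
    private final int wordSize;
    private final Map<String, List<Integer>> index = new HashMap<>();

    public WordSeedIndex(String reference, int wordSize) {
        this.wordSize = wordSize;
        for (int i = 0; i + wordSize <= reference.length(); i++) {
            index.computeIfAbsent(reference.substring(i, i + wordSize),
                                  k -> new ArrayList<>()).add(i);
        }
    }

    // Count how many of the query's words hit each reference bin;
    // bins with many hits are candidate regions for a real alignment.
    public Map<Integer, Integer> candidateRegions(String query, int binSize) {
        Map<Integer, Integer> hitsPerBin = new HashMap<>();
        for (int i = 0; i + wordSize <= query.length(); i++) {
            List<Integer> hits = index.get(query.substring(i, i + wordSize));
            if (hits == null) continue;
            for (int refPos : hits) {
                hitsPerBin.merge(refPos / binSize, 1, Integer::sum);
            }
        }
        return hitsPerBin;
    }
}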
Old 04-19-2012, 06:05 AM   #4
coatespt1848
Junior Member
 
Location: new york

Join Date: Apr 2012
Posts: 7
A little more detail

Thanks for the tips. I'll try to scare up those benchmarks you refer to!

Regarding accuracy, to clarify: the algorithm produces very compressed metadata from which you can estimate the edit distance of the target from each region of the search data. What you're actually computing underneath is an estimate of the Levenshtein edit distance of the target from each region of the data being searched.

Thus, you can choose to exclude essentially all spurious matches by choosing a small edit distance, or allow a large edit distance to find very distant resemblances.

Actually, saying it's like the Levenshtein distance isn't quite accurate. If the match has large blocks shuffled around with respect to their order in the target, the estimate you get is the Levenshtein distance of the match with the blocks arranged in the best order.
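
For reference, the exact quantity being estimated is the ordinary Levenshtein distance, which the textbook dynamic program computes in quadratic time (which is exactly why a cheap estimate is attractive). A minimal Java version for comparison:

Code:
// Textbook Levenshtein distance: O(|a| * |b|) time, O(|b|) space.
// The metadata-based method only *estimates* this value.
public static int levenshtein(CharSequence a, CharSequence b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
}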

One other clarification---the 10 minutes is for a single laptop thread. If you ran it on a 16-core server, the figure would be about 0.625 minutes, and so on. But on to those benchmarks!
Old 04-19-2012, 06:27 AM   #5
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

I would hypothesize that 10 minutes on a single thread is a few orders of magnitude slower than the BWT (FM-index) based tools.
Old 04-19-2012, 07:12 AM   #6
coatespt1848
Junior Member
 
Location: new york

Join Date: Apr 2012
Posts: 7


Thanks a lot for the information. That's pretty amazing---order of a second or better to search three gigs of data for a near match to a string?

As I understand it, BWA is for shorter targets and this is for long targets, but that definitely gives me a ballpark idea that my gizmo is probably not going to be an advance on the state of the art!

I'd be really curious about how the same is done for longer sequences (half a K to a meg), if anyone knows.
Old 04-19-2012, 07:18 AM   #7
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

BWA contains two algorithms, one for short sequences (less than 100bp) and one for longer sequences (100bp to 1Mb or more). Try "bwa bwasw" and see this paper: http://bioinformatics.oxfordjournals.../26/5/589.long
Old 04-19-2012, 07:51 AM   #8
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503

You should be aware of other tools for long sequence matching. E.g., MUMMER is designed for genome-scale comparisons.
Old 04-19-2012, 08:25 AM   #9
coatespt1848
Junior Member
 
Location: new york

Join Date: Apr 2012
Posts: 7

Guys:
Thanks so much for this. This information is difficult to excavate for an outsider. I'll take a look at these sources. The original purpose was near-duplicate detection in web queries, social media, etc., and I thought some of the techniques might translate.

FYI, in case there's any cross-fertilization possible (to put it in bio terms!), I am specifically working on cleaning up the results of ultra-high-speed duplicate detection for Web crawlers. In the abstract, the task sounds a lot like identifying alignment candidates. Hence this discussion, to which several people have kindly given some time.

A Web crawler fetches a web page and then has to find out whether anything very much like it has already been seen among hundreds of billions of other pages (this takes under 100 ms, but with a lot of errors, so the results are very coarse). My algorithm provides a second, highly accurate estimate of the match quality of the candidate duplicates. Fetching the pages and comparing them directly would take millions of times too long, but you can estimate the Levenshtein distance of pairs of pages fairly accurately, in microseconds, using only the metadata, which is only about 0.005 of the original size.
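
For comparison only: a standard off-the-shelf way to get a tiny per-page signature for near-duplicate detection is shingling plus MinHash. It estimates the Jaccard similarity of shingle sets rather than Levenshtein distance, and it is not what my algorithm does, but it shows the general idea of comparing documents through small signatures. A minimal Java sketch:

Code:
import java.util.Arrays;

// MinHash signature over character shingles. Estimates Jaccard similarity of
// two documents' shingle sets; shown only as a familiar point of comparison.
public class MinHashSketch {

    public static long[] signature(String text, int shingleLen, int numHashes, long seed) {
        long[] sig = new long[numHashes];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int i = 0; i + shingleLen <= text.length(); i++) {
            int shingleHash = text.substring(i, i + shingleLen).hashCode();
            for (int h = 0; h < numHashes; h++) {
                // Cheap family of hash functions derived from one seed.
                long v = mix(shingleHash ^ (seed + 0x9E3779B97F4A7C15L * (h + 1)));
                if (v < sig[h]) sig[h] = v;
            }
        }
        return sig;
    }

    // Fraction of matching signature slots approximates the Jaccard similarity.
    public static double estimatedJaccard(long[] sigA, long[] sigB) {
        int same = 0;
        for (int h = 0; h < sigA.length; h++) if (sigA[h] == sigB[h]) same++;
        return (double) same / sigA.length;
    }

    private static long mix(long x) {   // SplitMix64-style finalizer
        x ^= x >>> 30; x *= 0xBF58476D1CE4E5B9L;
        x ^= x >>> 27; x *= 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}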

So I was casting around for bio applications for this by applying the same technique to a sequential search of a genome. But it sounds like the existing techniques might be too good already for it to be worth pursuing. If anyone has any other ideas for the strategy, do let me know!

Thanks again
Old 04-19-2012, 09:23 AM   #10
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

The literature can be dense, especially if you don't understand the basics of text compression; most of us are just glad that we can use the programs.
Old 04-19-2012, 01:45 PM   #11
coatespt1848
Junior Member
 
Location: new york

Join Date: Apr 2012
Posts: 7

All:
The article cited by nilshomer (http://bioinformatics.oxfordjournals.../26/5/589.long) is GREAT! It answered exactly what I was asking. Moreover, it answers it encouragingly, because it reveals the sweet spot for this admittedly eccentric approach. Punch line at the end.

It's not a perfect comparison, but if I read the tables correctly, the algorithm appears to be about as much faster than BLAT as it is slower than BWA-SW for finding appropriate targets. So in its current form, it's slower than the fastest algorithm by a factor of maybe 5.

Happily, the performance numbers I gave assumed no limit on how bad the matches being sought could be. If you're only interested in matches within a bounded edit distance, it's many times faster, which would put it in the same general ballpark as BWA-SW. (And it's just a crude Java implementation on a slower platform, so it might be pretty close even without a bound on match quality.)

The possible sweet spot I see is that, as I surmised, BWA-SW takes significantly more memory than the genome itself, while this algorithm takes only a small fraction of the genome's size.

This means that if you wanted to search, say, hundreds of genomes, it would have a crushing advantage in IO. BWA-SW could realistically only fit about one, possibly two, human genomes in memory, so you'd have to do a tremendous amount of searching to amortize the IO cost of loading each genome.

The algorithm I'm talking about can fit the meta-data for about 200 to 400 genomes into the same amount of memory that BWA-SW uses for one. That means that if you ever have to do searches against hundreds of genomes, BWA-SW would be at an enormous disadvantage.

As a non-biologist, I don't know whether searching many genomes is a real problem. But a quick look suggests that the algorithm could be faster than existing techniques in any case where you were:
  • Doing few searches per session (because of the load time for the BWA-SW data).
  • Searching against multiple genomes.
  • Constrained in the amount of hardware you could apply to the problem. E.g., you could allow searches against 100 genomes on a laptop rather than a jumbo server, or, say, 1000 genomes on a machine big enough to hold a few terabytes of disk.

Something to think about, anyway.

Thanks for all the advice.
Old 04-19-2012, 02:14 PM   #12
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

Actually, BWA-SW uses most of that memory for indexing, not searching, and it is possible to build the indexes in pieces and then merge them on standard-memory machines, which is useful in assembly algorithms. But I am not sure that people really have a need for that in plain searching, so it hasn't been developed yet.
Old 04-19-2012, 11:32 PM   #13
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Mapping, for example, is against one human genome, not hundreds, so there may not be such an advantage. Also, disk IO is not a big problem in mapping, but really it's the random memory access that can be a killer. So if you could reduce the size of one genome such that searching it fits in the L1/2/3 cache then there would be a huge speedup in generating approximate locations.

Another approach for mapping 100s of genomes is only storing the differences between the genomes, assuming they are 99.9% similar (which they are). Having bounds on edit distance is tricky: suppose there are N edits in an L-base read, and there is a hit with N edits, but the true location has N+1 edits. If we search up to N edits, then we have no clue about the quality of the hits (were they truly unique, or was the next best really close in terms of the number of edits?). Specificity is key here.
Old 04-20-2012, 02:59 AM   #14
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319

There may be a need for this kind of thing in metagenomics (e.g., sequencing environmental samples), where you sometimes want to map to multiple genomes (even hundreds of genomes) at once.
Old 04-20-2012, 03:30 AM   #15
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 413

This might be useful in metagenomics (as kopi-o mentioned) if it provided accurate output in the standard SAM alignment format.

Certainly I have problems fitting enough microbial genomes into the 4 GB reference-sequence limit of Bowtie and BWA, given that there are now more than 6-7 GB publicly available.
Old 04-20-2012, 07:24 AM   #16
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

Quote:
Originally Posted by colindaven
This might be useful in metagenomics (as kopi-o mentioned) if it provided accurate output in the standard SAM alignment format.

Certainly I have problems fitting enough microbial genomes into the 4 GB reference-sequence limit of Bowtie and BWA, given that there are now more than 6-7 GB publicly available.

The most recent bwa was supposed to remove the 4GB limit.
Old 04-20-2012, 01:55 PM   #17
coatespt1848
Junior Member
 
Location: new york

Join Date: Apr 2012
Posts: 7
Lots of good advice!

Thanks to all for the thoughtful comments. Replies follow, but I should make one significant correction first. The times I quoted for finding a multi-kilobyte target in C. elegans and the human genome were egregiously in error: I misplaced the decimal point by two places. The actual times were 1/5 second for C. elegans and 6 seconds for the human genome, not 600 seconds. This suggests that for this particular kind of search it is a good order of magnitude faster than BWA-SW, rather than a few times slower. (Whether that matters is another question; you still have to apply a real alignment algorithm to the resulting match.)

Colindaven and kopi-o: metagenomics sounds very much worth investigating. I'll certainly look into this.

nilshomer's and rskr's comments on memory and caching deserve a detailed response. Algorithms that use large amounts of memory interact with the OS and hardware in non-obvious ways. As nilshomer pointed out, hierarchical and random-access data structures tend to have low locality of reference, defeating much of the magic of the TLB and the various hardware caches. Although with this algorithm the entire data structure for, say, the human genome would easily fit in the hardware caches, that turns out to be largely irrelevant. With sequential access, the prefetch hardware generally takes care of populating the caches ahead of need. Possibly just as importantly, the TLB experiences fewer misses because the data is accessed in order, letting the circuitry that anticipates your next access guess correctly more often. Moreover, sequentially accessed data almost always has a very low page-fault rate, because every byte on each page is fully used. So with fully sequential access, the average time-cost of reading a byte of memory can be orders of magnitude lower than in an algorithm that uses hashing or hierarchical data structures.
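
As a crude illustration of the point, here is a toy Java microbenchmark comparing sequential and random access over the same array. It has no JIT warm-up or other rigor, so treat the numbers as illustrative only.

Code:
import java.util.Random;

// Crude comparison of sequential vs. random access over the same data.
// Not a rigorous benchmark; needs a few hundred MB of heap (e.g. -Xmx1g).
public class AccessPatternDemo {
    public static void main(String[] args) {
        int n = 1 << 25;                      // ~32M ints, ~128 MB per array
        int[] data = new int[n];
        int[] order = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) { data[i] = i; order[i] = i; }
        for (int i = n - 1; i > 0; i--) {     // Fisher-Yates shuffle of the access order
            int j = rnd.nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }

        long sum = 0, t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[i];              // sequential access
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[order[i]];       // random-order access
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ms, random: %d ms (sum=%d)%n",
                          (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum);
    }
}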

nilshomer's comment about storing only diffs of highly similar genomes can't be gainsaid. I was thinking of the case where you are looking, say, for matches among distant taxa, which may be a much less common problem.

Anyway, thanks again for the considerable time and thought you folks have put into answering these questions. Maybe it would make sense to take this input and try to come up with an interesting demo---download a couple of dozen genomes and demonstrate finding some obscure gene in a few seconds.

Meanwhile, if anybody comes across any useful applications, either for searching entire genomes, or for rapidly estimating the Levenshtein distance of fragments in the kilobyte to megabyte range, I'd love to hear about it. It's gotta be good for something! I can also be reached as coatespt at g mail.
Old 04-20-2012, 03:38 PM   #18
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by rskr
The most recent bwa was supposed to remove the 4GB limit.
It does handle 4GB or greater!
Old 04-20-2012, 03:56 PM   #19
mchaisso
Member
 
Location: Seattle, WA

Join Date: Apr 2008
Posts: 84

Quote:
Originally Posted by nilshomer
One 5Kb sequence to the human genome in 10 minutes? Try "bwa-sw" or "bowtie2" as a benchmark.
For reference, the blasr method aligns 10 kb, 87%-accurate sequences to the human genome in under 0.5 s per sequence.