I'm working on a string-processing algorithm for an application unrelated to biology, but I think it might be useful for processing genome data. It's been surprisingly hard to find out! Hoping some bio-computing sophisticate can shed light on this.
Problem the algorithm solves now: searching huge strings to find matches to corrupted targets. You have a target sequence of anywhere from, say, half a kilobyte up to megabytes (e.g., a sequence of base pairs or any other character data). You want to find partial matches to the target in strings of any length (e.g., multiple gigabytes). Caveat: this algorithm does not deal well with short targets (say, under a couple of hundred characters), but it's fast at finding longer strings that are significantly corrupted or changed.
One reason it seems like it might be useful: a linear-time pre-processing step results in metadata that is tiny compared to the input. Depending on parameters, maybe 1/50 to 1/1000 of the original size. Once the pre-processing is done, you only deal with the metadata for searches, so you can search against hundreds of gigabytes of genomes in memory, even on a small platform.
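To make the memory claim concrete, here's a back-of-envelope check using the 1/50 and 1/1000 ratios quoted above (the 300 GB corpus size is just an illustrative assumption):

```python
# Rough index-size estimate from the stated metadata ratios.
corpus_bytes = 300 * 10**9            # assume ~300 GB of genome text
for ratio in (1/50, 1/1000):
    index_gb = corpus_bytes * ratio / 10**9
    print(f"metadata at {ratio:.4f} of input: ~{index_gb:.1f} GB")
# -> ~6 GB at 1/50 and ~0.3 GB at 1/1000, i.e. it fits comfortably
#    in RAM on an ordinary workstation.
```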
This is not an alignment algorithm per se; it only finds the locations where substrings that will give a good alignment exist (more precisely, it finds substrings whose edit distance from the target is less than some parameter you give it).
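For clarity, here is what I mean by that criterion, written as a brute-force reference. This is just the standard semi-global edit-distance dynamic program, far too slow for real genomes; it only defines the answer the fast search is supposed to return:

```python
def best_substring_edit_distance(text: str, target: str) -> tuple[int, int]:
    """Smallest edit distance between `target` and any substring of `text`,
    plus the (1-based) end position in `text` of the best-matching substring.
    O(len(text) * len(target)) -- a reference definition, not a fast search."""
    m = len(target)
    prev = list(range(m + 1))   # row 0: aligning target prefixes to an empty substring
    best_dist, best_end = m, 0
    for i, ch in enumerate(text, start=1):
        curr = [0]              # column 0 is free: a match may start anywhere in text
        for j in range(1, m + 1):
            cost = 0 if ch == target[j - 1] else 1
            curr.append(min(prev[j] + 1,          # text char left unmatched
                            curr[j - 1] + 1,      # target char left unmatched
                            prev[j - 1] + cost))  # match or substitution
        if curr[m] < best_dist:
            best_dist, best_end = curr[m], i
        prev = curr
    return best_dist, best_end

# e.g. best_substring_edit_distance("GGGACGTTAGGG", "ACGTTTAG") -> (1, 10):
# the substring "ACGTTAG", ending after the 10th text character, is one edit away.
```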
My question is several-fold:
- Am I correct that finding the location of the alignment cheaply is half the battle? (for some definition of "half".)
- If this is true, how FAST does finding the align-able substring have to be to be useful? I believe that most procedures for finding these substrings are significantly worse than linear--this might be a false assumption.
- A big-O answer would be useful. Answers in terms of wall-clock time might also help. E.g., do existing algorithms take seconds, minutes, or hours to find matches in a gigabyte of sequence?
So, in other words, my question is: would a function like the following have applications, and if so, how fast would it need to be in order to be useful? (There's a rough code sketch of this interface after the list.) I'm making no assumptions about the alphabet; it could be {G,A,T,C} or any other set of characters.
- You give it a target string, plus a parameter that indicates the minimum quality you require of matches.
- The match-quality parameter is actually a maximum edit distance from the target.
- You also have to give it the names of the genomes you want searched.
- You get back a list of indexes into the genomes where the matches occur.
- Of course this assumes you've already pre-processed the genomes in question. If not, naturally, you need to supply them the first time.
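Here's the kind of signature I have in mind, as promised above. All of the names (build_index, search, Hit) are placeholders for illustration, not an existing tool or library:

```python
from typing import NamedTuple

class Hit(NamedTuple):
    genome: str       # which pre-processed genome the hit is in
    position: int     # index into that genome where the matching region starts

def build_index(genome_name: str, sequence: str) -> None:
    """One-time, roughly linear pre-processing pass: store the compact
    metadata ("index") for this genome so later searches never need to
    touch the raw sequence again."""
    ...

def search(target: str, max_edit_distance: int,
           genome_names: list[str]) -> list[Hit]:
    """Return every location in the named (already-indexed) genomes where
    some substring is within max_edit_distance of the target."""
    ...

# Typical call, assuming the genomes were indexed earlier:
# hits = search(target_seq, max_edit_distance=250,
#               genome_names=["c_elegans", "GRCh38"])
```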
The algorithmic complexity is hard to state simply because there are several parameters, but in a nutshell: computing the metadata is more or less linear (you can compute it approximately as fast as you can read the input), and an actual search is linear in the size of the metadata, though it varies significantly with target size.
A crude demo on one processing thread searches the 100 Mbp C. elegans genome for 5 kbp sequences in about 3 seconds, returning the location of any sub-sequence within approximately a given edit distance of the target. Extrapolating, the same search against the human genome takes about 10 minutes. In practice, you can use as many hardware cores as you want, so an N-core server runs about N times as fast.
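As a rough sanity check on what that means for a bigger machine (using the single-thread human-genome figure quoted above; the 16-core count is just an example):

```python
# Back-of-envelope parallel scaling from the stated single-thread time.
single_thread_seconds = 10 * 60   # ~10 minutes against the human genome
cores = 16                        # illustrative assumption
print(f"~{single_thread_seconds / cores:.0f} s on {cores} cores")  # ~38 s
```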
If you got this far, you're patient indeed! Can you offer any insight, even if it's only to say "don't waste your time---it's a solved problem!"