Old 12-19-2014, 03:42 PM   #94
Brian Bushnell

I rather like the "mm=t" option because it's similar to allowing a mismatch, but with no speed or memory penalty. It's also possible to reduce memory consumption when allowing mismatches by using the "speed" or "rskip" flag. "speed=8", for example, would hash only 50% of the kmer space, so it would halve memory consumption; "speed=12" would hash only 25%, cutting memory consumption to a quarter. Of course these reduce sensitivity, but "speed=12 hdist=1" would probably be more sensitive than "speed=0 hdist=0" if there are a lot of mismatches.
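To make the arithmetic concrete, here is a rough Python sketch (not BBDuk's actual code). It assumes the "speed" value means "skip speed/16 of the kmer space", which matches the 50% and 25% figures above, and it models "mm=t" as simply replacing the middle base with a wildcard. The function names are made up for illustration.

```python
def retained_fraction(speed: int) -> float:
    """Fraction of kmer space still hashed, ASSUMING speed skips speed/16 of it."""
    if not 0 <= speed <= 15:
        raise ValueError("speed is assumed to range from 0 to 15")
    return (16 - speed) / 16


def mask_middle(kmer: str) -> str:
    """Replace the middle base with a wildcard (a conceptual stand-in for mm=t)."""
    mid = len(kmer) // 2
    return kmer[:mid] + "N" + kmer[mid + 1:]


if __name__ == "__main__":
    for speed in (0, 8, 12):
        print(f"speed={speed}: about {retained_fraction(speed):.0%} of kmers hashed")
    # Two kmers that differ only at the middle base collapse to the same key,
    # so one stored entry covers that mismatch at no extra cost.
    print(mask_middle("ACGTA") == mask_middle("ACTTA"))
```

Memory should scale roughly with the retained fraction, since only that share of reference kmers gets stored.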

There is no reason for me not to generate read kmers with mismatches in them, other than the speed penalty. But that speed penalty is hefty: a 31mer has 31 x 3 = 93 single-mismatch variants to look up, so allowing 1 mismatch might make the program 93x slower (particularly if most reads do not match the reference; not so much if they mostly do).
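As a rough illustration of where the 93x comes from, this Python sketch enumerates every single-substitution variant of a read kmer and looks each one up in a set of reference kmers. It is only a toy model of the idea, not how BBDuk is implemented; the function names and the plain Python set standing in for the hash table are assumptions.

```python
def hamming1_neighbors(kmer: str, alphabet: str = "ACGT"):
    """Yield every kmer that differs from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def matches_with_one_mismatch(kmer: str, reference_kmers: set) -> bool:
    """Check the exact kmer first, then fall back to its mismatch variants."""
    if kmer in reference_kmers:
        return True  # a hit exits early, so matching reads stay cheap
    return any(n in reference_kmers for n in hamming1_neighbors(kmer))


if __name__ == "__main__":
    read_kmer = "A" * 31
    print(len(list(hamming1_neighbors(read_kmer))))  # 31 positions x 3 alternate bases
    reference = {"A" * 30 + "C"}
    print(matches_with_one_mismatch(read_kmer, reference))
```

The early exit on an exact hit is why the penalty is worst when most reads do not match the reference: a non-matching kmer has to try all of its variants before giving up.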

I plan to add that capability, though I'm not convinced that it will be viable on large datasets (raw reads). However, it WOULD be viable for matching a handful of sequences to a very large reference.