**rskr** · 04-20-2012, 07:24 AM

Originally posted by colindaven:

This might be useful in metagenomics (as kopi-o mentioned) if it provided accurate output in the standard SAM alignment format.

Certainly I have problems fitting enough microbial genomes into the 4GB reference sequence limitation of Bowtie and BWA, given that more than 6-7 GB are now publicly available.

The most recent bwa was supposed to remove the 4GB limit.

**coatespt1848** · 04-20-2012, 01:55 PM

Lots of good advice!

Thanks to all for the thoughtful comments. Replies follow. I should make one significant correction, though. The times I quoted for finding a multi-kilobyte target in C. elegans and the human genome were egregiously in error: I was off by two decimal places. The actual times were 1/5 second for C. elegans and 6 seconds for the human genome, not 600 seconds. This suggests that for this particular kind of search, it is a good order of magnitude faster than BWA-SW, rather than a few times slower. (Whether that matters is another question. You still have to apply a real alignment algorithm to the resulting match.)

Colindaven and kopi-o: metagenomics sounds very much worth investigating. I'll certainly look into this.

nilshomer's and rskr's comments on memory and caching require a detailed response. Algorithms that use large amounts of memory interact with the OS and hardware in non-obvious ways. As nilshomer pointed out, hierarchical and random-access data structures tend to have low locality of reference, defeating much of the magic of the TLB and the various hardware caches. Although with this algorithm the entire data structure for, say, the human genome would easily fit in the hardware caches, that turns out to be largely irrelevant. With sequential access, hardware working off to the side generally takes care of populating the caches before the data is needed. Possibly equally important, the TLB also experiences fewer faults because the input data is in order, allowing the circuitry that anticipates your next access to guess correctly more often. Moreover, sequentially accessed data almost always has a very low page fault rate, as every byte on each page is fully used. So with fully sequential access, the average time cost of reading a byte of memory can be orders of magnitude lower than in an algorithm using hashing or hierarchical data structures.
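To make the locality argument concrete, here is a minimal sketch (not the algorithm discussed in this thread) that reads the same bytes once in index order and once in a shuffled order. The function names, buffer size, and seed are all made up for illustration. Both passes do identical arithmetic, so any timing gap comes from the access pattern. In Python the interpreter overhead mutes the effect considerably; a C version of the same loop would be expected to show a much larger gap.

```python
import random
import time


def make_visit_orders(n, seed=0):
    """Return (sequential, shuffled) index lists, each covering 0..n-1 once."""
    sequential = list(range(n))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)
    return sequential, shuffled


def checksum(data, order):
    """Sum data[i] for every i in order; the work is the same for any order."""
    total = 0
    for i in order:
        total += data[i]
    return total


def time_scan(data, order):
    """Return (checksum, elapsed seconds) for one pass over data."""
    start = time.perf_counter()
    total = checksum(data, order)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    n = 5_000_000  # hypothetical size; increase it to exceed your caches
    data = bytearray(i % 251 for i in range(n))
    sequential, shuffled = make_visit_orders(n)
    seq_sum, seq_time = time_scan(data, sequential)
    rnd_sum, rnd_time = time_scan(data, shuffled)
    assert seq_sum == rnd_sum  # same bytes, same total
    print(f"sequential: {seq_time:.3f} s, shuffled: {rnd_time:.3f} s")
```

The absolute numbers depend heavily on the machine and on the Python build, so treat the output as a rough indication only.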

nilshomer's comments about storing only diffs of highly similar genomes can't be gainsaid. I was thinking of the case where you are looking, say, for matches among distant taxa, which may be a much less common problem.

Anyway, thanks again for the considerable time and thought you folks have put into answering these questions. Maybe it would make sense to take this input and try to come up with an interesting demo: download a couple of dozen genomes and demonstrate finding some obscure gene in a few seconds.

Meanwhile, if anybody comes across any useful applications, either for searching entire genomes, or for rapidly estimating the Levenshtein distance of fragments in the kilobyte to megabyte range, I'd love to hear about it. It's gotta be good for something! I can also be reached as coatespt at g mail.
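For readers who want a baseline to compare an estimate against, below is a sketch of the textbook dynamic-programming Levenshtein distance. This is the exact algorithm, not the fast estimator described in this thread, and the function name is just illustrative. It runs in time proportional to the product of the two lengths, which is why an exact answer becomes impractical for fragments in the megabyte range and a fast estimate could be attractive.

```python
def levenshtein(a, b):
    """Exact edit distance (insert, delete, substitute; each costs 1)."""
    # Keep the shorter string on the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("ACGT", "AGT"))
```

Only two rows of the table are kept at a time, so memory grows with the shorter string while time still grows with the product of both lengths.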

**nilshomer** · 04-20-2012, 03:38 PM

Originally posted by rskr:

The most recent bwa was supposed to remove the 4GB limit.

It does handle 4GB or greater!

**mchaisso** · 04-20-2012, 03:56 PM

Originally posted by nilshomer:

One 5Kb sequence to the human genome in 10 minutes? Try "bwa-sw" or "bowtie2" as a benchmark.

For reference, the blasr method aligns 10 kb sequences of 87% accuracy to the human genome in under about 0.5 s per sequence.
