I'm interested in running baseline benchmarks on several short-read alignment tools to see how they behave as machine specs change (CPU type, core count, clock speeds, RAM speeds, cache sizes, etc.). There can be all kinds of subtle effects, especially on large 24- or 32-core servers, which have non-uniform memory access (NUMA) issues of their own.
I'm hoping to run Bowtie, SOAP, and Maq as my baseline tools. I'm not really comparing the three tools against each other; I'm more interested in hardware effects on short-read alignment in general. I'll probably focus on Bowtie just because it's so powerful (and much faster!).
My problem is (as a CS guy, not a bio guy) deciding what test data to use as my baseline. I don't want to make my own synthetic data; I want to run a typical problem that real users of these tools submit.
Could someone recommend where I could find, or how I could create, a test suite of data? It would probably consist of a few tens of millions of short reads and one or more larger reference databases to align them against. I would probably run multiple cases, perhaps four runs allowing k = 0, 1, 2, 3 mismatches.
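For concreteness, here is a sketch of the benchmark matrix I have in mind. It only echoes the commands (a dry run); the index name `hg_index` and read file `reads.fq` are placeholders, and I'm assuming Bowtie's `-v` flag, which caps end-to-end mismatches at a value between 0 and 3:

```shell
#!/bin/sh
# Dry-run sketch of the planned benchmark sweep: one Bowtie run per
# mismatch setting k = 0..3. Replace the echo with the real command
# (and "hg_index"/"reads.fq" with a real index and read set) to run it.
for k in 0 1 2 3; do
  echo "bowtie -v $k -t hg_index reads.fq /dev/null"
done
```

Timing each invocation (e.g. with `-t` or an external `time`) across different thread counts and machines would then expose the hardware sensitivities I'm after.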
I notice the 2008 Bowtie paper takes samples from the 1000 Genomes Project, trims them to 35 bases, and aligns them against the human genome reference. Is this a good test case, typical of real use of aligners? Would typical users use longer input sequences, shorter ones, a mix? Again, I'm just looking for typical workloads where the software's speed is measurable and I can see where the hardware sensitivities are.
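If trimming to 35 bases is indeed representative, the preprocessing step is simple. A minimal sketch, assuming a plain four-line-per-record FASTQ file (the filenames are placeholders):

```shell
#!/bin/sh
# Trim the sequence and quality lines of 4-line FASTQ records to 35 bp,
# mirroring the 2008 Bowtie paper's setup. Header and "+" lines pass
# through unchanged. Usage: trim.sh < in.fq > out_35bp.fq
awk 'NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, 35); next } { print }'
```

This keeps the benchmark input uniform, which should make timing differences attributable to the hardware rather than to read-length variation.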
I appreciate any help in getting these baseline benchmarks run on my hardware!