Hello, everyone.
I was recently introduced to the area of genomic assembly by one of the bioinformatics faculty at the university where I work in IT. I have a lot of experience in handling strings of data from 30-odd years of writing word games such as Scrabble, and I'm also the tech guy who supports our local MPI cluster. It occurred to me that this gene assembly area and my wordgame/spelling checker and corrector hobby had a lot in common, so I decided to see if there was anything I could bring to the field from my previous experience.
I think I've been successful and am currently working on a proof of concept program to test my ideas. The part of the assembly problem which interests me is the stage where you take all the reads from your sequencer and identify the overlaps between them. My academic colleagues are using velveth to do this part of the problem, and it appears to be quite time-consuming for them. I offered to write something for them that would do this task a lot more quickly :-)
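For anyone who, like me, is new to the task, here is a minimal sketch of what "finding overlaps between reads" means in the simplest exact-match case. To be clear, this is not the algorithm I mention above, just a naive prefix-index illustration; the names find_overlaps and min_overlap are made up for the example, and real reads contain sequencing errors that this ignores.

```python
from collections import defaultdict

def find_overlaps(reads, min_overlap=3):
    """Return (i, j, length) for every exact match between a suffix of
    reads[i] and a prefix of reads[j], with length >= min_overlap."""
    # Index each read by its first min_overlap characters.
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        if len(read) >= min_overlap:
            prefix_index[read[:min_overlap]].append(j)

    overlaps = []
    for i, read in enumerate(reads):
        # Consider every suffix of reads[i] that is long enough.
        for start in range(len(read) - min_overlap + 1):
            seed = read[start:start + min_overlap]
            for j in prefix_index.get(seed, ()):
                if i == j:
                    continue  # skip a read overlapping itself
                suffix = read[start:]
                # The seed matched; confirm the whole suffix is a prefix of reads[j].
                if reads[j].startswith(suffix):
                    overlaps.append((i, j, len(suffix)))
    return overlaps

if __name__ == "__main__":
    example = ["ACGTAC", "TACGGA", "GGATTT"]
    for i, j, n in find_overlaps(example, min_overlap=3):
        print(f"read {i} overlaps read {j} by {n} bases")
```

The index lookup means each suffix is only compared against reads that share its first min_overlap characters, rather than against every other read.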
I should have my proof of concept code ready for release in a few days and I'm looking for people who've worked in this area to have a look at it and tell me if indeed I've found something new that's worth exploring, or if I've made some newbie mistake and gone down a blind alley that everyone who looks at this problem has tried and discarded :-)
If my algorithm turns out to be new & useful in this area, I'd like to hack out a quick academic paper on it and then release the code under an open source license such as BSD.
To give you a ball-park estimate of what I believe the performance of this algorithm will be, I'm looking at something like finding the overlaps between about 30M reads of length 37, from a 100Mbp genome, in about 15 minutes.
The algorithm is linear in the number of reads and scales linearly as you add more processors and RAM for larger datasets. (My code uses MPI and OpenMP.)
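As a back-of-envelope check on what those ball-park figures imply (plain arithmetic on the numbers above, not a measurement; the function and variable names are just for illustration):

```python
def throughput_estimate(reads, read_length, genome_size, minutes):
    """Return (total_bases, coverage, reads_per_second) for a run."""
    total_bases = reads * read_length          # bases sequenced in total
    coverage = total_bases / genome_size       # average depth over the genome
    reads_per_second = reads / (minutes * 60)  # overall rate needed
    return total_bases, coverage, reads_per_second

if __name__ == "__main__":
    total, cov, rate = throughput_estimate(
        reads=30_000_000, read_length=37,
        genome_size=100_000_000, minutes=15,
    )
    print(f"{total:,} bases, {cov:.1f}x coverage, {rate:,.0f} reads/second")
```

So the estimate corresponds to roughly 11x coverage and an overall rate of about 33,000 reads per second across the whole cluster.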
I have to confess that, having tackled this problem for only a couple of weeks and having deliberately tried _not_ to read too much of the literature so as not to be too influenced by current practice, I don't know the jargon well enough yet to discuss this fluently with experts, and I'm not even sure what the proper name is for the task that I'm working on! Is overlap detection called alignment, perhaps, or is that something to do with the later phase where the overlaps are used to create a graph from which the genome is extracted?
Anyway, I have enjoyed the heck out of working on this stuff over the last 10 days or so and look forward to getting to know you folks working in this field; and I'm hoping I can get some peer review from y'all to evaluate the algorithms and code that I'm working on.
best regards,
Graham Toal <[email protected]>
(maintainer of the wordgame-programmers group software archives)
A Scotsman living in the south of Texas...