I am looking for some guidance on assembling a small bacterial genome. It is turning out to be much less straightforward than I expected.
The expected genome size is approximately 1 Mb with low (~30%) GC content (could this cause emulsion PCR bias, creating gaps?). This estimate is based on 3 different published reference genomes. I have completed two 454 sequencing runs for one isolate: Run 1 gave 28 Mb with a 400 bp average read length, and Run 2 gave 42 Mb with a 445 bp average read length. I thought this would be more than enough for a complete assembly, but apparently not!
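For what it's worth, the raw yield should indeed be plenty: a quick back-of-the-envelope depth calculation (a minimal sketch, assuming the ~1 Mb expected genome size quoted above; the true size of this isolate may differ) shows the two runs combined give roughly 70x coverage, well above what is typically needed for a 454 de novo assembly:

```python
# Rough depth-of-coverage estimate from the run yields described above.
# genome_size is the ~1 Mb *expected* size, an assumption, not a measurement.
genome_size = 1_000_000                # expected genome size (bases)
run_bases = [28_000_000, 42_000_000]   # Run 1 and Run 2 total yields (bases)

coverage = sum(run_bases) / genome_size
print(f"Combined coverage: ~{coverage:.0f}x")  # ~70x
```

So insufficient raw coverage is unlikely to be the problem; gaps are more plausibly explained by repeats or by coverage bias (e.g. the emulsion PCR / low-GC concern mentioned above).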
So far I have only analyzed the data with the 454 software, and I am getting very different results depending on how I go about it.
Using GSReferenceMapper with a single reference genome, the best I can get is 42 large contigs, 56 total contigs, average contig size 21,352 bp, longest contig 104,874 bp, 899,615 total bases. A few of these contigs contain very few reads.
By comparison, using GSReferenceMapper with all 3 reference genomes I get 184 large contigs, 252 total contigs, average contig size 2,153 bp, longest contig 31,324 bp, 416,772 total bases. I would have expected more reference information to give a better mapping, not a worse one. There is, however, a large inversion (spanning almost half the genome) in one of the reference isolates.
The de novo assembler gives me 91 large contigs, 199 total contigs, average contig size 10,432 bp, longest contig 78,810 bp, 971,829 total bases. I am inclined to trust these contigs more, since they do not depend on variable reference genomes.
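One objective way to compare the three assemblies (beyond eyeballing contig counts) is the N50 statistic: the contig length such that contigs of that length or longer contain at least half of the assembly's total bases. A minimal sketch of how it can be computed from a list of contig lengths (the example lengths below are hypothetical, for illustration only):

```python
def assembly_stats(lengths):
    """Basic assembly metrics: contig count, total bases, and N50.

    N50 is the length of the contig at which the cumulative sum of
    sorted (descending) contig lengths first reaches half the total.
    """
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= total / 2:
            return {"contigs": len(lengths), "total": total, "n50": length}

# Hypothetical contig lengths, for illustration only
print(assembly_stats([100, 80, 60, 40, 20]))
# {'contigs': 5, 'total': 300, 'n50': 80}
```

Applied to real contig lengths from each assembly, a higher N50 at a comparable total size generally indicates a more contiguous assembly.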
How should I decide on the best way to assemble this data? Any software recommendations? I am limited to Windows for now, so I understand my choices are limited. Any help is appreciated.