Hello Everyone,
I am in the early stages of a comparative analysis of several strains of E. coli, and I will admit up front that I am not a bioinformatician. We have already published the draft genome sequences (de novo assembly of Illumina 2x101 PE reads with Velvet). Judged by N50 and number of contigs, the assemblies were acceptable, and they gave our lab most of the information we were interested in early on, such as protein-encoding genes and metabolic reconstructions.
However, to answer some more interesting questions about variable phenotypes between strains, we are now performing a comparative genomics analysis. I am concerned about the information that is lost because draft genomes, being split into unordered contigs, lack synteny. Furthermore, I have encountered a few algorithms so far that require fully closed genomes as input (because they take synteny into account), which has led me to ask the following questions for bacterial genomes:
1. Is there a proven pipeline for closing "simple" bacterial genomes (E. coli, where many reference strains have already been closed) with a single set of reads, say 2x101 PE Illumina reads? In this case, I mean complete closure with no Sanger sequencing.
2. Besides synteny, what other information is missing from an unfinished bacterial genome? Stated another way: without closure, what information relevant to a comparative genomics study is impossible or difficult to deduce? (For instance, some genomic/pathogenicity island prediction algorithms require fully closed genomes, presumably because such islands are associated with direct repeats, insertion sequences, etc.) Or is my concern about closing the genomes unwarranted?
For these reasons, I hesitate to move forward with the comparative analysis until the genome assemblies are closed. In an effort to explore closing the draft genomes with the original Illumina reads, I have tried the following pipeline with decent results, although the genome still has ~100 scaffolds containing ~1000 N's:
Velvet de novo assembly of a random subsample of the paired-end reads (optimized for a low number of contigs and a high N50, while keeping coverage above 30x) ----> GapFiller to fill ambiguous bases ----> SIS (Scaffolds from Inversion Signatures) against a closely related, closed reference genome
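For anyone wanting to reproduce the subsampling step, here is a rough sketch of the coverage arithmetic behind the ">30x" threshold. The 5 Mb genome size is an assumed round figure for E. coli, not a measured value for these strains, and the function names are only illustrative.

```python
import math


def estimated_coverage(n_read_pairs, read_length=101, genome_size=5_000_000):
    """Expected mean depth: total sequenced bases divided by genome size.

    genome_size defaults to an assumed ~5 Mb E. coli genome.
    """
    return n_read_pairs * 2 * read_length / genome_size


def pairs_for_coverage(target_coverage, read_length=101, genome_size=5_000_000):
    """Smallest number of read pairs whose expected depth reaches the target."""
    return math.ceil(target_coverage * genome_size / (2 * read_length))


if __name__ == "__main__":
    pairs = pairs_for_coverage(30)
    print(pairs, "read pairs give about", round(estimated_coverage(pairs), 2), "x")
```

This is only the expected mean depth; it says nothing about how evenly the reads are distributed, so a subsample sized exactly at the threshold may still leave low-coverage regions.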
I have also considered:
Bowtie2 to map reads to the aforementioned reference genome ----> SSPACE ----> GapFiller ----> SIS ----> back to Bowtie2
or some permutation of this.
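To compare permutations like these on equal terms, the same numbers quoted above (scaffold count, number of N's, N50) can be computed from each output FASTA. The following is a minimal sketch in plain Python; the function names are my own and not part of any of the tools mentioned.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(lines):
    """Summarize an assembly: sequence count, total bases, N count, and N50."""
    seqs = [seq.upper() for _, seq in read_fasta(lines)]
    lengths = [len(s) for s in seqs]
    return {
        "sequences": len(seqs),
        "total_bases": sum(lengths),
        "n_bases": sum(s.count("N") for s in seqs),
        "n50": n50(lengths),
    }
```

Usage would be along the lines of `with open("scaffolds.fasta") as fh: print(assembly_stats(fh))`, where `scaffolds.fasta` is a placeholder file name. Note that these statistics describe contiguity only; a pipeline can improve N50 while introducing misjoins, so they do not replace checking the result against the reference.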
Has anybody had success in completely closing a "simple" bacterial genome in this manner? If so, what were your strategies?
Many thanks for your assistance.
Best,
Brady