I've assembled several microbial genomes with PacBio SMRT Portal 2.2.0 using various settings and assemblers (HGAP2 and HGAP3 on the same data), with coverage ranging from about 90x to 185x. I've been following the finishing recommendations on the GitHub wiki to check for assembly artefacts.
Although I get nicely closed contigs with these methods, I do see about 50-100 differences between the various final assemblies, mainly indels, with no apparent preference for a particular assembly strategy.
Some of these indels affect gene prediction and annotation, particularly through frameshifts or truncated proteins; again, as far as I've been able to see, without a preference for a certain strategy. I have some examples where the 'correct' gene is predicted on a default HGAP3 assembly but the HGAP2 + RS_resequencing consensus contains frameshifts compared to highly similar genes in related organisms, and vice versa.
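To make the kind of difference I mean concrete, here is a minimal sketch of how indels between two assemblies could be tallied. It assumes the two consensus sequences have already been aligned into gapped strings of equal length by some external aligner; the function names and the toy sequences are hypothetical and not part of any PacBio tool. An indel whose length is not a multiple of three is only a potential frameshift, since it matters only if it falls inside a coding region.

```python
def indel_runs(aligned_a, aligned_b):
    """Return (start_column, length, gapped_sequence) for each gap run.

    Both inputs are rows of a pairwise alignment, with '-' marking gaps.
    gapped_sequence is 'a' or 'b', naming the sequence that carries the gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    runs = []
    i, n = 0, len(aligned_a)
    while i < n:
        a_gap = aligned_a[i] == "-"
        b_gap = aligned_b[i] == "-"
        if a_gap != b_gap:
            # Extend the run while the same sequence keeps the gap.
            j = i
            while (j < n
                   and (aligned_a[j] == "-") == a_gap
                   and (aligned_b[j] == "-") == b_gap):
                j += 1
            runs.append((i, j - i, "a" if a_gap else "b"))
            i = j
        else:
            i += 1
    return runs


def summarize(runs):
    """Count indels and how many have a length that is not a multiple of 3."""
    shifting = sum(1 for _, length, _ in runs if length % 3 != 0)
    return {
        "indels": len(runs),
        "potential_frameshifts": shifting,
        "in_frame": len(runs) - shifting,
    }


# Toy alignment (made-up sequences): one 1-bp gap and one 3-bp gap.
assembly_a = "ATGAAA-CCGGGTTT"
assembly_b = "ATGAAATCCG---TT"
runs = indel_runs(assembly_a, assembly_b)
summary = summarize(runs)
```

In this toy case the 1-bp indel would be flagged as a potential frameshift and the 3-bp indel as in-frame; intersecting the columns with predicted gene coordinates would be the next step.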
What would be a good strategy to assess the minute differences between these different assemblies in a structured way?