AdrianP 08-07-2013 09:01 AM

Giant alignment, high identity, which model for phylogeny?

My goal is a phylogeny of multiple isolates, showing me which isolate is closer to which.

I have an organism for which I did population genomics on isolates from a few distant geographic locations. The genome size is about 7-10 Mb.

I did de novo assemblies using MIRA for all of my isolates. I picked the best assembly, concatenated all the contigs, and mapped the reads of each of the other isolates on top of it to generate a new consensus for each of them.

Now, because the species is heterozygous, I picked a cutoff value of 85% when calling bases for the consensus. This should get heterozygous loci called as ambiguities. I then took the consensus sequences of all isolates and aligned them using MAUVE. I trimmed out all sites that had ambiguities, thus removing heterozygous sites.
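
For anyone following along, the 85% cutoff described above can be sketched roughly as below. This is only an illustration and not what any particular mapper or consensus caller does internally: the function name `call_site`, the string-of-read-bases input, and the rule of reporting the two most frequent bases as the ambiguity code are all assumptions.

```python
from collections import Counter

# IUPAC codes for two-base ambiguities
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def call_site(bases, cutoff=0.85):
    """Call one consensus position from the read bases covering it.

    bases  -- string of read bases at this position (hypothetical input)
    cutoff -- minimum fraction the top base needs to be called outright
    """
    # Ignore anything that is not a plain nucleotide (gaps, N, etc.)
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N"  # no usable coverage
    ranked = counts.most_common()
    top_base, top_count = ranked[0]
    if top_count / total >= cutoff:
        return top_base
    # Below the cutoff a second base must exist; report the pair
    # of most frequent bases as an IUPAC ambiguity (an assumption).
    second_base = ranked[1][0]
    return IUPAC_PAIRS[frozenset((top_base, second_base))]
```

With this rule a 50/50 A/G site comes out as R, while a site with 9 of 10 reads supporting A is still called A.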

I am left with a very long alignment, still about 7-10 Mb, with only a few thousand variable sites, spaced out fairly evenly.

Now for the phylogeny, I used PhyML with a simple F model, 100 bootstrap replicates, and estimated I and G (the proportion of invariant sites and the gamma shape parameter).

Any thoughts on this? Some advice would be really helpful: what might I have omitted? Is PhyML the best for this kind of analysis, or should I try Bayesian inference, and if so, MrBayes, PhyloBayes or even BEAGLE? Are there any alternatives to MAUVE?

rhinoceros 08-07-2013 09:25 AM

I usually run FastTree to get a general feeling, and then RAxML and PhyloBayes.

AdrianP 08-07-2013 07:00 PM

Since these are all the same species, and just isolates, should I use a strict molecular clock?

Also, does anyone else have experience with heterozygous (50/50) sites in their reference? Is it a good idea to remove them before trying to reconstruct strain relationships?

AdrianP 08-27-2013 03:14 PM

A_Morozov 08-27-2013 11:23 PM

Perhaps you could extract just the informative sites and treat them like SNPs, since the computational burden of analyzing megabases via ML or Bayesian inference is tremendous, and most of the sequence doesn't carry any information anyway.
Also, the "concatenate contigs (in whatever order and strand orientation they happen to be in the assembly) and map reads of other isolates onto the resulting sequence" part doesn't look great. I'm not sure gene calling, and therefore distinguishing neutral from non-neutral SNPs, will be reliable with such an approach. In addition, it throws away all data on real gene order, which can be a valuable phylogenetic marker, and imposes a semi-artifactual one.
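
As a rough sketch of the "extract the informative sites" idea, something like the following would reduce the alignment to its variable columns. It assumes the alignment is already loaded as a dict of name to aligned sequence; the function name `variable_sites` and that input format are made up for illustration. Columns with gaps or ambiguity codes are dropped, matching the trimming described in the first post.

```python
def variable_sites(alignment):
    """Keep only unambiguous columns in which the isolates differ.

    alignment -- dict mapping isolate name to aligned sequence
                 (hypothetical input format, all sequences equal length)
    Returns (kept column indices, reduced alignment dict).
    """
    names = list(alignment)
    seqs = [alignment[n].upper() for n in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    keep = []
    for i in range(length):
        column = {s[i] for s in seqs}
        # Drop columns with gaps/ambiguities, keep those with >1 state
        if column <= set("ACGT") and len(column) > 1:
            keep.append(i)

    reduced = {n: "".join(s[i] for i in keep) for n, s in zip(names, seqs)}
    return keep, reduced
```

One caveat worth keeping in mind: branch lengths estimated from variable sites alone tend to be inflated unless the tree program corrects for the invariant sites that were left out, so check whether your tool of choice offers such a correction.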

sphil 08-28-2013 01:23 AM


I'd also say you should try to downsize your data to the most informative sites. To find those, a good starting point may be GenomeRing. It visualizes differences between genomes in a quite fancy way, so you can easily see in which regions your genomes differ. From there, you could extract the sites which differ in at least, say, 2 genomes, and infer a phylogeny on only those sites, giving you at least an idea of what's going on phylogenetically.
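
One way to read "sites which differ in at least 2 genomes" is the usual parsimony-informative criterion: a column is kept when at least two different bases each occur in at least `min_count` genomes. The sketch below follows that reading, which is an interpretation and not necessarily what was meant; the function name and the list-of-strings input are assumptions.

```python
from collections import Counter

def informative_sites(seqs, min_count=2):
    """Return indices of columns where at least two different bases
    are each present in at least min_count sequences.

    seqs -- list of aligned sequences of equal length (hypothetical input)
    """
    keep = []
    for i, column in enumerate(zip(*seqs)):
        counts = Counter(b.upper() for b in column)
        # Skip columns containing gaps or ambiguity codes
        if any(b not in "ACGT" for b in counts):
            continue
        shared = [b for b, c in counts.items() if c >= min_count]
        if len(shared) >= 2:
            keep.append(i)
    return keep
```

Setting `min_count=1` falls back to keeping every variable column, including singletons that only affect terminal branch lengths.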

