  • Giant alignment, high identity, which model for phylogeny?

    Hello,

    My goal is a phylogeny of multiple isolates, showing me which isolate is closer to which.

    I have an organism for which I did population genomics on isolates from a few distant geographic locations. The genome size is about 7-10 Mb.

    I did de novo assemblies using MIRA for all of my isolates. I picked the best assembly, concatenated all the contigs, and mapped the reads of each of the other isolates onto it to generate a new consensus for each of them.

    Now, because the species is heterozygous, I picked a cutoff value of 85% when calling bases for the consensus. This should cause heterozygous loci to be called as ambiguities. I then took the consensus sequences of all isolates and aligned them using MAUVE. I trimmed out all sites that had ambiguities, thus removing heterozygous sites.
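The consensus cutoff and the ambiguity trimming described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what MIRA or any particular mapper does internally: the function names `call_consensus_base` and `drop_ambiguous_columns` are hypothetical, per-position base counts are assumed to be already available, and the rule used (add bases in order of frequency until the cutoff is reached) is one common way of applying a threshold.

```python
# IUPAC nucleotide codes keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

UNAMBIGUOUS = set("ACGT")


def call_consensus_base(counts, cutoff=0.85):
    """Call one consensus base from read counts, e.g. {"A": 50, "G": 50}.

    Bases are added from most to least frequent until together they
    account for at least `cutoff` of the reads (assumed rule).
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no coverage at this position
    chosen, covered = set(), 0
    # Sort by descending count; break ties alphabetically for determinism.
    for base, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n == 0:
            break
        chosen.add(base)
        covered += n
        if covered / total >= cutoff:
            break
    return IUPAC[frozenset(chosen)]


def drop_ambiguous_columns(seqs):
    """Remove every alignment column containing anything but A, C, G, T.

    Note that this also removes columns with gaps ("-").
    """
    kept = [col for col in zip(*seqs) if set(col) <= UNAMBIGUOUS]
    if not kept:
        return ["" for _ in seqs]
    return ["".join(row) for row in zip(*kept)]
```

With an 85% cutoff, a 50/50 heterozygous position such as `{"A": 50, "G": 50}` comes out as `R`, and the column holding it is then discarded by `drop_ambiguous_columns`.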

    I am left with a very long alignment, still about 7-10 Mb, in which only a few thousand sites show any variability whatsoever, spaced out fairly evenly.

    Now for the phylogeny, I picked a simple F model with 100 bootstrap replicates and estimated the proportion of invariant sites (I) and the gamma shape parameter (G), all in PhyML.

    Any thoughts on this? Some advice would be really helpful: what might I have omitted? Is PhyML the best for this kind of analysis, or should I try Bayesian inference, and if so, MrBayes, PhyloBayes or even beagle? Are there any alternatives to MAUVE?

    Thank you for your help,
    Adrian

  • #2
    I usually run FastTree to get a general feel, and then RAxML and PhyloBayes.
    savetherhino.org


    • #3
      Since these are all the same species, and just isolates, should I use a strict molecular clock?

      Also, does anyone else have experience with heterozygous (50/50) sites in their reference? Is it a good idea to remove them before trying to reconstruct strain relationships?

      Thank you!


      • #4
        Bump, in case anyone has any additional input.


        • #5
          Perhaps you could extract just the informative sites and treat them like SNPs, since the computational burden of analyzing megabases via ML or Bayesian inference is tremendous, and most of the sequence doesn't carry any information anyway.
          Also, the "concatenate contigs (in whatever order and strand orientation they happen to be in the assembly) and map reads of other isolates onto the resulting sequence" part doesn't look great. I'm not sure that gene calling, and therefore distinguishing neutral from non-neutral SNPs, will be reliable with such an approach. In addition, it throws away all data on the real gene order, which can be a valuable phylogenetic marker, and imposes a semi-artifactual one.
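A minimal Python sketch of the "extract the variable sites and use them like SNPs" idea follows. It assumes the alignment is already in memory as a dictionary of equal-length strings; the function name `variable_sites` and the isolate names in the usage note are hypothetical.

```python
UNAMBIGUOUS = set("ACGT")


def variable_sites(alignment):
    """Reduce an alignment to its variable, unambiguous columns.

    alignment: dict mapping isolate name -> aligned sequence.
    Returns (positions, reduced), where positions are 0-based column
    indices in the original alignment and reduced maps each isolate
    name to its SNP-only sequence.
    """
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same aligned length")

    positions = []
    for i, column in enumerate(zip(*seqs)):
        states = set(column)
        # Keep the column only if it has no gaps or ambiguity codes
        # and shows more than one base across the isolates.
        if states <= UNAMBIGUOUS and len(states) > 1:
            positions.append(i)

    reduced = {
        name: "".join(seq[i] for i in positions)
        for name, seq in zip(names, seqs)
    }
    return positions, reduced
```

For example, `variable_sites({"iso1": "ACGTAC", "iso2": "ACGTAT", "iso3": "ATGTAC"})` keeps columns 1 and 5. Keeping the original positions makes it possible to check afterwards how the SNPs are distributed along the reference. One caveat: likelihood-based programs run on variable sites alone generally need an ascertainment-bias correction, otherwise branch lengths tend to be overestimated.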

          PS: what's the point in creating several nearly identical threads? Bump it if nobody answers in a couple of weeks or so.
          Last edited by A_Morozov; 08-27-2013, 10:25 PM.


          • #6
            Hey,

            I'd also say you should try to downsize your data to the most informative sites. To infer those, a good starting point may be GenomeRing. It visualizes differences between genomes in a quite fancy way, so you can easily see in which regions your genomes differ. From there, you could extract the sites which differ in at least, say, 2 genomes, and infer a phylogeny on only those sites, giving you at least an idea of what's going on phylogenetically.

            Best, phil
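One way to read "sites which differ in at least 2 genomes" is as a filter that drops singletons: a column is kept only if at least two different bases are each present in at least two genomes. The Python sketch below implements that reading; the interpretation, the function name `informative_positions` and the parameter `min_genomes` are assumptions, and it does not involve GenomeRing at all.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")


def informative_positions(seqs, min_genomes=2):
    """Return 0-based columns where two or more bases are each shared
    by at least `min_genomes` sequences.

    seqs: list of aligned sequences of equal length. Gaps and
    ambiguity codes are ignored when counting.
    """
    positions = []
    for i, column in enumerate(zip(*seqs)):
        counts = Counter(base for base in column if base in UNAMBIGUOUS)
        shared = [base for base, n in counts.items() if n >= min_genomes]
        if len(shared) >= 2:
            positions.append(i)
    return positions
```

With `min_genomes=2`, a variant seen in a single isolate is ignored, since on its own it says nothing about how the isolates group together.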
            Last edited by sphil; 08-28-2013, 12:25 AM.
