Hi all,
I have the following problem. I have a set of PacBio reads, already filtered and known to cover a specific region of interest.
I'd like to see whether there are clusters of variants in this dataset. I tried to use TribeMCL to dissect them based on sequence similarity. However, even with the finest granularity it creates only one cluster.
So I performed a multiple sequence alignment using clustalw2 and built the corresponding NJ trees. From the trees it looks like we have three different variants. Is there a way to parse the Newick tree and split the IDs at a specific node level? The sequences (IDs) are the leaves.
It would be enough to say... "at the third branching level, put every ID from branch 1 into file 1, every ID from branch 2 into file 2..." Re-extracting the sequences by ID into a new FASTA file and building a multiple sequence alignment on each group to get a consensus is the easy part. I mainly work in Perl. I read the Bio::Phylo manual but, to be honest, couldn't find a subroutine that does this.
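To make the request concrete, here is a minimal sketch of the split I have in mind. It is in Python rather than Perl and is meant purely as an illustration. The function names (`parse_newick`, `clades_at_depth`, `write_id_files`) and the `branch` file prefix are made up for this example and do not come from Bio::Phylo or any other library. It assumes a plain Newick string with unquoted IDs and optional branch lengths, and it takes the "level" from however the tree happens to be rooted in the file.

```python
def parse_newick(text):
    """Parse a Newick string into nested lists; leaves are ID strings.

    Assumption: unquoted labels, optional ':length' suffixes, no comments.
    """
    # Remove line breaks/spaces that tree writers insert, and the final ';'.
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        # Keep only the name; drop any branch length after ':'.
        return text[start:pos].split(":")[0]

    def read_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [read_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(read_node())
            pos += 1  # consume ')'
            read_label()  # skip internal label / branch length
            return children
        return read_label()

    return read_node()


def leaves(node):
    """Return all leaf IDs below a node, in tree order."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def clades_at_depth(tree, depth):
    """Group leaf IDs by the subtree they fall into `depth` branchings below the root.

    A leaf reached before that depth is kept as a group of its own.
    """
    level = [tree]
    for _ in range(depth):
        next_level = []
        for node in level:
            if isinstance(node, str):
                next_level.append(node)
            else:
                next_level.extend(node)
        level = next_level
    return [leaves(node) for node in level]


def write_id_files(groups, prefix="branch"):
    """Write one ID per line to <prefix>1.txt, <prefix>2.txt, ... (hypothetical naming)."""
    for i, ids in enumerate(groups, start=1):
        with open(f"{prefix}{i}.txt", "w") as handle:
            handle.write("\n".join(ids) + "\n")
```

With something like this, `write_id_files(clades_at_depth(parse_newick(tree_text), 3))` would give one ID list per branch at the third level. Since the NJ tree is unrooted, the groups at a given level will change if the tree is re-rooted.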
First: What do you think about this idea of creating variant consensus sequences...?
Second: Is anybody aware of a Newick (or similar) tree parser that does what I need?
Thanks in advance!