
  • PacBio reads: dissect variants by phylogenetic distance

    Hi all,
    I have the following problem. I have a set of PacBio reads, already filtered and known to cover a specific region of interest.
    I'd like to see whether there are clusters of variants in this dataset. I tried TribeMCL to separate them by sequence similarity, but even at the finest granularity it produces only one cluster.
    So I performed a multiple sequence alignment using clustalw2 and built the corresponding NJ trees. From the trees it looks like we have three different variants. Is there a way to parse the Newick tree and split the IDs at a specific node level? The sequences (IDs) are my leaves.
    It would be enough to say... "at the third branching level, put every ID from branch 1 into file 1, branch 2 into file 2..." Re-extracting the sequences by ID to create a new FASTA file, and building a multiple sequence alignment on them to get a consensus, is easy. I mainly work in Perl. I read the manual of Bio::Phylo, but to be honest, I couldn't find a helpful subroutine.

    First: What do you think about this idea of creating variant consensus sequences...?
    Second: Is anybody aware of a newick or similar tree parser to get what I need?

    Thanks in advance!

  • #2
    Out of curiosity: did you use FASTA format reads to build the trees? What is the length distribution of the reads (especially within a clade)? If the reads vary in length, how are you planning to generate a consensus?



    • #3
      Yes, FASTA format. The read length varies from 1500 bp to 3000 bp. I see two approaches: one, creating a multiple sequence alignment with fixed length; two, taking the longest sequence as the seed.
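
For the first approach, a column-wise majority vote over the aligned reads is one simple way to get a consensus. This is only a sketch: the function name `majority_consensus` and the `min_coverage` cutoff are assumptions for illustration, not something clustalw2 provides. Columns that are mostly gaps are skipped, which matters when read lengths range from 1500 to 3000 bp.

```python
from collections import Counter


def majority_consensus(aligned, min_coverage=0.5):
    """Majority-vote consensus over equal-length aligned sequences.

    Columns where fewer than `min_coverage` of the reads carry a base
    (i.e. the column is mostly gaps) are dropped from the consensus.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        bases = [base.upper() for base in column if base != "-"]
        if len(bases) < min_coverage * len(aligned):
            continue  # too few reads cover this column
        # On a tie, the base seen first in the column wins.
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


example = ["ACGT-A", "ACGTTA", "AC-T-A"]
result = majority_consensus(example)
print(result)
```

With very uneven read lengths, the ends of the alignment are covered by few reads, so the `min_coverage` value has a large effect there.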



      • #4
        Are these metagenomic 16S reads? And are they CCS (self-corrected)?



        • #5
          I am not certain the first approach (MSA with fixed length) will work, especially considering the variable length of the sequences. The proof is in the pudding, so it may be worth trying. Did you rename the sequences? I would have thought clustal would not like the long sequence identifiers from PacBio.

          I wonder if you can use any of the Iso-Seq clustering tools built into SMRT Portal.

          Here is a thread to parse labels from newick trees. From there you should be able to use faSomeRecords (from Kent utilities) to extract the sequences.



          • #6
            @ Brian,
            they are self-corrected, and no, they are neither metagenomic nor 16S reads: they are inverse PCR reads from one organism.

            @ GenoMax,
            clustal had no problems with the long names. I'm done with everything: the alignment and the tree building. I just need a tree parser that extracts all leaves originating from a specified node.
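
A minimal sketch of such a parser in plain Python, with no libraries. It assumes unquoted leaf labels that contain no parentheses, commas, or colons; the function names (`parse_newick`, `leaves`, `clades_at_depth`) and the example tree are made up for illustration. It cuts the tree a given number of branchings below the top node and returns the leaf IDs under each resulting clade.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts: {'name': str, 'children': list}."""
    text = text.strip().rstrip(";")
    pos = 0

    def read_node():
        nonlocal pos
        node = {"name": "", "children": []}
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node["children"].append(read_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # consume ")"
                break
        # Read "label:branch_length" up to the next delimiter; keep the label only.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        node["name"] = text[start:pos].split(":")[0].strip()
        return node

    return read_node()


def leaves(node):
    """Return the leaf names below `node`, in tree order."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaves(child))
    return names


def clades_at_depth(node, depth):
    """Return the clades `depth` branchings below `node`.

    A leaf reached before `depth` is kept as a clade of its own.
    """
    if depth == 0 or not node["children"]:
        return [node]
    clades = []
    for child in node["children"]:
        clades.extend(clades_at_depth(child, depth - 1))
    return clades


tree = parse_newick("((r1:0.1,r2:0.2):0.05,(r3:0.1,r4:0.1):0.2,r5:0.3);")
groups = [leaves(clade) for clade in clades_at_depth(tree, 1)]
print(groups)
```

Each list in `groups` can then be written to its own ID file for sequence extraction. One caveat: NJ trees are typically unrooted, so which clades appear at "the third branching level" depends on where the top node of the Newick file happens to sit.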



            • #7
              I have had good success clustering self-corrected PacBio 16S reads using Dedupe (part of the BBMap package) with these commands:

              reformat.sh in=reads_of_insert.fastq out=filtered.fq minlen=1420 maxlen=1640 maq=20 qin=33

              dedupe.sh in=filtered.fq csf=stats_e26.txt outbest=best_e26.fq qin=33 -Xmx30g am=f ac=f fo c mcs=3 k=27 mo=1420 ow unpigz cc pto nam=4 e=26 pattern=cluster_%.fq dot=graph.dot


              However, those settings are for sequences around 1500 bp long and would need to be changed for longer sequences: in particular "maxlen=1640" in the reformat step, and probably "maq=20" (which removes sequences with over 1% average expected error rate) and "e=26" in dedupe (which allows an edit distance of 26), if your longer reads have substantially more errors than that.
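
To make the link between "maq=20" and the 1% figure concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 is 0.01. The sketch below reads the filter as a threshold on the mean per-base error probability, as described above. The function names are hypothetical, and the exact computation inside reformat.sh may differ.

```python
def mean_error_rate(quality_string, offset=33):
    """Mean per-base error probability of a FASTQ quality string (Phred+33 by default)."""
    probabilities = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return sum(probabilities) / len(probabilities)


def passes_maq(quality_string, min_avg_quality=20, offset=33):
    """True if the mean expected error rate is no worse than the Phred threshold."""
    max_error = 10 ** (-min_avg_quality / 10)  # Q20 -> 0.01
    return mean_error_rate(quality_string, offset) <= max_error


# '5' encodes Q20, 'I' encodes Q40, '+' encodes Q10 in Phred+33.
print(mean_error_rate("5555"))
print(passes_maq("IIII"), passes_maq("+I"))
```

Averaging the probabilities rather than the raw scores is the point of the example: a read with one Q10 base and one Q40 base has a mean score of 25, but a mean error rate of about 5%.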

              This builds clusters based on transitive reachability by overlaps. So if A overlaps B by at least 1420 bp with at most 26 edits, and B overlaps C under those same criteria, then A, B, and C end up in the same cluster. The reformat step removes low-quality reads and chimeras.
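
The transitive grouping can be sketched with a small union-find structure. This is not how Dedupe is implemented internally; it only illustrates the clustering rule, and it assumes the pairwise overlaps (length and edit count) have already been computed by some other means. The function name and the input tuples are hypothetical.

```python
def cluster_by_overlap(read_ids, overlaps, min_overlap=1420, max_edits=26):
    """Group reads into clusters connected by qualifying overlaps.

    `overlaps` is an iterable of (read_a, read_b, overlap_length, edits).
    Returns a list of clusters (lists of read IDs), largest first.
    """
    parent = {read_id: read_id for read_id in read_ids}

    def find(x):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, length, edits in overlaps:
        if length >= min_overlap and edits <= max_edits:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for read_id in read_ids:
        clusters.setdefault(find(read_id), []).append(read_id)
    return sorted(clusters.values(), key=len, reverse=True)


reads = ["A", "B", "C", "D"]
pairs = [
    ("A", "B", 1500, 10),  # qualifies
    ("B", "C", 1450, 26),  # qualifies, so A, B, C are linked
    ("C", "D", 1500, 40),  # too many edits, D stays alone
]
clusters = cluster_by_overlap(reads, pairs)
print(clusters)
```

Note that A and C never need to overlap each other directly, which is why a loose edit threshold can chain distinct variants into one cluster.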

              Anyway, you might try that if the traditional approaches don't work.



              • #8
                I'll give it a try! BBMap has already been so useful for other tasks. Thanks for developing this tool!



                • #9
                  Is it possible to do the clustering with fasta files? I lost the quality information a few steps before...



                  • #10
                    Since you have read identifiers why not go back to the original file and pull those reads out in fastq format?
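
A rough sketch of pulling reads back out of the original FASTQ by identifier. It assumes plain four-line FASTQ records and that the ID is the first whitespace-delimited token after "@"; the function name and file paths are placeholders.

```python
def extract_fastq(fastq_path, wanted_ids, out_path):
    """Copy the FASTQ records whose ID is in `wanted_ids`; return how many were kept."""
    wanted = set(wanted_ids)
    kept = 0
    with open(fastq_path) as src, open(out_path, "w") as dst:
        while True:
            # Assumes exactly four lines per record (no wrapped sequences).
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            fields = record[0][1:].split()
            if fields and fields[0] in wanted:
                dst.writelines(record)
                kept += 1
    return kept
```

If the FASTA headers were shortened or altered along the way, the IDs would need to be mapped back to the original names first.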

                    Otherwise, you can probably start with your original data file (which is what @Brian is probably referring to in the example above).



                    • #11
                      I do not use the full dataset at this step. Yes, then I have to recover



                      • #12
                        You probably know this, but a great alternative for clustering FASTA sequences is cd-hit or cd-hit-est.



                        • #13
                          Why do you think it will work better than TribeMCL? Just giving it a try?



                          • #14
                            Originally posted by uloeber:
                            "Is it possible to do the clustering with fasta files? I lost the quality information a few steps before..."

                            You can, you just won't be able to filter out the low-quality sequences.

