  • Multiple alignment distances

    I have 60 bp pyrosequencing data that targets the 16S rRNA gene to discover bacterial communities.

    My first step is to pairwise align the pyrosequencing data against itself, then cluster the sequences using a 97% sequence similarity threshold. Then, for a particular cluster, I perform a multiple alignment.

    My question is the following: for sequence X in a particular multiple alignment, is there a tool that will find the most 'similar' sequence to X in that alignment (i.e. the one at the smallest distance from X in the tree constructed by the multiple aligner)?

    I've written some scripts in Perl, but I'm concerned about their robustness. I use MUSCLE for multiple alignment and use the tree it constructs to obtain the 'similarity' measures between sequences.

    Thanks
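
A minimal sketch of the lookup asked about above, assuming the aligner's tree is available as a Newick string with branch lengths. It finds the leaf with the smallest patristic distance (sum of branch lengths along the connecting path) to a query leaf. The function names are made up for illustration, and the parser is deliberately simple: it does not handle quoted labels or comments. Note that an aligner's guide tree is built to order the alignment, so these distances are only a rough similarity measure.

```python
import re


def parse_newick(newick):
    """Parse a simple Newick string into parent links, branch lengths and leaf ids."""
    text = "".join(newick.split())          # drop whitespace and line breaks
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    parent = {0: None}                      # node id -> parent id (0 is the root)
    length = {0: 0.0}                       # node id -> branch length to its parent
    label = {}                              # node id -> label text, if any
    has_child = set()
    current = 0

    def new_node(par):
        node = len(parent)
        parent[node] = par
        length[node] = 0.0
        has_child.add(par)
        return node

    for tok in tokens:
        if tok == "(":
            current = new_node(current)             # first child of current node
        elif tok == ",":
            current = new_node(parent[current])     # next sibling
        elif tok == ")":
            current = parent[current]               # back up to the internal node
        elif tok == ";":
            break
        else:                                       # "name:length", "name" or ":length"
            name, _, bl = tok.partition(":")
            if name:
                label[current] = name
            if bl:
                length[current] = float(bl)

    # Only labelled nodes without children count as leaves (sequences).
    leaves = {label[n]: n for n in parent if n not in has_child and n in label}
    return parent, length, leaves


def patristic(parent, length, a, b):
    """Sum of branch lengths on the path between nodes a and b."""
    dist_from_a = {}
    node, d = a, 0.0
    while node is not None:                 # record distance from a to each ancestor
        dist_from_a[node] = d
        d += length[node]
        node = parent[node]
    node, d = b, 0.0
    while node not in dist_from_a:          # climb from b to the common ancestor
        d += length[node]
        node = parent[node]
    return d + dist_from_a[node]


def nearest_leaf(newick, query):
    """Return (name, distance) of the leaf closest to `query`; ties broken by name."""
    parent, length, leaves = parse_newick(newick)
    q = leaves[query]
    dist, name = min(
        (patristic(parent, length, q, n), other)
        for other, n in leaves.items() if other != query
    )
    return name, dist
```

For example, `nearest_leaf("((A:0.1,B:0.2):0.05,C:0.3);", "A")` picks `B`, since the path A to B is shorter than the path A to C.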

  • #2
    This has been an active area of research since the 1980s, or even earlier.

    There are plenty of tools for this in the PHYLIP package, for example. I would use these rather than write your own scripts.

    By the way: what do you mean by "align the pyroseq data against itself"? Why not just cluster the raw reads?

    Also, you could have a look at the QIIME package.

    • #3
      Thanks, I will check out those tools.

      I didn't fully explain the background of the problem when I said I align the sequences against themselves. I won't be aligning the sequences against themselves. Say I have two sequence datasets. Take sequence x from set A and align it against all sequences in set B. Then form a cluster/group that contains sequence x and all sequences in set B that are at least 97% similar to it. Then use multiple alignment on that group.
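
The grouping step described in this post can be sketched as below. This is an illustration under stated assumptions, not the poster's scripts: identity is defined here as 1 minus the edit distance divided by the longer sequence length (alignment tools define percent identity in different ways), and the names `edit_distance`, `identity` and `group_for` are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion from a
                           cur[j - 1] + 1,              # insertion into a
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def identity(a, b):
    """Assumed identity measure: 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def group_for(query, candidates, threshold=0.97):
    """Return the candidates (name -> sequence) at or above the identity threshold."""
    return {name: seq for name, seq in candidates.items()
            if identity(query, seq) >= threshold}
```

With this definition, reads of about 60 bp pass a 97% cutoff only if they differ by at most one substitution or indel (59/60 is about 98.3%, 58/60 is about 96.7%), so the threshold is quite strict for short reads.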

      • #4
        I've worked with a pathogen discovery pipeline that has an assembly of ribosomal sequences as its first step (using Geneious, but you could probably use something else), which sounds pretty similar to what you want to do here. Is there something different from assembly that you're wanting to do with these sequences?

        • #5
          I'm not familiar with the pipeline that you described, so I'm not completely clear on what you mean by "has an assembly of ribosomal sequences as its first step". My guess is that you have sequence reads and you assemble them into multiple, different ribosomal sequences. You then assigned taxonomies to as many of the assembled sequences as you could.

          If I am correct, that is not what I am doing. All my sequence fragments target a specific region of the 16S rRNA gene. I have two data sets (call them set A and set B). My first step is to use the Ribosomal Database Project (RDP) classifier to assign taxonomies to the sequences in set A. I described my second step in my second post above. Take a sequence x in set B, align it against all sequences in set A, group the sequences based on a 97% similarity threshold, and then perform multiple alignment on each group. My objective is to "assign" the taxonomies from the sequences in set A to the sequences in set B.

          I just want a tool that will quickly tell me which sequence in set A is most "similar" to sequence x in set B. That is how I will determine the taxa assignment.
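
One quick way to shortlist the most similar set A sequence is k-mer voting, sketched below. This is a swapped-in heuristic, not an alignment and not something proposed in this thread: it counts shared k-mers instead of computing percent identity, so a hit should still be confirmed by alignment. All function names, the default `k=8` and the `min_shared` cutoff are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k=8):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference, k=8):
    """Map each k-mer to the names of the reference (set A) sequences containing it."""
    index = defaultdict(set)
    for name, seq in reference.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def best_hit(query, index, k=8):
    """Return (name, shared_kmer_count) of the best reference, or None if no k-mer matches."""
    votes = Counter()
    for km in kmers(query, k):
        for name in index.get(km, ()):
            votes[name] += 1
    if not votes:
        return None
    # Highest count wins; ties are broken by name so the result is deterministic.
    return max(votes.items(), key=lambda kv: (kv[1], kv[0]))


def assign_taxonomy(query, index, taxonomy, k=8, min_shared=1):
    """Copy the taxonomy label of the best hit, or None if support is too weak."""
    hit = best_hit(query, index, k)
    if hit is None or hit[1] < min_shared:
        return None
    return taxonomy[hit[0]]
```

Here `taxonomy` would be a dictionary from set A sequence names to the labels assigned by the classifier. The index is built once for set A, so each set B query costs only a few dictionary lookups.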

          • #6
            Originally posted by murphycj
            I'm not familiar with the pipeline that you described, so I'm not completely clear on what you mean by "has an assembly of ribosomal sequences as its first step". My guess is that you have sequence reads and you assemble them into multiple, different ribosomal sequences. You then assigned taxonomies to as many of the assembled sequences as you could.
            Yes, this is what was done. How many sequences do you have? If it's few enough that MUSCLE works in a reasonable amount of time, then I don't see a reason why it wouldn't be appropriate.

            Take a sequence x in set B, align against all sequences in set A, group the sequences based on a 97% similarity threshold, and then perform multiple alignment on each group. My objective is to "assign" the taxonomies from the sequences in set A to the sequences in set B.

            I just want a tool that will quickly tell me which sequence in set A is most "similar" to sequence x in set B. That is how I will determine the taxa assignment.
            Considering alternatives to MUSCLE, this is somewhat similar to the "merging sequence sets" function of minimus2 (or minimus2-blat) from AMOS, which I've typically been using for merging two different assemblies:

            AMOS is a collection of tools and class interfaces for the assembly of DNA reads. The package includes a robust infrastructure, modular assembly pipelines, and tools for overlapping, consensus generation, contigging, and assembly manipulation.


            You can define the overlap %ID (in your case, it would be 0.97), the degree of allowed overlap (I guess you'll want something close to the full length of the 16S region), and the maximum error from the consensus sequence.

            Once the sets are merged, you can use Hawkeye (also in AMOS) to view the assembled consensus contigs. For each sequence in one set (e.g. set B), it can tell you which sequences in the other set (e.g. set A) also align to that sequence. There are also command-line tools for producing lists of the matched sequences.

            The workflow outlined on the SourceForge page does this using the REFCOUNT option to distinguish sets, but if you also want to combine similar sequences within set B, then don't use this option and an all-vs-all assembly will be done.

            • #7
              I have about 500,000 sequences, but I will only align up to a few hundred sequences at a time.

              The tools you are describing sound as if they may make parsing the output a little harder, but I will check them out anyway. Thanks!
