  • Multiple alignment distances

    I have 60 bp pyrosequencing data that targets the 16S rRNA gene to discover bacterial communities.

    My first step is to pairwise align the pyrosequencing data against itself, then cluster the sequences using a 97% sequence similarity threshold. Then, for a particular cluster, I perform a multiple alignment.

    My question is the following: for sequence X in a particular multiple alignment, is there a tool that will find the most 'similar' sequence to sequence X in that alignment (i.e. the one at the shortest distance in the tree constructed by the multiple aligner)?

    I've written some scripts in Perl, but I'm concerned about their robustness. I use MUSCLE for multiple alignment and use the tree it constructs to obtain the 'similarity' measures between sequences.
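
    To make the question concrete, here is a minimal sketch of what such a tool would compute. It is an assumption-laden stand-in, not what MUSCLE does: instead of the distance in the aligner's tree, it uses plain percent identity over the aligned columns, skipping any column where either sequence has a gap. The function names and the toy alignment are made up for illustration.

```python
from typing import Dict, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse aligned FASTA text into {id: sequence}; id is the first word of the header."""
    seqs: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def identity(a: str, b: str) -> float:
    """Fraction of identical columns between two aligned rows.

    Assumption: columns where either row has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def nearest_in_alignment(aligned: Dict[str, str], query: str) -> Tuple[str, float]:
    """Return (id, identity) of the sequence most similar to `query`."""
    others = [name for name in aligned if name != query]
    if not others:
        raise ValueError("alignment contains no sequence other than the query")
    best = max(others, key=lambda name: identity(aligned[query], aligned[name]))
    return best, identity(aligned[query], aligned[best])


# Toy alignment (hypothetical data), in the aligned-FASTA layout MUSCLE writes.
EXAMPLE_ALIGNMENT = """\
>x
ACGT-ACGT
>a
ACGTTACGT
>b
ACGA-ACGA
>c
TTTT-TTTT
"""
```

    Calling `nearest_in_alignment(parse_fasta(EXAMPLE_ALIGNMENT), "x")` picks sequence `a`. Identity and tree distance can rank neighbours differently, so this is only a sanity check against the tree-based scripts, not a replacement for them.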

    Thanks

  • #2
    This has been an active area of research since the 1980s, if not earlier.

    The PHYLIP package, for example, contains a whole set of tools for this. I would use those rather than write your own scripts.

    By the way, what do you mean by "align the pyroseq data against itself"? Why not just cluster the raw reads?

    Also, you could have a look at the QIIME package.



    • #3
      Thanks, I will check out those tools.

      I didn't fully explain the background of the problem when I said I align the sequences against themselves. I won't be aligning the sequences against themselves. Say I have two sequence datasets. I take sequence x from set A and align it against all sequences in set B. I then form a cluster/group that contains sequence x and all sequences in set B that are at least 97% similar to it, and run a multiple alignment on that group.



      • #4
        I've worked with a pathogen discovery pipeline that has an assembly of ribosomal sequences as its first step (using Geneious, but you could probably use something else), which sounds pretty similar to what you want to do here. Is there something different from assembly that you're wanting to do with these sequences?



        • #5
          I'm not familiar with the pipeline that you described, so I'm not completely clear on what you mean by "has an assembly of ribosomal sequences as its first step". My guess is that you have sequence reads and you assemble them into multiple, different ribosomal sequences. You then assigned taxonomies to as many of the assembled sequences as you could.

          If I'm correct, that is not what I'm doing. All my sequence fragments target a specific region of the 16S rRNA gene. I have two data sets (call them set A and set B). My first step is to use the Ribosomal Database Project (RDP) classifier to assign taxonomies to the sequences in set A. I described my second step in my second post above. Take a sequence x in set B, align against all sequences in set A, group the sequences based on a 97% similarity threshold, and then perform multiple alignment on each group. My objective is to "assign" the taxonomies from the sequences in set A to the sequences in set B.

          I just want a tool that will quickly tell me which sequence in set A is most "similar" to sequence x in set B. That is how I will determine the taxa assignment.
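
          A rough sketch of that best-hit assignment, to show what I'm after. Assumptions: similarity is defined here as 1 minus the edit distance divided by the longer read length, which stands in for a real pairwise aligner; the function names, sequence IDs, and taxon labels are all hypothetical; and the all-against-all loop is only practical for small groups, not for the full data set.

```python
from typing import Dict, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: 1 - edit_distance / length of the longer read."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def assign_taxonomy(
    query: str,
    set_a: Dict[str, Tuple[str, str]],  # id -> (sequence, taxonomy from RDP)
    threshold: float = 0.97,
) -> Optional[Tuple[str, str, float]]:
    """Return (id, taxonomy, similarity) of the best set A hit, or None if below threshold."""
    best_id, best_sim = None, -1.0
    for seq_id, (seq, _taxon) in set_a.items():
        sim = similarity(query, seq)
        if sim > best_sim:
            best_id, best_sim = seq_id, sim
    if best_id is None or best_sim < threshold:
        return None
    return best_id, set_a[best_id][1], best_sim


# Hypothetical 60 bp reads and labels, for illustration only.
SET_A = {
    "a1": ("ACGT" * 15, "Bacteroides"),
    "a2": ("TTGA" * 15, "Lactobacillus"),
}
QUERY_X = "T" + ("ACGT" * 15)[1:]  # same as a1 except for one substitution
```

          With 60 bp reads the threshold is tight: one mismatch gives 59/60 (about 98.3%) and passes, while two mismatches give 58/60 (about 96.7%) and fail, so `assign_taxonomy(QUERY_X, SET_A)` returns the `a1` label and a read with no close match returns `None`.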



          • #6
            Originally posted by murphycj
            I'm not familiar with the pipeline that you described, so I'm not completely clear on what you mean by "has an assembly of ribosomal sequences as its first step". My guess is that you have sequence reads and you assemble them into multiple, different ribosomal sequences. You then assigned taxonomies to as many of the assembled sequences as you could.
            Yes, this is what was done. How many sequences do you have? If it's few enough that MUSCLE works in a reasonable amount of time, then I don't see a reason why it wouldn't be appropriate.

            Take a sequence x in set B, align against all sequences in set A, group the sequences based on a 97% similarity threshold, and then perform multiple alignment on each group. My objective is to "assign" the taxonomies from the sequences in set A to the sequences in set B.

            I just want a tool that will quickly tell me which sequence in set A is most "similar" to sequence x in set B. That is how I will determine the taxa assignment.
            As for alternatives to MUSCLE, this is somewhat similar to the "merging sequence sets" function of minimus2 (or minimus2-blat) from AMOS, which I've typically used for merging two different assemblies:

            AMOS is a collection of tools and class interfaces for the assembly of DNA reads. The package includes a robust infrastructure, modular assembly pipelines, and tools for overlapping, consensus generation, contigging, and assembly manipulation.


            You can define the overlap %ID (in your case, it would be 0.97), the degree of allowed overlap (I guess you'll want something close to the full length of the 16S region), and the maximum error from the consensus sequence.

            Once the sets are merged, you can use Hawkeye (also in AMOS) to view the assembled consensus contigs. For each sequence in one set (e.g. set B), it can tell you which sequences in the other set (e.g. set A) also align to that sequence. There are also command-line tools for producing lists of the matched sequences.

            The workflow outlined on the SourceForge page does this using the REFCOUNT option to distinguish the sets, but if you also want to combine similar sequences within set B, then don't use this option and an all-vs-all assembly will be done.



            • #7
              I have about 500,000 sequences, but I will only align up to a few hundred sequences at a time.

              The tools you are describing sound as if they may make parsing the output a little harder, but I will check them out anyway. Thanks!
