**colindaven** · 03-02-2012, 04:20 AM

This has been an active area of research since the 1980s or even earlier.

There is a whole range of tools in the PHYLIP package, for example. I would use these rather than write your own scripts.

By the way, what do you mean by "align the pyroseq data against itself"? Why not just cluster the raw reads?

Also, you could have a look at the QIIME package.

**murphycj** · 03-02-2012, 11:22 AM

Thanks, I will check out those tools.

I didn't fully explain the background of the problem when I said align the sequences against themselves. I won't be aligning the sequences against themselves. Say I have two sequence datasets. Take sequence x from set A and align it against all sequences in set B. Then form a cluster/group, which contains sequence x and all sequences in set B that are at least 97% similar to it. Then use multiple alignment on that group.
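The grouping step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: similarity is approximated as 1 minus the edit distance divided by the longer sequence length, which is not the same as the percent identity a real aligner reports, and the names `similarity` and `group_for_query` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Approximate similarity in [0, 1]; an assumption, not aligner %ID."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def group_for_query(query: str, candidates: dict, threshold: float = 0.97) -> list:
    """Names of candidate sequences at or above the similarity threshold."""
    return [name for name, seq in candidates.items()
            if similarity(query, seq) >= threshold]


if __name__ == "__main__":
    x = "ACGT" * 25                        # toy 100-base query from set A
    set_b = {
        "b1": "N" + x[1:],                 # one mismatch
        "b2": "N" * 10 + x[10:],           # ten mismatches
        "b3": x,                           # identical
    }
    print(group_for_query(x, set_b))
```

Each resulting group (the query plus its members) would then be passed to a multiple aligner. The quadratic edit-distance loop is fine for a few hundred short reads but would be slow across the whole dataset.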

**gringer** · 03-02-2012, 12:49 PM

I've worked with a pathogen discovery pipeline that has an assembly of ribosomal sequences as its first step (using Geneious, but you could probably use something else), which sounds pretty similar to what you want to do here. Is there something different from assembly that you're wanting to do with these sequences?

**murphycj** · 03-02-2012, 05:09 PM

I'm not familiar with the pipeline that you described, so I am not completely clear on what you mean by "has an assembly of ribosomal sequences as its first step". My guess is that you have sequence reads and you assemble them into multiple, different ribosomal sequences. You then assigned taxonomies to as many of the assembled sequences as you could.

If I am correct, that is not what I am doing. All my sequence fragments target a specific region of the 16S rRNA gene. I have two data sets (call them set A and set B). My first step is to use the Ribosomal Database Project (RDP) classifier to assign taxonomies to the sequences in set A. I described my second step in my second post above. Take a sequence x in set B, align against all sequences in set A, group the sequences based on a 97% similarity threshold, and then perform multiple alignment on each group. My objective is to "assign" the taxonomies from the sequences in set A to the sequences in set B.

I just want a tool that will quickly tell me which sequence in set A is most "similar" to sequence x in set B. That is how I will determine the taxa assignment.
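As a rough illustration of that "most similar sequence" lookup, here is a minimal sketch. It swaps in shared k-mer content (Jaccard index) as a fast stand-in for alignment-based similarity, so the scores are not percent identities. The function names, the `min_score` cutoff, and the toy taxon labels are all assumptions made for the example.

```python
def kmers(seq: str, k: int = 8) -> set:
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared fraction of two k-mer sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_hit(query: str, references: dict, k: int = 8):
    """Return (name, score) of the reference sharing the most k-mers."""
    query_kmers = kmers(query, k)
    best_name, best_score = None, -1.0
    for name, seq in references.items():
        score = jaccard(query_kmers, kmers(seq, k))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def assign_taxonomy(set_b: dict, set_a: dict, taxonomy_a: dict,
                    min_score: float = 0.5, k: int = 8) -> dict:
    """Copy the taxon of the best set A hit to each set B sequence.

    Sequences whose best score is below min_score are left as None.
    """
    assigned = {}
    for name, seq in set_b.items():
        hit, score = best_hit(seq, set_a, k)
        assigned[name] = taxonomy_a.get(hit) if score >= min_score else None
    return assigned


if __name__ == "__main__":
    set_a = {"a1": "ACGTTGCAAGGCTTAACCGGTTAACGT",
             "a2": "TTTTGGGGCCCCAAAATTTTGGGGCCCC"}
    taxonomy_a = {"a1": "taxon_1", "a2": "taxon_2"}   # placeholder labels
    set_b = {"b1": "ACGTTGCAAGGCTTAACCGGTTAACGA",     # near-copy of a1
             "b2": "N" * 16}                          # matches nothing
    print(assign_taxonomy(set_b, set_a, taxonomy_a))
```

A k-mer screen like this is mainly useful for narrowing down candidates quickly; the top hits could then be confirmed with a proper pairwise alignment before the taxonomy is transferred.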

**gringer** · 03-03-2012, 01:02 AM

Originally posted by murphycj:

> I'm not familiar with the pipeline that you described, so I am not completely clear on what you mean by "has an assembly of ribosomal sequences as its first step". My guess is that you have sequence reads and you assemble them into multiple, different ribosomal sequences. You then assigned taxonomies to as many of the assembled sequences as you could.

Yes, this is what was done. How many sequences do you have? If it's few enough that MUSCLE works in a reasonable amount of time, then I don't see a reason why it wouldn't be appropriate.
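For running MUSCLE on one group at a time, a minimal sketch might look like the following. It assumes a `muscle` executable on the PATH that accepts the MUSCLE 3.8 style `-in`/`-out` options (MUSCLE 5 uses `-align`/`-output` instead), and the helper names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def write_fasta(records: dict, path) -> None:
    """Write {name: sequence} to a FASTA file, wrapping lines at 60 bases."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            for start in range(0, len(seq), 60):
                handle.write(seq[start:start + 60] + "\n")


def muscle_command(in_path, out_path, executable: str = "muscle") -> list:
    """Argument list for a MUSCLE 3.8 style run (assumed installed)."""
    return [executable, "-in", str(in_path), "-out", str(out_path)]


def align_group(records: dict, workdir, executable: str = "muscle") -> Path:
    """Write one group to FASTA, align it, and return the output path."""
    workdir = Path(workdir)
    in_path = workdir / "group.fasta"          # hypothetical file names
    out_path = workdir / "group.aln.fasta"
    write_fasta(records, in_path)
    # check=True raises CalledProcessError if MUSCLE exits with an error
    subprocess.run(muscle_command(in_path, out_path, executable), check=True)
    return out_path
```

Keeping the command construction in its own function makes it easy to change the flags for a different MUSCLE version or to substitute another aligner.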

> Take a sequence x in set B, align against all sequences in set A, group the sequences based on a 97% similarity threshold, and then perform multiple alignment on each group. My objective is to "assign" the taxonomies from the sequences in set A to the sequences in set B.
>
> I just want a tool that will quickly tell me which sequence in set A is most "similar" to sequence x in set B. That is how I will determine the taxa assignment.

Considering alternatives to MUSCLE, this is somewhat similar to the "merging sequence sets" function of minimus2 (or minimus2-blat) from AMOS, which I've typically been using for merging two different assemblies:

http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus2

You can define the overlap %ID (in your case, it would be 0.97), the degree of allowed overlap (I guess you'll want something close to the full length of the 16S region), and the maximum error from the consensus sequence.

Once the sets are merged, you can use Hawkeye (also in AMOS) to view the assembled consensus contigs. For each sequence in one set (e.g. set B), it can tell you which sequences in the other set (e.g. set A) also align to that sequence. There are also command-line tools for producing lists of the matched sequences.

The workflow outlined on the SourceForge page does this using the REFCOUNT option to distinguish sets, but if you also want to combine similar sequences within set B, then don't use this option and an all-vs-all assembly will be done.

**murphycj** · 03-05-2012, 07:04 PM

I have about 500,000 sequences, but I will only align up to a few hundred sequences at a time.

The tools you are describing sound as if they may make parsing the output a little harder, but I will check them out anyway. Thanks!

Multiple alignment distances