Are there any tools capable of performing multi sequence alignment and tree construction that can handle up to 100,000 sequences?
Originally posted by brachysclereid: Are there any tools capable of performing multi sequence alignment and tree construction that can handle up to 100,000 sequences?
To do what you are asking, you really should cluster the sequences first, then align the smaller sets.
MUSCLE's manual recommends using the UCLUST tool in USEARCH (http://www.drive5.com/usearch/) to reduce the number of alignments needed. And if you contact MUSCLE's author (http://www.drive5.com/muscle/about.htm), he mentions that he is working on something that leverages MUSCLE and USEARCH to deal with huge numbers of sequence alignments (without overly sacrificing accuracy).
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
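To make the "cluster first" idea concrete, here is a minimal Python sketch of greedy centroid clustering. It is not UCLUST itself: the similarity score uses `difflib.SequenceMatcher` as a crude stand-in for alignment identity, and the function names and the 0.9 threshold are assumptions chosen for illustration. It would also be far too slow for 100,000 real sequences; it only shows the logic.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for real alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first centroid it matches, else start a new cluster.

    seqs: dict mapping sequence name -> sequence string.
    Returns a dict mapping centroid name -> list of member names.
    """
    clusters = {}
    # Longest sequences first, so centroids tend to cover their members.
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        for centroid in clusters:
            if similarity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            # No existing centroid is similar enough: this sequence founds a cluster.
            clusters[name] = [name]
    return clusters


toy = {
    "a": "ACGTACGTACGTACGTACGT",
    "b": "ACGTACGTACGTACGTACGA",
    "c": "TTTTTGGGGGCCCCCAAAAA",
}
result = greedy_cluster(toy)
```

You would then align only the centroids (or each cluster separately), which is how clustering cuts down the amount of alignment work.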
-
Are your sequences all from the same locus, e.g. amplicon sequenced? If so, the usual approach to building a phylogeny involves first taking a representative subset, say 500 sequences, and building a profile HMM using HMMER3, or building an SCFG using Infernal if the sequences are RNA with known secondary structure. With your HMM or SCFG in hand you can then quickly construct a multiple alignment of all 100k sequences (see hmmalign). You can then feed the alignment to Morgan Price's FastTree or one of the other programs that can construct a phylogeny in subquadratic time.
That approach works well if you care only about substitutions in the evolutionary history (this is what most people want since they usually hold the richest evolutionary signal). If you care about indel histories you will have to do a full multiple sequence alignment with something like MUSCLE, MAFFT or another tool. I am not aware of any indel history inference tools that can operate on such large datasets. Maybe someone else knows of such methods?
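The profile HMM route described above can be sketched as a short Python helper that only prints the shell commands, so nothing is run by accident. The file names, the `prefix` argument, the choice of MAFFT for the initial subset alignment, and the `esl-reformat` conversion step are my assumptions, not part of the advice above. Please check every flag against the versions you have installed, and check that the converted alignment uses gap characters your tree builder accepts.

```python
import random
import shlex


def pick_representatives(names, k=500, seed=0):
    """Return up to k sequence names chosen uniformly at random (fixed seed)."""
    names = list(names)
    if len(names) <= k:
        return names
    return random.Random(seed).sample(names, k)


def pipeline_commands(subset_fasta, all_fasta, prefix="locus"):
    """Build the shell commands for subset -> HMM -> full alignment -> tree."""
    q = shlex.quote
    subset_aln = q(f"{prefix}.subset.aln.fasta")
    hmm = q(f"{prefix}.hmm")
    full_sto = q(f"{prefix}.full.sto")
    full_afa = q(f"{prefix}.full.afa")
    tree = q(f"{prefix}.tree")
    return [
        # 1. Align the small representative subset (any aligner would do).
        f"mafft --auto {q(subset_fasta)} > {subset_aln}",
        # 2. Build a profile HMM from that alignment.
        f"hmmbuild {hmm} {subset_aln}",
        # 3. Align all sequences to the profile (Stockholm output by default).
        f"hmmalign -o {full_sto} {hmm} {q(all_fasta)}",
        # 4. Convert Stockholm to aligned FASTA for the tree builder.
        f"esl-reformat afa {full_sto} > {full_afa}",
        # 5. Build an approximate maximum-likelihood tree (nucleotide data).
        f"FastTree -nt {full_afa} > {tree}",
    ]


cmds = pipeline_commands("subset.fasta", "all.fasta")
for cmd in cmds:
    print(cmd)
```

Drop `-nt` from the FastTree step for protein sequences. For structured RNA, the hmmbuild and hmmalign steps would be replaced by the corresponding Infernal tools.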
-
building a cladogram with 100,000 sequences
Thanks for all the advice. It will take a while to sort through all the possibilities. Perhaps clustering first will be a good approach.
I'll make a post if I can get something to work. In the meantime, other suggestions would be greatly appreciated!