Thread: 454 assembly
01-19-2009, 08:47 AM   #2
Senior Member
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178


You are absolutely on the right track in wanting to do a single, unified assembly of the data. We do exactly the type of analysis you are trying on a very regular basis. We haven't used Newbler for transcript assembly in quite a while. Here is the procedure we use:

Clean the hell out of your raw sequences: trim polyA (or polyT) tails, vector/adapter sequences, and low-quality and low-complexity regions. It certainly helps to know what procedures, kits and adapters were used to create the cDNA library for 454 sequencing, so that you can limit the screening steps to just those. We first run cross_match to do the vector screening. The screened output is then fed to SeqClean, a pipeline originally created at TIGR for cleaning EST sequences.
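To make the polyA/polyT part of the cleaning step concrete, here is a minimal Python sketch. It is only an illustration of the idea, not what SeqClean or cross_match actually do internally. The function name and the minimum tail length of 10 are assumptions of mine, not values from our pipeline.

```python
import re

# Hypothetical minimum run length for calling something a tail;
# pick a value that suits your own library prep.
MIN_TAIL = 10

def trim_poly_tails(seq, min_tail=MIN_TAIL):
    """Strip a trailing polyA run and a leading polyT run.

    Only runs of at least `min_tail` bases at the very end (polyA)
    or very start (polyT) of the read are removed.
    """
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq, flags=re.IGNORECASE)
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq, flags=re.IGNORECASE)
    return seq

# A read with a 12-base polyA tail loses the tail.
cleaned = trim_poly_tails("ACGTACGT" + "A" * 12)
```

Short runs of A or T inside or at the ends of a read are left alone, which is the reason for having a minimum length at all.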

After cleaning, the reads are fed into the assembly pipeline, TGI Clustering Tools (TGICL, available from the same site as SeqClean). This is another pipeline first developed by TIGR, for clustering and assembling ESTs for their Gene Index project. It calculates similarity scores for all possible pairwise comparisons of reads, then performs a transitive clustering of the reads based on those scores. Finally, it assembles each cluster using CAP3. We use parameters a little more stringent than the defaults (minimum overlap and percent identity). At this stage any singletons are set aside and not considered further. All of the contigs created are then assembled together using CAP3, with more relaxed parameters than in the first round. You will still end up with multiple contigs which are very similar.
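If "transitive clustering" is unfamiliar: if read A is similar to B, and B is similar to C, then A, B and C all land in one cluster even if A and C never hit each other directly. A rough Python sketch of that idea using a union-find structure follows. This is my own illustration, not TGICL's code, and the read names and the list of similar pairs are made up.

```python
def transitive_clusters(read_ids, similar_pairs):
    """Group reads so that any chain of similar pairs ends up in one cluster."""
    parent = {r: r for r in read_ids}

    def find(r):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in read_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical reads and the pairs that passed the similarity cutoff.
reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]

clusters = transitive_clusters(reads, pairs)
to_assemble = [c for c in clusters if len(c) > 1]   # each goes to the assembler
singletons = [c[0] for c in clusters if len(c) == 1]  # set aside
```

Here r1 and r3 share a cluster only through r2, and r6 has no partner, so it is a singleton.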

The two-stage assembly does add an extra layer of complexity when you are trying to track reads. Since the components of the second-round assembly are themselves contigs, you have to trace back to the reads that made up those contigs in the first-round assembly.
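The bookkeeping amounts to composing two lookup tables. A small Python sketch, assuming you have already parsed the assembler output into dictionaries (the parsing is not shown, and all contig and read names here are invented):

```python
def reads_for_final_contigs(round1, round2):
    """Map each second-round contig to the reads behind it.

    round1: first-round contig id -> list of read ids
    round2: second-round contig id -> list of first-round contig ids
    """
    result = {}
    for final_id, member_contigs in round2.items():
        reads = []
        for contig_id in member_contigs:
            reads.extend(round1[contig_id])
        result[final_id] = sorted(set(reads))
    return result

# Hypothetical membership tables from the two rounds.
round1 = {
    "CL1Contig1": ["readA", "readB"],
    "CL1Contig2": ["readC"],
    "CL2Contig1": ["readD", "readE"],
}
round2 = {
    "Final1": ["CL1Contig1", "CL1Contig2"],
    "Final2": ["CL2Contig1"],
}

final_reads = reads_for_final_contigs(round1, round2)
```

Keeping the two tables separate also lets you go the other way and ask which final contig a given read ended up in.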

If you decide that you do not want to do an entirely new assembly, I do have an alternative. As you have discovered, you will never be able to make a 1-to-1 matching of contigs, but you could try to create groups of contigs from the two assemblies. A useful program for this is blastclust, which is part of the standard NCBI blast toolkit. The grouping can be very stringent (e.g. only finding orthologous sequences) or more relaxed (grouping sequences from gene families), depending on how you adjust the two primary scoring parameters, -L and -S. In a situation like yours you will have to be careful with the -L parameter. This parameter controls what percentage of the shorter sequence must overlap the longer one. Blastclust was written assuming that people would be comparing complete sequences (transcripts or proteins), so that one sequence should be 'contained' within the other. This is not true for your incomplete transcript assemblies.
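To see why -L bites with partial contigs, here is a toy Python calculation of the coverage criterion as I described it above. It is a simplified model for illustration only, not blastclust's actual implementation, and the lengths and thresholds are invented.

```python
def shorter_seq_coverage(len_a, len_b, overlap_len):
    """Fraction of the shorter sequence that lies inside the overlap."""
    return overlap_len / min(len_a, len_b)

def passes_length_threshold(len_a, len_b, overlap_len, threshold):
    """True if the overlap covers at least `threshold` of the shorter sequence."""
    return shorter_seq_coverage(len_a, len_b, overlap_len) >= threshold

# Case 1: an 800 bp contig fully contained in a 2000 bp contig.
contained = shorter_seq_coverage(800, 2000, 800)

# Case 2: two partial contigs of the same transcript (1200 bp and 1500 bp)
# that only share 300 bp at their ends.
partial = shorter_seq_coverage(1200, 1500, 300)

strict = passes_length_threshold(1200, 1500, 300, 0.9)   # hypothetical strict setting
relaxed = passes_length_threshold(1200, 1500, 300, 0.2)  # hypothetical relaxed setting
```

The contained pair has full coverage and passes any threshold, while the end-to-end pair covers only a quarter of the shorter contig and is grouped only when the threshold is set low.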

I rambled on for quite some time here; I hope you find some of this information useful.
kmcarr