Seqanswers Leaderboard Ad

**kmcarr** · 01-19-2009, 09:47 AM

John,

You are absolutely on the right track in wanting to do a single, unified assembly of the data. We do exactly the type of analysis you are trying on a very regular basis. We haven't used Newbler for transcript assembly in quite a while. Here is the procedure we use:

Clean the hell out of your raw sequences; trimming polyA (or polyT), vector/adapter sequences and low quality and low complexity regions. It certainly helps to know what procedures/kits/adapters were used in creating the cDNA library used for 454 sequencing so that you can limit screening steps to just those. We first run cross_match to do vector screening. The screened output is then input to the SeqClean. SeqClean (http://compbio.dfci.harvard.edu/tgi/software/) is a pipeline originally created at TIGR for cleaning EST sequences.

After cleaning the reads are fed into the assembly pipeline TGI Clustering Tools (TGICL, also available at the URL above). This is another pipeline first developed by TIGR for clustering and assembling ESTs for their Gene Index project. It calculates pairwise similarity scores for all possible pairwise comparisons. It then performs a transitive clustering of the reads based on these similarity scores. Finally, it assembles each cluster using CAP3. We use parameters a little more stringent than the defaults (minimum overlap and percent identity). At this stage any singletons are set aside and not considered further. All of the contigs created are then assembled together using CAP3, with more relaxed parameters than the first round. You will still end up with multiple contigs which are very similar.

The two stage assembly does add an extra layer of complexity when you are trying to track reads. Since the assembly components of the second round would be contigs themselves you have to track back to which reads made up those contigs from the first round assembly.

If you decide that you do not want to do an entire new assembly I do have an alternative. As you have discovered you will never be able to make a 1-to-1 matching of contigs but you could try to create groups of contigs from the two assemblies. A useful program to do this is blastclust, which is part of the standard NCBI blast toolkit. The grouping can be very stringent (e.g. only finding orhtologous sequences) or more relaxed (grouping sequences from gene families) based on the adjusting the two primary scoring parameters -L and -S. In a situation like yours you will have to be careful with -L parameter. This parameter controls what percentage of the shorter sequence must overlap the longer one. Blastclust was written assuming assuming that people would be comparing complete sequences (transcripts or proteins) so that one sequence should be 'contained' within the other. This is not true for your incomplete transcript assemblies.

I rambled on for quite some time here, I hope you find some of this information useful.

**johnwhitaker** · 01-19-2009, 10:12 AM

Thanks for your detailed reply.

I had been thinking about clustering the contigs today. I have used blastclust before also I was having a look at CLOBB (http://www.biomedcentral.com/1471-2105/3/31/abstract) as it seems to be similar to blastclust but optimised for EST clustering. Also I found a program called wcd (http://bioinformatics.oxfordjournals...ull/24/13/1542) that is also designed EST data.

I think I'll give the clustering a go. If I have lots of problems then I might look into using the pipeline you suggested.

**Emanuel Heitlinger** · 05-20-2009, 07:05 AM

Hi,

I am just following kmcarr's advice using TGICL to cluster ~75.000 ~250nt sequences from a GS FLX run.

CAP3 has a problem, giving the following error:

*** glibc detected *** cap3: corrupted double-linked list: 0x09724b98 ***
======= Backtrace: =========
/lib/libc.so.6[0x8763a4]
/lib/libc.so.6[0x878123]
/lib/libc.so.6(cfree+0x96)[0x878356]
/lib/libc.so.6(fclose+0xea)[0x92b65a]
cap3[0x8056cd6]
cap3[0x80497c0]
cap3(free+0x4a)[0x8048726]
======= Memory map: ========
0070a000-00717000 r-xp 00000000 fd:00 21227675 /lib/libgcc_s-4.3.2-20081105.so.1
00717000-00718000 rwxp 0000c000 fd:00 21227675 /lib/libgcc_s-4.3.2-20081105.so.1
007e2000-00802000 r-xp 00000000 fd:00 21227666 /lib/ld-2.9.so
00803000-00804000 r-xp 00020000 fd:00 21227666 /lib/ld-2.9.so
00804000-00805000 rwxp 00021000 fd:00 21227666 /lib/ld-2.9.so
00807000-00975000 r-xp 00000000 fd:00 21227601 /lib/libc-2.9.so
00975000-00977000 r-xp 0016e000 fd:00 21227601 /lib/libc-2.9.so
00977000-00978000 rwxp 00170000 fd:00 21227601 /lib/libc-2.9.so
00978000-0097b000 rwxp 00978000 00:00 0
08048000-08060000 r-xp 00000000 fd:00 18547219 /home/ele/tools/tgicl_linux/bin/cap3
08060000-08061000 rwxp 00017000 fd:00 18547219 /home/ele/tools/tgicl_linux/bin/cap3
08061000-081e1000 rwxp 08061000 00:00 0
0970e000-097d6000 rwxp 0970e000 00:00 0 [heap]
f7e00000-f7e21000 rwxp f7e00000 00:00 0
f7e21000-f7f00000 ---p f7e21000 00:00 0
f7f9f000-f7fa2000 rwxp f7f9f000 00:00 0
f7fbb000-f7fbc000 rwxp f7fbb000 00:00 0
f7fbc000-f7fbd000 r-xp f7fbc000 00:00 0 [vdso]
ff993000-ff9bd000 rwxp 7ffffffd5000 00:00 0 [stack]

this is repeated several times...

It is the same error whether I use multiple processors (-c option) or just one.

Has anybody an idea what to do?

I'm working on a 2x quad-core, 16GB RAM running fedora10.x86_64.

**sklages** · 05-20-2009, 07:36 AM

Maybe it's a good idea to get the latest cap3 binary for your platform at http://seq.cs.iastate.edu/

If you want to cluster before assembling you could try wcd (http://code.google.com/p/wcdest/) which makes a good job on clustering our data.

You can also use (for assembling your transcriptome dataset) MIRA2 assembler, http://chevreux.org/projects_mira.html

We ran into problems with TGICL with 454 data with very deep clusters (the clustering itself crashed, not the subsequent assembly). The author doesn't recommend TGICL for big NGS datasets.

hth,
Sven

**Emanuel Heitlinger** · 05-21-2009, 03:20 AM

Thanks!

I just downloaded the right binay, copied it to tgicl/bin and voila: It works! The clustering takes only very short time running on 6 processors... I still have to wait wait till CAP3 finishes.

MIRA looks very promising, their web page has great howtos and what they say in their paper on miraEST seems very promising.

SCARF looks promising as well. Although I am sure, that I will not have an organism being related closely enough (working on a notoriously long branching nematode).

I have to take a closer look into the literature as I am still a bit confused which tools do clustering, assembly or both... I will try to post a summary once I have a bit more of a clue.

**donniemarco** · 01-18-2010, 05:18 PM

Wcd

I am trying to run WCD however I am not able to generate sequence cluster that is generated by WCD with default parameters as follows:
wcd -c -N 4 -o test.clust -g -t testSeq.fna

I get the output file test.clust but it does not have clustered sequences just the information as follows:
Index Cluster Link Orientation Witness
0 0 -1 1 -1
1 1 -1 1 -1

I looked further down and its shows the stats:

Number of clusters is 2023 with ave sq of len %
Size of Cluster Number of Clusters
1 1902
2 86
3 25
4 4
5 4
11 1
29 1

And further down below are just each cluster and the links:
...........
208.
210.
211.
212.
213 161 406.
..........

Am I missing any parameter. In the document, its written that sequences are generated. Any help will be highly appreciated.

**sklages** · 01-18-2010, 09:40 PM

Originally posted by donniemarco View Post

[....]
And further down below are just each cluster and the links:
...........
208.
210.
211.
212.
213 161 406.
..........

Am I missing any parameter. In the document, its written that sequences are generated. Any help will be highly appreciated.

This output is all you need ... It tells you that the sequences with index 208, 210, 211 and 212 are singletons, whereas the sequences 213, 161 and 406 form a cluster with the given parameters (note that defaultly counting starts from "0").
It's now up to you to extract the cluster-forming sequences from your input file and assemble them with cap3, phrap, MIRA or whatever assembler ...

cheers,
Sven

**donniemarco** · 01-20-2010, 09:08 AM

Hi Sven,

Thanks for your helpful reply.

If I extract all the cluster forming sequences say 213, 161 and 406 (actual sequence 214, 162, 407: thanks for pointing this as I had missed this) from the file and try to assemble them with say MIRA.

Do you not think that Mira not knowing these being one cluster will again be restricted to perform assembling the same way as these sequences would be assembled from original file (example the transitive closure property that assemblers lack).

I am thinking of using these cluster to generate single supercontig (and then probably rerun with my remainging singleton data with MIRA) however I am looking any particular program to do this. Any help would be appreciated.

Thanks!!

Originally posted by sklages View Post

This output is all you need ... It tells you that the sequences with index 208, 210, 211 and 212 are singletons, whereas the sequences 213, 161 and 406 form a cluster with the given parameters (note that defaultly counting starts from "0").
It's now up to you to extract the cluster-forming sequences from your input file and assemble them with cap3, phrap, MIRA or whatever assembler ...

cheers,
Sven

**sklages** · 01-20-2010, 11:57 AM

Originally posted by donniemarco View Post

Hi Sven,

Thanks for your helpful reply.

If I extract all the cluster forming sequences say 213, 161 and 406 (actual sequence 214, 162, 407: thanks for pointing this as I had missed this) from the file and try to assemble them with say MIRA.

-Q int, --reindex int: reindex sequence ids.

may change the default indexing start.

Do you not think that Mira not knowing these being one cluster will again be restricted to perform assembling the same way as these sequences would be assembled from original file (example the transitive closure property that assemblers lack).

Yes, I'd expect MIRA to assemble these reads similar as if done with the whole dataset, maybe. It's not always easy to directly compare different clustering/assembly approaches. It depends on what you are expecting.

For 454-generated EST datasets I currently use MIRA3 alone, quite successful IMHO. Datasets range from ~300,000 to 1,500.000 reads.

I am thinking of using these cluster to generate single supercontig (and then probably rerun with my remainging singleton data with MIRA) however I am looking any particular program to do this. Any help would be appreciated.
Thanks!!

What do want to achieve? What is a 'supercontig' in this context?

A common way is to use a cluster program (wcd,tclust) and assemble these clusters with "cap3". This works quite fine. You could also use phrap.
I'd use MIRA as a "complete program to assemble ESTs".

If you want to further "join" contigs, just do so with cap3 or phrap
(although I would not do this with my datasets ;-))

Again, everthings depends on what you want / are expecting.

cheers,
Sven

**donniemarco** · 01-21-2010, 02:58 PM

Thanks for the reply.

I will run assembling with mira and see how it goes. BTW, earlier I ran the same data with TIGR but it seems to have crashed while performing assembling with CAP3 (it finished clustering). I used 32gb RAM, 8 cpu processor so I think it was not memory contraint somehow. The no. of reads are approx. 900,000 with file size of 350mb.

Originally posted by sklages View Post

-Q int, --reindex int: reindex sequence ids.

may change the default indexing start.

Yes, I'd expect MIRA to assemble these reads similar as if done with the whole dataset, maybe. It's not always easy to directly compare different clustering/assembly approaches. It depends on what you are expecting.

For 454-generated EST datasets I currently use MIRA3 alone, quite successful IMHO. Datasets range from ~300,000 to 1,500.000 reads.

What do want to achieve? What is a 'supercontig' in this context?

A common way is to use a cluster program (wcd,tclust) and assemble these clusters with "cap3". This works quite fine. You could also use phrap.
I'd use MIRA as a "complete program to assemble ESTs".

If you want to further "join" contigs, just do so with cap3 or phrap
(although I would not do this with my datasets ;-))

Again, everthings depends on what you want / are expecting.

cheers,
Sven

Thanks!!!!!!!!

**Emanuel Heitlinger** · 01-22-2010, 12:54 AM

donniemarco,
did you read the whole thread? Look at my post about progblems with cap within tgicl! If yours are is the same problem, use the solution Sklages provided, it worked for me and for at least one other user of tgicl I know of.

**donniemarco** · 01-22-2010, 08:37 AM

yes, i got the same error and i downloaded the cap3 specific to my machine as suggested by Sklages.
It seemed to have worked as it didn't gave the same error message about cap3 but crashed after 1.5-2 days with error message something as follows:

>>> Assemble [] started at
-Waiting for all children to finish before starting last child!!
-Waiting for all children to finish!!

Process terminated with an error at step 'Assemble' !

TGICL encountered an error at Step 'Aseemble'

I don't have any idea whether it crashed on assembling or last step of clustering when it seems to be waitng for last thread to return?

Originally posted by Emanuel Heitlinger View Post

donniemarco,
did you read the whole thread? Look at my post about progblems with cap within tgicl! If yours are is the same problem, use the solution Sklages provided, it worked for me and for at least one other user of tgicl I know of.

**sklages** · 01-22-2010, 09:55 AM

That's a problem with the TGICL, not with cap3 itself, it seems.
I ran into problems as well with TGICL, that's why I don't use it anymore :-)

cheers,
Sven

**kmcarr** · 01-22-2010, 11:27 AM

Originally posted by sklages View Post

That's a problem with the TGICL, not with cap3 itself, it seems.
I ran into problems as well with TGICL, that's why I don't use it anymore :-)

Originally posted by donniemarco View Post

yes, i got the same error and i downloaded the cap3 specific to my machine as suggested by Sklages.
It seemed to have worked as it didn't gave the same error message about cap3 but crashed after 1.5-2 days with error message something as follows:

>>> Assemble [] started at
-Waiting for all children to finish before starting last child!!
-Waiting for all children to finish!!

Process terminated with an error at step 'Assemble' !

TGICL encountered an error at Step 'Aseemble'

I don't have any idea whether it crashed on assembling or last step of clustering when it seems to be waitng for last thread to return?

It could still be a problem with CAP3. All TIGCL knows is that CAP3 did not exit normally. There are error logs in each of the 'assemble_n' subdirectories. Look in those error logs to get a better idea of the actual error.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

454 assemly

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News