SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > 454 Pyrosequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
how to use CLC bio software for assemly with SOLiD noas SOLiD 3 12-08-2011 06:54 AM
Difference between 454 pair end library and 454 standard library edge 454 Pyrosequencing 1 09-30-2009 01:08 AM

Reply
 
Thread Tools
Old 01-15-2009, 08:06 AM   #1
johnwhitaker
Junior Member
 
Location: leeds

Join Date: Jan 2009
Posts: 6
Question 454 assemly

Hello,

I have been given two sets 454 data which have already been assembled into contigs using the newbeler assembler. The data is transcript sequence from two different flower types, from a plant species which has not yet had its genome sequenced.

They want me to compare the expression levels between two sets of data by comparing the number of reads which were used to make up matching contigs. The coverage isn't that great so some genes have multiple contigs. I can tell this as their are many cases several smaller contigs from one sample (but still in the 100's) align with a larger ones in the other.

I thought the best way to deal with this would be to pool the reads from the two samples and assemble them. Then compare the number of reads from each sample that make up each of the contigs.

I don't have access to newbeler so I tired re-assembling one of the samples with velvet. But this resulted in 25,091 contigs whereas newebler had produced 13,990.

Does anyone know which of the publicly available assemeblers works bets on 454 data? Or has anyone got an suggestion on how to deal with this data?

Thanks,

John
johnwhitaker is offline   Reply With Quote
Old 01-19-2009, 08:47 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169
Default

John,

You are absolutely on the right track in wanting to do a single, unified assembly of the data. We do exactly the type of analysis you are trying on a very regular basis. We haven't used Newbler for transcript assembly in quite a while. Here is the procedure we use:

Clean the hell out of your raw sequences; trimming polyA (or polyT), vector/adapter sequences and low quality and low complexity regions. It certainly helps to know what procedures/kits/adapters were used in creating the cDNA library used for 454 sequencing so that you can limit screening steps to just those. We first run cross_match to do vector screening. The screened output is then input to the SeqClean. SeqClean (http://compbio.dfci.harvard.edu/tgi/software/) is a pipeline originally created at TIGR for cleaning EST sequences.

After cleaning the reads are fed into the assembly pipeline TGI Clustering Tools (TGICL, also available at the URL above). This is another pipeline first developed by TIGR for clustering and assembling ESTs for their Gene Index project. It calculates pairwise similarity scores for all possible pairwise comparisons. It then performs a transitive clustering of the reads based on these similarity scores. Finally, it assembles each cluster using CAP3. We use parameters a little more stringent than the defaults (minimum overlap and percent identity). At this stage any singletons are set aside and not considered further. All of the contigs created are then assembled together using CAP3, with more relaxed parameters than the first round. You will still end up with multiple contigs which are very similar.

The two stage assembly does add an extra layer of complexity when you are trying to track reads. Since the assembly components of the second round would be contigs themselves you have to track back to which reads made up those contigs from the first round assembly.

If you decide that you do not want to do an entire new assembly I do have an alternative. As you have discovered you will never be able to make a 1-to-1 matching of contigs but you could try to create groups of contigs from the two assemblies. A useful program to do this is blastclust, which is part of the standard NCBI blast toolkit. The grouping can be very stringent (e.g. only finding orhtologous sequences) or more relaxed (grouping sequences from gene families) based on the adjusting the two primary scoring parameters -L and -S. In a situation like yours you will have to be careful with -L parameter. This parameter controls what percentage of the shorter sequence must overlap the longer one. Blastclust was written assuming assuming that people would be comparing complete sequences (transcripts or proteins) so that one sequence should be 'contained' within the other. This is not true for your incomplete transcript assemblies.

I rambled on for quite some time here, I hope you find some of this information useful.
kmcarr is offline   Reply With Quote
Old 01-19-2009, 09:12 AM   #3
johnwhitaker
Junior Member
 
Location: leeds

Join Date: Jan 2009
Posts: 6
Default

Thanks for your detailed reply.

I had been thinking about clustering the contigs today. I have used blastclust before also I was having a look at CLOBB (http://www.biomedcentral.com/1471-2105/3/31/abstract) as it seems to be similar to blastclust but optimised for EST clustering. Also I found a program called wcd (http://bioinformatics.oxfordjournals...ull/24/13/1542) that is also designed EST data.

I think I'll give the clustering a go. If I have lots of problems then I might look into using the pipeline you suggested.
johnwhitaker is offline   Reply With Quote
Old 05-20-2009, 07:05 AM   #4
Emanuel Heitlinger
Member
 
Location: Karlsruhe

Join Date: Mar 2009
Posts: 11
Default

Hi,

I am just following kmcarr's advice using TGICL to cluster ~75.000 ~250nt sequences from a GS FLX run.

CAP3 has a problem, giving the following error:
*** glibc detected *** cap3: corrupted double-linked list: 0x09724b98 ***
======= Backtrace: =========
/lib/libc.so.6[0x8763a4]
/lib/libc.so.6[0x878123]
/lib/libc.so.6(cfree+0x96)[0x878356]
/lib/libc.so.6(fclose+0xea)[0x92b65a]
cap3[0x8056cd6]
cap3[0x80497c0]
cap3(free+0x4a)[0x8048726]
======= Memory map: ========
0070a000-00717000 r-xp 00000000 fd:00 21227675 /lib/libgcc_s-4.3.2-20081105.so.1
00717000-00718000 rwxp 0000c000 fd:00 21227675 /lib/libgcc_s-4.3.2-20081105.so.1
007e2000-00802000 r-xp 00000000 fd:00 21227666 /lib/ld-2.9.so
00803000-00804000 r-xp 00020000 fd:00 21227666 /lib/ld-2.9.so
00804000-00805000 rwxp 00021000 fd:00 21227666 /lib/ld-2.9.so
00807000-00975000 r-xp 00000000 fd:00 21227601 /lib/libc-2.9.so
00975000-00977000 r-xp 0016e000 fd:00 21227601 /lib/libc-2.9.so
00977000-00978000 rwxp 00170000 fd:00 21227601 /lib/libc-2.9.so
00978000-0097b000 rwxp 00978000 00:00 0
08048000-08060000 r-xp 00000000 fd:00 18547219 /home/ele/tools/tgicl_linux/bin/cap3
08060000-08061000 rwxp 00017000 fd:00 18547219 /home/ele/tools/tgicl_linux/bin/cap3
08061000-081e1000 rwxp 08061000 00:00 0
0970e000-097d6000 rwxp 0970e000 00:00 0 [heap]
f7e00000-f7e21000 rwxp f7e00000 00:00 0
f7e21000-f7f00000 ---p f7e21000 00:00 0
f7f9f000-f7fa2000 rwxp f7f9f000 00:00 0
f7fbb000-f7fbc000 rwxp f7fbb000 00:00 0
f7fbc000-f7fbd000 r-xp f7fbc000 00:00 0 [vdso]
ff993000-ff9bd000 rwxp 7ffffffd5000 00:00 0 [stack]
this is repeated several times...

It is the same error whether I use multiple processors (-c option) or just one.

Has anybody an idea what to do?

I'm working on a 2x quad-core, 16GB RAM running fedora10.x86_64.

Last edited by Emanuel Heitlinger; 05-20-2009 at 07:08 AM.
Emanuel Heitlinger is offline   Reply With Quote
Old 05-20-2009, 07:36 AM   #5
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Maybe it's a good idea to get the latest cap3 binary for your platform at http://seq.cs.iastate.edu/

If you want to cluster before assembling you could try wcd (http://code.google.com/p/wcdest/) which makes a good job on clustering our data.

You can also use (for assembling your transcriptome dataset) MIRA2 assembler, http://chevreux.org/projects_mira.html

We ran into problems with TGICL with 454 data with very deep clusters (the clustering itself crashed, not the subsequent assembly). The author doesn't recommend TGICL for big NGS datasets.

hth,
Sven
sklages is offline   Reply With Quote
Old 05-21-2009, 03:20 AM   #6
Emanuel Heitlinger
Member
 
Location: Karlsruhe

Join Date: Mar 2009
Posts: 11
Default

Thanks!

I just downloaded the right binay, copied it to tgicl/bin and voila: It works! The clustering takes only very short time running on 6 processors... I still have to wait wait till CAP3 finishes.

MIRA looks very promising, their web page has great howtos and what they say in their paper on miraEST seems very promising.

SCARF looks promising as well. Although I am sure, that I will not have an organism being related closely enough (working on a notoriously long branching nematode).

I have to take a closer look into the literature as I am still a bit confused which tools do clustering, assembly or both... I will try to post a summary once I have a bit more of a clue.

Last edited by Emanuel Heitlinger; 05-21-2009 at 05:06 AM.
Emanuel Heitlinger is offline   Reply With Quote
Old 01-18-2010, 04:18 PM   #7
donniemarco
Member
 
Location: USA

Join Date: Aug 2009
Posts: 17
Default Wcd

I am trying to run WCD however I am not able to generate sequence cluster that is generated by WCD with default parameters as follows:
wcd -c -N 4 -o test.clust -g -t testSeq.fna


I get the output file test.clust but it does not have clustered sequences just the information as follows:
Index Cluster Link Orientation Witness
0 0 -1 1 -1
1 1 -1 1 -1

I looked further down and its shows the stats:

Number of clusters is 2023 with ave sq of len %
Size of Cluster Number of Clusters
1 1902
2 86
3 25
4 4
5 4
11 1
29 1

And further down below are just each cluster and the links:
...........
208.
210.
211.
212.
213 161 406.
..........

Am I missing any parameter. In the document, its written that sequences are generated. Any help will be highly appreciated.
donniemarco is offline   Reply With Quote
Old 01-18-2010, 08:40 PM   #8
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Quote:
Originally Posted by donniemarco View Post
[....]
And further down below are just each cluster and the links:
...........
208.
210.
211.
212.
213 161 406.
..........

Am I missing any parameter. In the document, its written that sequences are generated. Any help will be highly appreciated.
This output is all you need ... It tells you that the sequences with index 208, 210, 211 and 212 are singletons, whereas the sequences 213, 161 and 406 form a cluster with the given parameters (note that defaultly counting starts from "0").
It's now up to you to extract the cluster-forming sequences from your input file and assemble them with cap3, phrap, MIRA or whatever assembler ...

cheers,
Sven
sklages is offline   Reply With Quote
Old 01-20-2010, 08:08 AM   #9
donniemarco
Member
 
Location: USA

Join Date: Aug 2009
Posts: 17
Default

Hi Sven,

Thanks for your helpful reply.

If I extract all the cluster forming sequences say 213, 161 and 406 (actual sequence 214, 162, 407: thanks for pointing this as I had missed this) from the file and try to assemble them with say MIRA.

Do you not think that Mira not knowing these being one cluster will again be restricted to perform assembling the same way as these sequences would be assembled from original file (example the transitive closure property that assemblers lack).

I am thinking of using these cluster to generate single supercontig (and then probably rerun with my remainging singleton data with MIRA) however I am looking any particular program to do this. Any help would be appreciated.

Thanks!!

Quote:
Originally Posted by sklages View Post
This output is all you need ... It tells you that the sequences with index 208, 210, 211 and 212 are singletons, whereas the sequences 213, 161 and 406 form a cluster with the given parameters (note that defaultly counting starts from "0").
It's now up to you to extract the cluster-forming sequences from your input file and assemble them with cap3, phrap, MIRA or whatever assembler ...

cheers,
Sven
donniemarco is offline   Reply With Quote
Old 01-20-2010, 10:57 AM   #10
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Quote:
Originally Posted by donniemarco View Post
Hi Sven,

Thanks for your helpful reply.

If I extract all the cluster forming sequences say 213, 161 and 406 (actual sequence 214, 162, 407: thanks for pointing this as I had missed this) from the file and try to assemble them with say MIRA.
-Q int, --reindex int: reindex sequence ids.

may change the default indexing start.

Quote:
Do you not think that Mira not knowing these being one cluster will again be restricted to perform assembling the same way as these sequences would be assembled from original file (example the transitive closure property that assemblers lack).
Yes, I'd expect MIRA to assemble these reads similar as if done with the whole dataset, maybe. It's not always easy to directly compare different clustering/assembly approaches. It depends on what you are expecting.

For 454-generated EST datasets I currently use MIRA3 alone, quite successful IMHO. Datasets range from ~300,000 to 1,500.000 reads.

Quote:
I am thinking of using these cluster to generate single supercontig (and then probably rerun with my remainging singleton data with MIRA) however I am looking any particular program to do this. Any help would be appreciated.
Thanks!!
What do want to achieve? What is a 'supercontig' in this context?

A common way is to use a cluster program (wcd,tclust) and assemble these clusters with "cap3". This works quite fine. You could also use phrap.
I'd use MIRA as a "complete program to assemble ESTs".

If you want to further "join" contigs, just do so with cap3 or phrap
(although I would not do this with my datasets ;-))

Again, everthings depends on what you want / are expecting.

cheers,
Sven
sklages is offline   Reply With Quote
Old 01-21-2010, 01:58 PM   #11
donniemarco
Member
 
Location: USA

Join Date: Aug 2009
Posts: 17
Default

Thanks for the reply.

I will run assembling with mira and see how it goes. BTW, earlier I ran the same data with TIGR but it seems to have crashed while performing assembling with CAP3 (it finished clustering). I used 32gb RAM, 8 cpu processor so I think it was not memory contraint somehow. The no. of reads are approx. 900,000 with file size of 350mb.

Quote:
Originally Posted by sklages View Post
-Q int, --reindex int: reindex sequence ids.

may change the default indexing start.



Yes, I'd expect MIRA to assemble these reads similar as if done with the whole dataset, maybe. It's not always easy to directly compare different clustering/assembly approaches. It depends on what you are expecting.

For 454-generated EST datasets I currently use MIRA3 alone, quite successful IMHO. Datasets range from ~300,000 to 1,500.000 reads.



What do want to achieve? What is a 'supercontig' in this context?

A common way is to use a cluster program (wcd,tclust) and assemble these clusters with "cap3". This works quite fine. You could also use phrap.
I'd use MIRA as a "complete program to assemble ESTs".

If you want to further "join" contigs, just do so with cap3 or phrap
(although I would not do this with my datasets ;-))

Again, everthings depends on what you want / are expecting.

cheers,
Sven
Thanks!!!!!!!!
donniemarco is offline   Reply With Quote
Old 01-21-2010, 11:54 PM   #12
Emanuel Heitlinger
Member
 
Location: Karlsruhe

Join Date: Mar 2009
Posts: 11
Default

donniemarco,
did you read the whole thread? Look at my post about progblems with cap within tgicl! If yours are is the same problem, use the solution Sklages provided, it worked for me and for at least one other user of tgicl I know of.
Emanuel Heitlinger is offline   Reply With Quote
Old 01-22-2010, 07:37 AM   #13
donniemarco
Member
 
Location: USA

Join Date: Aug 2009
Posts: 17
Default

yes, i got the same error and i downloaded the cap3 specific to my machine as suggested by Sklages.
It seemed to have worked as it didn't gave the same error message about cap3 but crashed after 1.5-2 days with error message something as follows:

>>> Assemble [] started at
-Waiting for all children to finish before starting last child!!
-Waiting for all children to finish!!

Process terminated with an error at step 'Assemble' !

TGICL encountered an error at Step 'Aseemble'

I don't have any idea whether it crashed on assembling or last step of clustering when it seems to be waitng for last thread to return?

Quote:
Originally Posted by Emanuel Heitlinger View Post
donniemarco,
did you read the whole thread? Look at my post about progblems with cap within tgicl! If yours are is the same problem, use the solution Sklages provided, it worked for me and for at least one other user of tgicl I know of.
donniemarco is offline   Reply With Quote
Old 01-22-2010, 08:55 AM   #14
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

That's a problem with the TGICL, not with cap3 itself, it seems.
I ran into problems as well with TGICL, that's why I don't use it anymore :-)

cheers,
Sven
sklages is offline   Reply With Quote
Old 01-22-2010, 10:27 AM   #15
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169
Default

Quote:
Originally Posted by sklages View Post
That's a problem with the TGICL, not with cap3 itself, it seems.
I ran into problems as well with TGICL, that's why I don't use it anymore :-)
Quote:
Originally Posted by donniemarco View Post
yes, i got the same error and i downloaded the cap3 specific to my machine as suggested by Sklages.
It seemed to have worked as it didn't gave the same error message about cap3 but crashed after 1.5-2 days with error message something as follows:

>>> Assemble [] started at
-Waiting for all children to finish before starting last child!!
-Waiting for all children to finish!!

Process terminated with an error at step 'Assemble' !

TGICL encountered an error at Step 'Aseemble'

I don't have any idea whether it crashed on assembling or last step of clustering when it seems to be waitng for last thread to return?
It could still be a problem with CAP3. All TIGCL knows is that CAP3 did not exit normally. There are error logs in each of the 'assemble_n' subdirectories. Look in those error logs to get a better idea of the actual error.
kmcarr is offline   Reply With Quote
Old 01-22-2010, 11:31 AM   #16
donniemarco
Member
 
Location: USA

Join Date: Aug 2009
Posts: 17
Default

Thanks, I looked into the error file and it seems to have run out of memory.

"************
Ran out of memory: 10911460019 bytes requested.
Error! cap3 failure detected (code=256) on: CL1
*************"

I also ran with another 16gig Ram machine but it seems to generate error there as well. I think I will try to upgrade the machine.

Quote:
Originally Posted by kmcarr View Post
It could still be a problem with CAP3. All TIGCL knows is that CAP3 did not exit normally. There are error logs in each of the 'assemble_n' subdirectories. Look in those error logs to get a better idea of the actual error.
donniemarco is offline   Reply With Quote
Old 01-22-2010, 12:06 PM   #17
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Quote:
Originally Posted by kmcarr View Post
It could still be a problem with CAP3. All TIGCL knows is that CAP3 did not exit normally. There are error logs in each of the 'assemble_n' subdirectories. Look in those error logs to get a better idea of the actual error.
Well, you are probably right ... I have not remembered very well :-)

I had a problem with zmsort,

Quote:
[...]
WAITING for all children to finish before starting last child!
WAITING for all children to finish!
<<< --- clustering [Pimp454FLXcomplete_2008-04-07_2008-04-08.fasta]
finished at May 22 03:40:16 2008
Error getting file size for 'zdir_cluster_1_002.Z'
Error at command:
zmsort -f11 -n -r -o zdir_cluster_1 -s 700 cluster_1/*.Z

Process terminated with an error, at step 'clustering'!
[...]
cap3 has problems assembling very deep clusters, runs into memory problems. You are right, having a closer at the log is a good idea :-)

cheers,
Sven
sklages is offline   Reply With Quote
Old 01-22-2010, 12:09 PM   #18
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

oh well, now i saw donniemarco's post. Memory :-)
16G is quite small ...

cheers,
Sven
sklages is offline   Reply With Quote
Old 03-08-2010, 10:16 PM   #19
ikim
Member
 
Location: US

Join Date: Mar 2010
Posts: 13
Default

Hi, I've come up with a very similar problem using TGICL/CAP3 for assembly, but I didn't think it was an issue with memory. Would anyone know if MiraEST can handle a 4 million+ read set? We are operating on a system with 64 GB workable ram.

Thanks!
IK
ikim is offline   Reply With Quote
Old 03-08-2010, 11:55 PM   #20
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

miramem could help estimating the amount of memory needed.

cheers,
Sven
sklages is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:16 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO