![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
how to use CLC bio software for assemly with SOLiD | noas | SOLiD | 3 | 12-08-2011 06:54 AM |
Difference between 454 pair end library and 454 standard library | edge | 454 Pyrosequencing | 1 | 09-30-2009 01:08 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: leeds Join Date: Jan 2009
Posts: 6
|
![]()
Hello,
I have been given two sets 454 data which have already been assembled into contigs using the newbeler assembler. The data is transcript sequence from two different flower types, from a plant species which has not yet had its genome sequenced. They want me to compare the expression levels between two sets of data by comparing the number of reads which were used to make up matching contigs. The coverage isn't that great so some genes have multiple contigs. I can tell this as their are many cases several smaller contigs from one sample (but still in the 100's) align with a larger ones in the other. I thought the best way to deal with this would be to pool the reads from the two samples and assemble them. Then compare the number of reads from each sample that make up each of the contigs. I don't have access to newbeler so I tired re-assembling one of the samples with velvet. But this resulted in 25,091 contigs whereas newebler had produced 13,990. Does anyone know which of the publicly available assemeblers works bets on 454 data? Or has anyone got an suggestion on how to deal with this data? Thanks, John |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]()
John,
You are absolutely on the right track in wanting to do a single, unified assembly of the data. We do exactly the type of analysis you are trying on a very regular basis. We haven't used Newbler for transcript assembly in quite a while. Here is the procedure we use: Clean the hell out of your raw sequences; trimming polyA (or polyT), vector/adapter sequences and low quality and low complexity regions. It certainly helps to know what procedures/kits/adapters were used in creating the cDNA library used for 454 sequencing so that you can limit screening steps to just those. We first run cross_match to do vector screening. The screened output is then input to the SeqClean. SeqClean (http://compbio.dfci.harvard.edu/tgi/software/) is a pipeline originally created at TIGR for cleaning EST sequences. After cleaning the reads are fed into the assembly pipeline TGI Clustering Tools (TGICL, also available at the URL above). This is another pipeline first developed by TIGR for clustering and assembling ESTs for their Gene Index project. It calculates pairwise similarity scores for all possible pairwise comparisons. It then performs a transitive clustering of the reads based on these similarity scores. Finally, it assembles each cluster using CAP3. We use parameters a little more stringent than the defaults (minimum overlap and percent identity). At this stage any singletons are set aside and not considered further. All of the contigs created are then assembled together using CAP3, with more relaxed parameters than the first round. You will still end up with multiple contigs which are very similar. The two stage assembly does add an extra layer of complexity when you are trying to track reads. Since the assembly components of the second round would be contigs themselves you have to track back to which reads made up those contigs from the first round assembly. If you decide that you do not want to do an entire new assembly I do have an alternative. As you have discovered you will never be able to make a 1-to-1 matching of contigs but you could try to create groups of contigs from the two assemblies. A useful program to do this is blastclust, which is part of the standard NCBI blast toolkit. The grouping can be very stringent (e.g. only finding orhtologous sequences) or more relaxed (grouping sequences from gene families) based on the adjusting the two primary scoring parameters -L and -S. In a situation like yours you will have to be careful with -L parameter. This parameter controls what percentage of the shorter sequence must overlap the longer one. Blastclust was written assuming assuming that people would be comparing complete sequences (transcripts or proteins) so that one sequence should be 'contained' within the other. This is not true for your incomplete transcript assemblies. I rambled on for quite some time here, I hope you find some of this information useful. |
![]() |
![]() |
![]() |
#3 |
Junior Member
Location: leeds Join Date: Jan 2009
Posts: 6
|
![]()
Thanks for your detailed reply.
I had been thinking about clustering the contigs today. I have used blastclust before also I was having a look at CLOBB (http://www.biomedcentral.com/1471-2105/3/31/abstract) as it seems to be similar to blastclust but optimised for EST clustering. Also I found a program called wcd (http://bioinformatics.oxfordjournals...ull/24/13/1542) that is also designed EST data. I think I'll give the clustering a go. If I have lots of problems then I might look into using the pipeline you suggested. |
![]() |
![]() |
![]() |
#4 |
Member
Location: Karlsruhe Join Date: Mar 2009
Posts: 11
|
![]()
Hi,
I am just following kmcarr's advice using TGICL to cluster ~75.000 ~250nt sequences from a GS FLX run. CAP3 has a problem, giving the following error: *** glibc detected *** cap3: corrupted double-linked list: 0x09724b98 ***this is repeated several times... It is the same error whether I use multiple processors (-c option) or just one. Has anybody an idea what to do? I'm working on a 2x quad-core, 16GB RAM running fedora10.x86_64. Last edited by Emanuel Heitlinger; 05-20-2009 at 07:08 AM. |
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
|
![]()
Maybe it's a good idea to get the latest cap3 binary for your platform at http://seq.cs.iastate.edu/
If you want to cluster before assembling you could try wcd (http://code.google.com/p/wcdest/) which makes a good job on clustering our data. You can also use (for assembling your transcriptome dataset) MIRA2 assembler, http://chevreux.org/projects_mira.html We ran into problems with TGICL with 454 data with very deep clusters (the clustering itself crashed, not the subsequent assembly). The author doesn't recommend TGICL for big NGS datasets. hth, Sven |
![]() |
![]() |
![]() |
#6 |
Member
Location: Karlsruhe Join Date: Mar 2009
Posts: 11
|
![]()
Thanks!
I just downloaded the right binay, copied it to tgicl/bin and voila: It works! The clustering takes only very short time running on 6 processors... I still have to wait wait till CAP3 finishes. MIRA looks very promising, their web page has great howtos and what they say in their paper on miraEST seems very promising. SCARF looks promising as well. Although I am sure, that I will not have an organism being related closely enough (working on a notoriously long branching nematode). I have to take a closer look into the literature as I am still a bit confused which tools do clustering, assembly or both... I will try to post a summary once I have a bit more of a clue. Last edited by Emanuel Heitlinger; 05-21-2009 at 05:06 AM. |
![]() |
![]() |
![]() |
#7 |
Member
Location: USA Join Date: Aug 2009
Posts: 17
|
![]()
I am trying to run WCD however I am not able to generate sequence cluster that is generated by WCD with default parameters as follows:
wcd -c -N 4 -o test.clust -g -t testSeq.fna I get the output file test.clust but it does not have clustered sequences just the information as follows: Index Cluster Link Orientation Witness 0 0 -1 1 -1 1 1 -1 1 -1 I looked further down and its shows the stats: Number of clusters is 2023 with ave sq of len % Size of Cluster Number of Clusters 1 1902 2 86 3 25 4 4 5 4 11 1 29 1 And further down below are just each cluster and the links: ........... 208. 210. 211. 212. 213 161 406. .......... Am I missing any parameter. In the document, its written that sequences are generated. Any help will be highly appreciated. |
![]() |
![]() |
![]() |
#8 | |
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
|
![]() Quote:
It's now up to you to extract the cluster-forming sequences from your input file and assemble them with cap3, phrap, MIRA or whatever assembler ... cheers, Sven |
|
![]() |
![]() |
![]() |
#9 | |
Member
Location: USA Join Date: Aug 2009
Posts: 17
|
![]()
Hi Sven,
Thanks for your helpful reply. If I extract all the cluster forming sequences say 213, 161 and 406 (actual sequence 214, 162, 407: thanks for pointing this as I had missed this) from the file and try to assemble them with say MIRA. Do you not think that Mira not knowing these being one cluster will again be restricted to perform assembling the same way as these sequences would be assembled from original file (example the transitive closure property that assemblers lack). I am thinking of using these cluster to generate single supercontig (and then probably rerun with my remainging singleton data with MIRA) however I am looking any particular program to do this. Any help would be appreciated. Thanks!! Quote:
|
|
![]() |
![]() |
![]() |
#10 | |||
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
|
![]() Quote:
may change the default indexing start. Quote:
For 454-generated EST datasets I currently use MIRA3 alone, quite successful IMHO. Datasets range from ~300,000 to 1,500.000 reads. Quote:
A common way is to use a cluster program (wcd,tclust) and assemble these clusters with "cap3". This works quite fine. You could also use phrap. I'd use MIRA as a "complete program to assemble ESTs". If you want to further "join" contigs, just do so with cap3 or phrap (although I would not do this with my datasets ;-)) Again, everthings depends on what you want / are expecting. cheers, Sven |
|||
![]() |
![]() |
![]() |
#11 | |
Member
Location: USA Join Date: Aug 2009
Posts: 17
|
![]()
Thanks for the reply.
I will run assembling with mira and see how it goes. BTW, earlier I ran the same data with TIGR but it seems to have crashed while performing assembling with CAP3 (it finished clustering). I used 32gb RAM, 8 cpu processor so I think it was not memory contraint somehow. The no. of reads are approx. 900,000 with file size of 350mb. Quote:
|
|
![]() |
![]() |
![]() |
#12 |
Member
Location: Karlsruhe Join Date: Mar 2009
Posts: 11
|
![]()
donniemarco,
did you read the whole thread? Look at my post about progblems with cap within tgicl! If yours are is the same problem, use the solution Sklages provided, it worked for me and for at least one other user of tgicl I know of. |
![]() |
![]() |
![]() |
#13 |
Member
Location: USA Join Date: Aug 2009
Posts: 17
|
![]()
yes, i got the same error and i downloaded the cap3 specific to my machine as suggested by Sklages.
It seemed to have worked as it didn't gave the same error message about cap3 but crashed after 1.5-2 days with error message something as follows: >>> Assemble [] started at -Waiting for all children to finish before starting last child!! -Waiting for all children to finish!! Process terminated with an error at step 'Assemble' ! TGICL encountered an error at Step 'Aseemble' I don't have any idea whether it crashed on assembling or last step of clustering when it seems to be waitng for last thread to return? |
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
|
![]()
That's a problem with the TGICL, not with cap3 itself, it seems.
I ran into problems as well with TGICL, that's why I don't use it anymore :-) cheers, Sven |
![]() |
![]() |
![]() |
#15 | ||
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
![]() Quote:
|
||
![]() |
![]() |
![]() |
#16 |
Member
Location: USA Join Date: Aug 2009
Posts: 17
|
![]()
Thanks, I looked into the error file and it seems to have run out of memory.
"************ Ran out of memory: 10911460019 bytes requested. Error! cap3 failure detected (code=256) on: CL1 *************" I also ran with another 16gig Ram machine but it seems to generate error there as well. I think I will try to upgrade the machine. |
![]() |
![]() |
![]() |
#17 | ||
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
|
![]() Quote:
I had a problem with zmsort, Quote:
cheers, Sven |
||
![]() |
![]() |
![]() |
#18 |
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
|
![]()
oh well, now i saw donniemarco's post. Memory :-)
16G is quite small ... cheers, Sven |
![]() |
![]() |
![]() |
#19 |
Member
Location: US Join Date: Mar 2010
Posts: 13
|
![]()
Hi, I've come up with a very similar problem using TGICL/CAP3 for assembly, but I didn't think it was an issue with memory. Would anyone know if MiraEST can handle a 4 million+ read set? We are operating on a system with 64 GB workable ram.
Thanks! IK |
![]() |
![]() |
![]() |
#20 |
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
|
![]()
miramem could help estimating the amount of memory needed.
cheers, Sven |
![]() |
![]() |
![]() |
Thread Tools | |
|
|