SEQanswers

Old 04-27-2009, 06:13 PM   #1
Neil
Junior Member
 
Location: Australia

Join Date: Mar 2009
Posts: 1
De Novo Assembly of a transcriptome

Hi all,
We are planning an mRNA-seq run on the Illumina GAII platform, and we are worried about assembling the transcriptome once we get our data back. Most of the RNA-seq papers I have read assemble against a reference genome or transcriptome, and we have neither! Has anyone out there assembled cDNA short reads de novo? If so, are paired reads as important as they are for genome assembly?
Is there an example dataset of paired-end mRNA-seq short reads that I can download to simulate an assembly?
Also, what software would you recommend for this?
Hope someone can help.
Best regards,
Neil
Old 04-28-2009, 01:02 AM   #2
Rao
Member
 
Location: India

Join Date: Oct 2008
Posts: 36

Checking for available ESTs may help you with the assembly.
As for de novo assembly of a transcriptome: what about misassemblies?
Old 05-11-2009, 11:39 AM   #3
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 88

Though I haven't finished the project (the reads aren't all in yet), I'm doing something similar right now: no reference transcriptome, but looking for SNPs in cDNA reads from two subspecies. The first was sequenced with single-end reads, which resulted in pretty short contigs, and only roughly 1/10 of the total transcriptome was assembled. I'm recommending paired ends for the second sample, so I may have a quantitative answer for you in a couple of weeks.

The transcriptome may have more unique, assemblable sequence than the genome, but homologous domains will be a problem, and paired ends would definitely help there. That's why I'd guess that a small-insert library should help quite a bit.

I'd recommend Velvet; it still seems to be the best option out there for Illumina reads. Not sure about simulation ...
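
On the simulation question, one option is to generate synthetic reads from any set of known transcript sequences and see how well an assembler recovers them. A minimal Python sketch under stated assumptions: the function names, the 36 bp read length, the 200 bp mean fragment size and the error-free reads are all illustrative choices, not taken from a real dataset or tool.

```python
import random


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def simulate_pairs(transcript, n_pairs, read_len=36,
                   insert_mean=200, insert_sd=20, seed=0):
    """Sample error-free paired-end reads from one transcript.

    Each pair is (forward read, reverse-complemented mate), taken from the
    two ends of a fragment whose length is drawn from a normal distribution.
    All defaults are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    pairs = []
    if len(transcript) < read_len:
        return pairs  # transcript too short to yield a single read
    for _ in range(n_pairs):
        frag_len = int(round(rng.gauss(insert_mean, insert_sd)))
        # Clamp the fragment so it holds one read and fits in the transcript.
        frag_len = max(read_len, min(frag_len, len(transcript)))
        start = rng.randint(0, len(transcript) - frag_len)
        fragment = transcript[start:start + frag_len]
        pairs.append((fragment[:read_len],
                      reverse_complement(fragment[-read_len:])))
    return pairs
```

Assembling the same simulated data once as pairs and once using only the first read of each pair would give a rough feel for how much paired ends help.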
Old 05-12-2009, 05:33 AM   #4
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 124

A year ago, de novo transcriptome sequencing based solely on the Illumina GAII was a bad idea. With 72 bp PE reads and higher coverage, it is now feasible.

As Rao suggested, EST data will help with the assembly. The fact is, though, that most organisms of interest don't have comprehensive EST information, and there is often no reference genome or transcriptome available (not even from a related species). You don't know the exact size of the transcriptome, and you have to deal with repeats, paralogous genes and isoforms. It's tricky even to tell whether your assembly went wrong. How much this matters depends on the purpose of the sequencing: things are a lot easier if the goal is to discover SNPs. If the results are not satisfying, try alternatives such as sequencing with longer reads.
Old 05-12-2009, 07:11 AM   #5
jordi
Member
 
Location: València, Spain

Join Date: Apr 2009
Posts: 48

Hi all!
I'm annotating the transcriptome of a non-reference organism, similar to what you are doing. My assembly was made with GS De Novo Assembler, but I got short contigs...
I'm now trying the assembly with Mosaik, but first I have another problem: what about transposable elements? Have you tried WindowMasker or RepeatMasker? For an organism without a database of these repetitive elements, which program do you think is better?
Thanks!
Old 05-12-2009, 08:36 AM   #6
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 124

Quote:
Originally Posted by jordi
Hi all!
I'm annotating the transcriptome of a non-reference organism, similar to what you are doing. My assembly was made with GS De Novo Assembler, but I got short contigs...
I'm now trying the assembly with Mosaik, but first I have another problem: what about transposable elements? Have you tried WindowMasker or RepeatMasker? For an organism without a database of these repetitive elements, which program do you think is better?
Thanks!

Why would you worry about transposable/repetitive elements in the transcriptome? The common repeats found in a transcriptome are SSRs and low-complexity regions. I'm not referring to repeats that are several kb long (as in the genome). But if those repeats are transcribed, then yes, you will find them in the transcriptome.
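
For anyone wanting to check how many such simple repeats their contigs contain, perfect SSRs are easy to flag with a regular expression. A rough sketch, assuming a motif length of 1 to 6 bp and at least 5 tandem copies (both thresholds and the function name `find_ssrs` are arbitrary choices for illustration; dedicated repeat finders are more thorough):

```python
import re


def find_ssrs(seq, min_units=5, max_motif=6):
    """Yield (start, end, motif, copies) for perfect tandem repeats in seq.

    The motif group is lazy, so the shortest repeating unit is reported
    (e.g. 'AC' rather than 'ACAC'). Thresholds are illustrative assumptions.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_units - 1)
    )
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = (match.end() - match.start()) // len(motif)
        yield match.start(), match.end(), motif, copies


# Example: a contig carrying an (AC)6 dinucleotide repeat.
for hit in find_ssrs("TTG" + "AC" * 6 + "GGT"):
    print(hit)
```

This only catches perfect repeats; low-complexity regions with interruptions would need a different approach.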
Old 05-12-2009, 09:24 AM   #7
jordi
Member
 
Location: València, Spain

Join Date: Apr 2009
Posts: 48

Because if you don't have high coverage and the same repetitive elements can appear in different genes, how do I know which protein has been translated? So I would mask these elements.
Low coverage has been my problem with the standard GS De Novo Assembler: contig lengths of approx. 200 bp and coverage from 4X to 6X.
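
To put those numbers in perspective, the mean coverage of a contig is just the total number of read bases placed on it divided by its length. A small illustrative calculation (the helper name and the 250 bp read length are assumptions, not figures from this dataset):

```python
def mean_coverage(read_lengths, contig_length):
    """Mean depth of a contig: total read bases placed on it / contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(read_lengths) / contig_length


# Four hypothetical 250 bp reads placed on a 200 bp contig.
print(mean_coverage([250, 250, 250, 250], 200))  # 5.0
```

By the same arithmetic, 4X to 6X on a ~200 bp contig amounts to roughly 800 to 1,200 read bases in total.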
Thanks!
Old 05-12-2009, 09:31 AM   #8
jordi
Member
 
Location: València, Spain

Join Date: Apr 2009
Posts: 48

Oh, sorry. I found repetitive elements that are reverse transcriptases, located in the 3' UTRs of different genes. How can I differentiate the origin of my BLAST results?
Old 05-12-2009, 09:18 PM   #9
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 124

Quote:
Originally Posted by jordi
Oh, sorry. I found repetitive elements that are reverse transcriptases, located in the 3' UTRs of different genes. How can I differentiate the origin of my BLAST results?

The only way to identify a 3' UTR is the presence of a poly(A) tail at the end of the sequence. Considering our contigs are short, are you sure these are not misassemblies? How long is the repetitive element you found, and what is the similarity?
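
A simple way to screen contigs for that poly(A) signal is sketched below. It assumes a tail of at least 10 bases with at most one non-A base (arbitrary thresholds), and it also checks for a leading poly(T) run in case the contig represents the reverse strand. The function names are made up for this example.

```python
def has_polya_tail(seq, min_len=10, max_mismatch=1):
    """True if the last `min_len` bases are A, allowing `max_mismatch` others."""
    tail = seq.upper()[-min_len:]
    if len(tail) < min_len:
        return False  # sequence shorter than the required tail
    return sum(base != "A" for base in tail) <= max_mismatch


def has_polya_signal(seq, min_len=10, max_mismatch=1):
    """Check both orientations: poly(A) at the 3' end or poly(T) at the 5' end."""
    head = seq.upper()[:min_len]
    polyt_head = (len(head) == min_len
                  and sum(base != "T" for base in head) <= max_mismatch)
    return has_polya_tail(seq, min_len, max_mismatch) or polyt_head
```

A contig that passes this check at least ends where a transcript plausibly ends; one that does not may simply be truncated.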

If you are using BLAST to annotate your contigs, using the 3' UTR is not a good idea, because that region can vary even within the same species.

I have used CENSOR to find repeats in my ESTs, but there were no significant hits. Most hits are around 100 bp with 80% similarity (the original genomic repeat is several kb long), and each occurs only once in the ESTs. Maybe plant repeat databases are not well characterized. In the end, I just ignored them.

Found a related thread on repeat at
http://seqanswers.com/forums/showthread.php?t=1504
Old 12-01-2009, 11:28 PM   #10
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 199

So, in short, no one has done de novo transcriptome assembly for a new organism before?
Can we use a closely related species (a fish, for example) to help with de novo assembly?

How about taking it further and doing expression profiling on the new organism?
Old 12-02-2009, 06:36 AM   #11
Marta
Member
 
Location: Davis, CA

Join Date: Oct 2009
Posts: 17

We assembled the lettuce transcriptome using 85 nt IGA single reads. We used CLC and Velvet, followed by CAP3.
Old 12-02-2009, 02:00 PM   #12
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

While this is not de novo assembly of a novel transcriptome, in some ways it is better, because it can be compared against a known transcriptome (which was not used in the assembly, as far as I know):

http://bioinformatics.oxfordjournals...&pmid=19528083
Bioinformatics. 2009 Nov 1;25(21):2872-7. Epub 2009 Jun 15.
De novo transcriptome assembly with ABySS.
Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJ.

Genome Sciences Centre, Vancouver, BC V5Z 4S6, Canada. ibirol@bcgsc.ca
MOTIVATION: Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. RESULTS: Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. AVAILABILITY AND IMPLEMENTATION: Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. Assembler tool is implemented in C++. The parallel version uses Open MPI. ABySS-Explorer tool is implemented in Java using the Java universal network/graph framework. CONTACT: ibirol@bcgsc.ca.

PMID: 19528083
Old 12-15-2009, 10:04 PM   #13
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 199

Quote:
Originally Posted by Marta
We assembled the lettuce transcriptome using 85 nt IGA single reads. We used CLC and Velvet, followed by CAP3.

Are your results published in a paper already? Would love to read it!
Old 12-16-2009, 07:43 AM   #14
Peter Bjarke Olsen
Junior Member
 
Location: Denmark

Join Date: Jan 2009
Posts: 8

We have done several de novo transcriptome projects, mainly using Illumina technology and the ABySS assembler. In general it works, but the problem is getting full-length sequences (from start to stop codon). We have recently learned that some labs use co-ligation of the transcripts prior to nebulization, which should increase the number of full-length genes. The reason is that fragmentation is non-random at the ends, which leaves the ends underrepresented in the library.
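
One rough way to count how many contigs look full length in the sense above (start to stop codon) is to scan each one for a complete open reading frame. A minimal sketch, assuming the standard stop codons and only the forward strand; the function name is hypothetical, and a real check would also scan the reverse complement and compare against known protein lengths:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_complete_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Only the three forward-strand frames are scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best
```

A contig for which this returns None either lacks a start codon, lacks an in-frame stop, or both, which is what a transcript with underrepresented ends would look like.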
Old 12-16-2009, 10:02 AM   #15
Marta
Member
 
Location: Davis, CA

Join Date: Oct 2009
Posts: 17

KevinLam,

The data are unpublished. We are re-assembling the reads using the latest version of the CLC assembler and Velvet with adjusted parameters. The number of transcriptome contigs in our latest assemblies went down from ~70K to ~57K. I have a presentation online with results from last summer's assemblies here:
https://docs.google.com/fileview?id=...MWYzNjkz&hl=en

Since we correctly assembled the longest genes in plants, including BIG (>15 kb), we believe the approach works.

More technical notes on filtering the reads and Velvet parameters used are here:
http://atgc-illumina.googlecode.com/...k_090910_D.pdf
Old 12-16-2009, 06:54 PM   #16
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116

From sections 4.2 and 4.3 of the new CLC white paper, it appears that the old CLC assembler made slightly longer contigs (unpaired max: CLC 69 kbp vs. Velvet 60 kbp; N50: CLC 23 kbp vs. Velvet 16 kbp) at the expense of more incorrect ones (CLC: 36 wrong; Velvet: 1 wrong). The newer one leans too far the other way. Who knows what Velvet parameters were used; probably the ones that most closely matched the total CLC assembly size.
http://www.clcbio.com/files/whitepap...C_NGS_Cell.pdf

I'm not so sure there is a free lunch here.

Marta, what cvCut and expCov parameters did you use in your Velvet assemblies? The cvCut parameter has a huge effect on N50, assembly size, and read usage.

Last edited by Zigster; 12-16-2009 at 07:16 PM.
Old 12-16-2009, 09:22 PM   #17
Marta
Member
 
Location: Davis, CA

Join Date: Oct 2009
Posts: 17

The experiment CLC did for this white paper does not reflect the actual performance of the CLC assembler. I think the assembler is much better than what the paper claims.

I use CLC Genomics Workbench on Windows with 32 GB RAM. A few days ago I started testing the latest (beta) version of the assembler for the Workbench. It performs much better than the older one. My input is 92.5 million single transcriptome reads of up to 85 nt (IGA, filtered FASTA).

About Velvet: my understanding is that there is not much sense in changing expCov for transcriptome reads. We work with normalized mRNA libraries, but coverage still varies a lot between transcripts. About cvCut, you need to contact alex_kozik (he is a member here). He is the one who ran all the Velvet assemblies on the same set.
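
For reference, the two parameters under discussion are presumably Velvet's `-cov_cutoff` and `-exp_cov` options to `velvetg` (the mapping from the names cvCut and expCov is an assumption here). A small sketch that only builds the command lines for unpaired reads, without running anything; the directory, file name and k-mer value are placeholders, not the settings used for the lettuce assemblies:

```python
def velvet_commands(out_dir, reads_fasta, kmer=31,
                    cov_cutoff="auto", exp_cov="auto"):
    """Build (but do not run) velveth/velvetg command lines for short reads.

    All argument values are placeholders for illustration; pass the lists
    to subprocess.run() only after choosing values for your own data.
    """
    velveth = ["velveth", out_dir, str(kmer), "-fasta", "-short", reads_fasta]
    velvetg = ["velvetg", out_dir,
               "-cov_cutoff", str(cov_cutoff),  # drop low-coverage nodes
               "-exp_cov", str(exp_cov)]        # expected k-mer coverage
    return velveth, velvetg


hash_cmd, graph_cmd = velvet_commands("asm_k25", "reads.fa",
                                      kmer=25, cov_cutoff=3)
print(" ".join(hash_cmd))
print(" ".join(graph_cmd))
```

Generating the command for several `cov_cutoff` values and comparing the resulting assemblies is one way to see how sensitive N50 and assembly size are to that setting.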
Old 02-05-2010, 07:12 PM   #18
MarcelS
Junior Member
 
Location: Berlin, Germany

Join Date: Feb 2010
Posts: 5

Quote:
Originally Posted by Neil
Hi all,
Also, what software would you recommend for this?
Hope someone can help.
Best regards,
Neil

Hi Neil,
I would recommend our new software, Oases; see the thread "Oases: De novo transcriptome assembly of very short reads" or http://www.ebi.ac.uk/~zerbino/oases/.
The software is designed to cope with alternative splicing and repetitive regions, which normally break up contigs (for example, when genome assemblers are used). Oases can produce full-length transcripts if the coverage allows it, and it also supports and exploits paired-end information. And yes, paired-end information does improve the results. Oases already supports the longer reads (e.g. 75 bp) produced by current technologies.

Best,
Marcel
Old 05-17-2010, 04:20 AM   #19
blackgore
Member
 
Location: UK

Join Date: Sep 2009
Posts: 20

How are people evaluating their transcriptome assemblies? The standard N50 assessment can't be that useful, as the goal here isn't exactly to generate a tiny set of huge contigs...?
Old 05-17-2010, 06:20 AM   #20
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 124

Interesting question, blackgore! Without a reference, gene models or ESTs, how do you evaluate a de novo transcriptome assembly?

Quote:
Originally Posted by Marta
Since we correctly assembled the longest genes in plants, including BIG (>15 kb), we believe the approach works.

More technical notes on filtering the reads and Velvet parameters used are here:
http://atgc-illumina.googlecode.com/...k_090910_D.pdf

I also found a 15 kb contig homologous to Arabidopsis BIG/ubiquitin-protein ligase in my plant transcriptome, and I was told that a similar result was obtained in P. trichocarpa. Therefore, I think BIG/ubiquitin-protein ligase can serve as an indicator for plant transcriptome assembly: long genes like this won't be assembled in a poorly sequenced transcriptome.

Anyway, neither method (N50 or the longest-gene check) says much about scaffold quality. There can be scaffolds with lots of Ns due to poorly sequenced insert gaps. If you compare two datasets with the same N50 and longest contig, but one has lots of Ns, how can you tell the difference?
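
One way to tell such datasets apart is to report the fraction of N bases alongside N50. A minimal sketch, assuming scaffolds are held as plain strings (the function names are illustrative):

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def n_fraction(scaffolds):
    """Fraction of all scaffold bases that are N (gap or unknown)."""
    total = sum(len(s) for s in scaffolds)
    if total == 0:
        return 0.0
    return sum(s.upper().count("N") for s in scaffolds) / total


# Two toy assemblies with identical scaffold lengths, one padded with Ns.
clean = ["A" * 100, "C" * 50]
gappy = ["A" * 40 + "N" * 20 + "A" * 40, "C" * 50]
for name, asm in (("clean", clean), ("gappy", gappy)):
    print(name, n50([len(s) for s in asm]), round(n_fraction(asm), 3))
```

Both toy assemblies have the same N50 and longest scaffold, but `n_fraction` separates them, which exposes the gap-heavy one.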

Tags
de novo assembly, illumina, short read length, transcriptomes
