Old 04-08-2017, 01:02 PM   #1
Junior Member
Location: Sydney, Australia

Join Date: Jan 2017
Posts: 3
CLC Bio vs. Trinity for de novo transcriptome assembly

Hi there,

I have some questions regarding CLC Bio vs. Trinity for de novo RNA seq assembly. When I was told how to process my data by a postdoc in my lab, he insisted on using CLC Bio, on the default setting. I've looked into CLC Bio and I'm not sure if this is the correct way to go about things. But before I step away from a published methodology, I have to convince my supervisor too. That's where I hope you can help me.

The data:
1/ Eukaryotic, single-celled algae - dinoflagellates. The phylum is particularly known for bizarre genetic elements:
- 0.5 to 40 times the genetic content of the human haploid genome
- mRNA frequently reinserted into the genome, i.e. a minefield of truncated paralogs. They are the hoarders of the genetic world.
- ancient lineage, they've had a long time to accumulate paralogs. Some rDNA genes have in excess of 2000 copies; most phylogenetic analyses of the order that I work with are rubbish because of this.
- they have a different, still unknown mode of gene regulation that appears to be post-transcriptional, i.e. mRNA-seq data is massive and gives us a pretty good idea about the genome. We think.
- hence, no reference genomes or even transcriptomes available.
2/ Working with sequencing data from both a public database (MMETSP) and my own work. Some of the former is really quite low quality.
- public: Illumina HiSeq 2000, PE, 50bp inserts
- mine: NextSeq 500, PE, 75bp inserts, HO
- mine, second round of sequencing occurring now: NextSeq 500, PE, 150bp inserts, HO

The Problem:
I've come across someone else's (Lisa Cohen, github - really cool project) usage of the publicly available data, using Trinity and then the same quality control assessment that I had run - BUSCO (looks for single-copy genes via HMMER libraries, successor of CEGMA). So I have a direct comparison point between the BUSCO scores of my CLC Bio assemblies and her Trinity assemblies using the same RNA-seq libraries. Hers are better across the board for single-copy hits. For some transcriptomes the difference is only 2 genes, but in one or two transcriptomes it is 50 single-copy genes out of the 450 tested.
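For anyone wanting to make the same kind of comparison, here is a minimal Python sketch. It assumes the one-line summary notation that BUSCO writes in its short summary file (C, S, D, F, M percentages plus n); the two summary strings, the numbers in them and the helper names `parse_busco_line` and `single_copy_count` are invented for illustration and are not taken from the assemblies discussed in this thread.

```python
import re

# Assumed layout of the one-line BUSCO summary, for example:
# C:80.0%[S:72.5%,D:7.5%],F:8.0%,M:12.0%,n:400
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_line(line):
    """Return the percentages (floats) and n (int) from a summary line."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a recognised BUSCO summary line: " + line)
    stats = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    stats["n"] = int(match.group("n"))
    return stats


def single_copy_count(stats):
    """Convert the single-copy percentage back to a gene count."""
    return round(stats["S"] / 100 * stats["n"])


# Hypothetical summaries for the same library assembled two ways.
clc = parse_busco_line("C:70.0%[S:60.0%,D:10.0%],F:10.0%,M:20.0%,n:400")
trinity = parse_busco_line("C:80.0%[S:72.5%,D:7.5%],F:8.0%,M:12.0%,n:400")

print("single-copy difference:",
      single_copy_count(trinity) - single_copy_count(clc))
print("duplicated (%):", clc["D"], "vs", trinity["D"])
```

Looking at the duplicated (D) percentage next to the single-copy (S) one is probably worthwhile here, since a gain in S that comes with a drop in D could reflect collapsed copies rather than genuinely better recovery.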

The questions:
- what is the general knowledge/feeling about CLC Bio and Trinity? Preferences or horror stories?
- is either of the assemblers known for making mistakes?
- more directly, is either of them partial to misassembly of paralogs - if one gives me more single copy genes, is that a 'true' result or are they actually a mash up of paralogs?

Thanks, y'all!

Last edited by nurgling; 04-08-2017 at 01:07 PM.
nurgling
Old 04-12-2017, 08:09 PM   #2
Junior Member
Location: Hobart, Australia

Join Date: May 2015
Posts: 1
There is a difference

Yes, there is a difference in the quality/completeness of the assembled transcriptomes (CEGMA and BUSCO). I would suggest using Trinity and then following your pipeline with or without CLC.
juadiegaitan
Old 04-13-2017, 01:17 AM   #3
David Eccles (gringer)
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 823

Quote: what is the general knowledge/feeling about CLC Bio and Trinity
CLC Bio: a black box, costs lots of money, can't be changed/modified without Qiagen's approval

Trinity: over 1000 citations, free and open source, a prescriptive published protocol for de-novo assembly and DE analysis. I hacked the code a little to get it to work on my desktop computer.

Quote: is either of the assemblers known for making mistakes?
All assemblers make mistakes. Ignoring algorithmic errors (which are potentially fixable), it's impossible to resolve repeats that are longer than the template length (and/or with a repeat unit that is longer than the read length). Sequencers make mistakes, which the assemblers can propagate. Transposable elements mess up assemblies if they occur at multiple points throughout the genome. Assembly of single cells will be incomplete. Assembly of pooled multiple cells (or organism populations) will have cell-specific variation. Transcriptome assemblies based on poly-A selected transcripts will be incomplete. Transcriptome assemblies will also be incomplete to varying degrees, depending on which genes are active at the time of sampling.
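The point about repeats can be illustrated with a toy Python sketch. It assumes error-free, single-end reads taken at every position, and all sequences in it are made up. Two different arrangements of the same unique segments around a 7 bp repeat yield exactly the same set of 5 bp reads, so no assembler could tell them apart; reads long enough to span the repeat plus a base on each side do distinguish them.

```python
from collections import Counter


def read_spectrum(seq, read_len):
    """Multiset of every error-free read of length read_len from seq."""
    return Counter(seq[i:i + read_len]
                   for i in range(len(seq) - read_len + 1))


REPEAT = "GATTACA"                        # made-up 7 bp repeat unit
a, b, c, d = "AAC", "CCG", "TTG", "GGT"   # made-up unique segments

# Same segments, but b and c swap places between the repeat copies.
order_1 = a + REPEAT + b + REPEAT + c + REPEAT + d
order_2 = a + REPEAT + c + REPEAT + b + REPEAT + d

# Reads shorter than the repeat: the two read sets are identical.
short_reads_match = read_spectrum(order_1, 5) == read_spectrum(order_2, 5)

# Reads that span a whole repeat copy plus flanking bases: they differ.
long_reads_match = read_spectrum(order_1, 12) == read_spectrum(order_2, 12)

print(short_reads_match, long_reads_match)
```

Paired-end data moves the limit from the read length to the template length, but the same ambiguity returns once the repeat is longer than the template.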

Quote: more directly, is either of them partial to misassembly of paralogs - if one gives me more single copy genes, is that a 'true' result or are they actually a mash up of paralogs?
While it might be possible to resolve paralogs if they have different expression levels (which are consistent throughout the transcript), you need to do a genome-guided assembly to have any hope of properly assembling paralogs with shared sequence.
gringer
Old 04-14-2017, 05:01 PM   #4
Senior Member
Location: US

Join Date: Dec 2010
Posts: 344

Trinity is the standard for de novo transcriptome assemblies. Thus the artifacts it produces are also relatively well known.
Sorry, I have never used CLC for this purpose. I would suggest contacting CLC for suggested settings for transcriptomes (I can't imagine the defaults are optimal).
For genome assemblies CLC has the advantage that it will work with all kinds of data (all kinds of read lengths, paired or not paired, and even low quality data). In short it is extremely robust for this purpose.
luc
Old 04-24-2017, 10:00 PM   #5
Senior Member
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 163

I contacted QIAGEN support last year about this topic. CLC Genomics Workbench has no specific algorithm for assembling RNA-seq data. The support officer explained:

Quote: The CLC de novo assembly tool was designed with genomic data in mind. At the moment we have no tool that is specific to transcriptomic data assembly. This means that there is no step or action that explicitly handles cases of alternative splicing.
Also, CLC Genomics Workbench ignores RNA-seq strandedness.

Quote: You cannot utilize the strand-specific information in the RNA-seq data for the de novo assembly job. So, it does not matter if you have unstranded data.
You'd be silly to choose CLC Genomics Workbench instead of Trinity for transcript assembly. CLC Genomics Workbench is so behind the times it can't even export sorted and indexed BAM files to disk.

Quote: BAM format files exported from the Workbench are neither sorted nor indexed. If pairs are not on the same contig, the mates will be exported as single reads.
Dario1984
Old 07-05-2017, 07:09 PM   #6
Junior Member
Location: Sydney, Australia

Join Date: Jan 2017
Posts: 3
Reply to Dario1984

Dear Dario,

Sorry for not posting a response. My phone selectively doesn't submit things, and responding to you was one of those.

Thank you! This is exactly what I was looking for and was what I needed to convince my boss to switch away from CLC Bio.

nurgling

clc bio, de novo transcriptome, trinity
