SEQanswers > Applications Forums > De novo discovery

04-15-2014, 01:01 PM   #1
Junior Member
Location: Washington

Join Date: Apr 2013
Posts: 3
Reducing potentially chimeric contigs during assembly

(Cross posted to the Trinity mailing list, but I wanted to see what SEQanswers thought about the problem)

I’m running an RNA-seq experiment using a de novo assembled transcriptome for a non-model organism (a beetle), where we have multiple treatments (diet and sex) and 4 individuals per treatment. Furthermore, we have sequenced 4 different tissues per individual (barcoded separately). I’ve encountered an interesting situation and wanted some suggestions on how to resolve it. After using Trimmomatic to remove adapter sequences (but not to trim for sequence quality) and diginorm to normalize, I assembled the transcriptome in two different ways. First, I generated assemblies from each individual specimen, using all the tissue libraries from that individual, with the intention of combining all of the libraries together. Second, I pooled all the reads (post diginorm) across all individuals, ran a second round of diginorm, and then assembled a transcriptome from those reads. I modified the recommendations in the Nature Protocols paper (Haas et al. 2013) slightly (see below).

I used Trinity version: trinityrnaseq-r2013-02-25
In both methods of assembly generation I used the options below, with the only difference being an increase in --JM to 60G for the pooled assembly:

--seqType fq --JM 20G --min_kmer_cov 2 --CPU 4 --left left.fq --right right.fq --min_contig_length 300

When I compared assembly metrics, something stood out to me: each individually assembled transcriptome contained a similar number of components (~16,000 components per library). This is well in line with the number of “genes” in other related beetles (Tribolium and the dung beetles, for instance). It is also quite similar to what we observed from our previous assembly based on 454 sequencing ( for this same species.

However, the assembly that came from the reads pooled across individuals had a far larger number of components (~40,000 components!). Clearly this is artificially high, almost certainly due to the degree of polymorphism among individuals. Yet we want a single transcriptome: it will be used for mapping reads for differential expression analysis, at least at the gene (well, component) level.
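For comparing assemblies on equal terms, here is a minimal Python sketch of how the component and transcript counts could be tallied from the FASTA headers. It assumes the headers follow the `compN_cM_seqK` naming used by Trinity releases of this era, with `compN_cM` treated as the component; the function name and the example records are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: headers look like ">comp12_c0_seq3 len=... path=[...]",
# and "comp12_c0" identifies the component ("gene").
HEADER_RE = re.compile(r"^>(comp\d+_c\d+)_seq\d+")

def count_components(fasta_lines):
    """Return (n_components, n_transcripts, max transcripts in one component)."""
    per_component = defaultdict(int)
    for line in fasta_lines:
        match = HEADER_RE.match(line)
        if match:
            per_component[match.group(1)] += 1
    n_transcripts = sum(per_component.values())
    largest = max(per_component.values(), default=0)
    return len(per_component), n_transcripts, largest

# Tiny made-up example: two components, one of them with two transcripts.
example = [
    ">comp0_c0_seq1 len=312 path=[0:0-311]",
    "ACGTACGT",
    ">comp0_c0_seq2 len=420 path=[0:0-419]",
    "ACGTACGA",
    ">comp1_c0_seq1 len=305 path=[5:0-304]",
    "TTGACCGA",
]
print(count_components(example))
```

For a real assembly, the same function can be fed an open file handle, e.g. `count_components(open("Trinity.fasta"))`. Reporting transcripts per component alongside the component count helps separate "more components" from "more isoforms per component".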

My question is this: what sorts of parameters should I vary when using Trinity to reduce chimeric transcript reconstructions that are likely due to polymorphism? I’m not specifically concerned about alternative transcripts at the moment, just with generating a more biologically reasonable set of components that is not inflated by polymorphism. More specifically, I guess I want to know how to make the component generation and selection process more conservative.

Would running Trinity with the --CuffFly option reduce the number of components generated, or does that only affect the alternatively spliced transcripts? Similarly, do --min_per_id_same_path and the related options affect alternative splice variants?

Or is it a better idea to run Inchworm with the --jaccard_clip flag?

As I mentioned above, the data I’m using is adapter trimmed (Trimmomatic), normalized (diginorm), Illumina 50bp paired-end data. Thanks in advance!

Rzinna
04-16-2014, 02:43 AM   #2
Location: Cambridge, UK

Join Date: Dec 2011
Posts: 48

Firstly, I would say 40,000 is not at all an unrealistic number of genes. Your previous assemblies are quite likely to be massive underestimates of the true set of transcripts. 454-based assemblies tend to reconstruct many fewer transcripts, and there is no reason to assemble each sample individually - pooling is much more likely to assemble a higher proportion of true transcripts. I would trust your pooled assembly more than any of the others.

Secondly, why didn't you do any quality trimming of your reads? You should inspect the read quality distributions with fastqc or a similar tool, as almost all read sets require a bit of quality trimming. There's some argument that very stringent trimming is a bad thing, but no trimming at all will lead to errors causing problematic false isoforms.
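To make "a bit of quality trimming" concrete, here is a minimal Python sketch of light 3' trimming: bases are removed from the end of a read while their Phred score is below a threshold. It is a simplified stand-in, similar in spirit to a trailing-quality trim, and not a reimplementation of any particular tool. The threshold of 5 and the Phred+33 encoding are assumptions.

```python
def trim_trailing(seq, qual, min_q=5, offset=33):
    """Drop 3' bases whose Phred quality is below min_q.

    seq and qual are the sequence and quality lines of one FASTQ record.
    offset=33 assumes Phred+33 encoding (an assumption about the data).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Made-up read: the last two bases have Phred 2 ('#'), the rest Phred 40 ('I').
read = ("ACGTACGTAC", "IIIIIIII##")
print(trim_trailing(*read))
```

Only the low-quality tail is removed; a low-quality base in the middle of the read is left alone, which keeps the trimming gentle.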

Finally, why do you think polymorphism would cause chimeras? It should cause bubbles in the graph, which will lead to more isoforms, but not more components or chimeras. If you do have chimeras, the best thing to do with them is to split them after the assembly. If you have isoform inflation due to polymorphisms, you can collapse those by clustering with CD-HIT-EST with the sequence identity threshold set to 99%.
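As a rough illustration of the clustering step, the Python sketch below only builds a cd-hit-est command line, it does not run it. The file names, thread count and memory limit are placeholders. It uses the documented cd-hit-est options `-c` (identity threshold), `-n` (word size, 10 being suitable for thresholds of 0.95 and above), `-T` (threads) and `-M` (memory in MB).

```python
import shlex

def cdhit_est_command(in_fasta, out_fasta, identity=0.99, threads=4, memory_mb=8000):
    """Build (but do not run) a cd-hit-est command as a list of arguments."""
    if not 0.95 <= identity <= 1.0:
        # Lower thresholds need a smaller word size (-n); not handled in this sketch.
        raise ValueError("this sketch only covers identity thresholds from 0.95 to 1.0")
    return [
        "cd-hit-est",
        "-i", in_fasta,         # input transcripts (placeholder file name)
        "-o", out_fasta,        # output: one representative per cluster
        "-c", str(identity),    # sequence identity threshold
        "-n", "10",             # word size suitable for thresholds >= 0.95
        "-T", str(threads),
        "-M", str(memory_mb),
    ]

cmd = cdhit_est_command("Trinity.fasta", "Trinity.cdhit99.fasta")
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where cd-hit-est is installed.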

Last edited by Blahah404; 04-16-2014 at 02:47 AM.
Blahah404
04-16-2014, 11:14 AM   #3
Junior Member
Location: East Lansing, Michigan

Join Date: Jul 2013
Posts: 1

Thanks so much for the comments.

We think the figure of 40,000 components ("genes") is suspect based on observations from many related species of beetles. Both previous transcriptome assemblies and, importantly, reasonably well assembled and annotated genomes like Tribolium (the flour beetle) are all consistent with the ~16,000 number for genes.

As far as we know there has been no whole genome duplication (or triplication) leading up to the lineage we are studying, and our previous analysis and assembly from 454 data (also across multiple individuals) also had a similar number of genes for this species.

It is only when we take all of the samples (pooled, then run through diginorm) and do a Trinity assembly that the problem occurs. While it is possible that we get ~16,000 components purely by coincidence when we do assemblies for each individual separately, this seems unlikely given the vagaries of sequencing depth. Both the a priori considerations and our own recent observations suggest that the problem lies with the assembly of the pooled data alone, and thus the most likely culprit is genetic variation among individuals causing reads to be assembled into multiple components when they differ only because of that variation.

As for your other question (why we trimmed only the adapters, but not for sequence quality): this is based on some discussions that started with these posts ( & and paper ( We plan to go back and do some light quality trimming as a check.
idworkin
04-17-2014, 10:22 AM   #4
Junior Member
Location: indiana

Join Date: Jun 2012
Posts: 9

One general comment: worry first about getting the correct gene assemblies before you worry about a proper number of genes. Each of your treatments and tissues is expected to express a somewhat different gene set. You will need to do orthology measures on any final gene collection, but each of your assemblies will have some of the best models for some loci.

You can find advice and software here on how to best select your beetle genes from multiple transcript assemblies

This includes a pine beetle example and other insects, plants, animals (the pine beetle mRNA-assembled gene set is more ortho-complete than a pine beetle genome-gene set, where ortho-complete means both more ortholog loci and longer, fuller proteins). If you add more assembly methods, e.g. oases/velvet, soap-trans, and data slices, that will give you the most complete gene set, after selecting out the best models of each locus from your input assemblies. I find repeatably that velvet/oases and soapdenovo-trans give more complete gene sets than Trinity, but drawing on them all gives you the most complete set. I typically need to generate several million transcript assemblies to get "just right" accurate gene sets of animals and plants.
dongilbert
