Two related transcriptomes: merging but avoiding fake fusion transcripts

danwiththeplan

Member

Join Date: Sep 2011

Posts: 72
#1


01-30-2014, 04:48 PM

Hi. I'd like to discuss a situation that has been partially discussed in this thread:

Merging transcripts of two genotypes - SEQanswers

http://seqanswers.com/forums/showthread.php?t=33917

I have RNAseq data from two cultivars of a polyploid plant. I've taken the approach of doing a separate de novo transcriptome assembly for each cultivar.

Examining the transcriptomes reveals that about 1/3 of the transcripts are unique to a cultivar, and 2/3 of the transcripts have a very close or identical BLAST hit in the other transcriptome, but these hits are rarely full-length. Few transcripts actually have a 100% full-length match in the other cultivar/genotype.

Differential expression analysis across conditions within the same cultivar is no problem; I can do that with standard approaches. However, I'm not entirely sure how to do differential expression analysis across cultivars/genotypes.

I know that I somehow need to produce a "combined" reference transcriptome, by one of two approaches:

(1) Simply throw all RNAseq reads from both cultivars into a new de novo assembly (which I'm doing now)

(2) Combine the two existing de novo assemblies into a new assembly using an OLC-based method like CAP3 or MIRA

To me, the main thing to avoid is an assembly containing "fake" transcripts that are half from one cultivar and half from the other. I can see approach (1) doing that a lot, because de novo assembly breaks everything into k-mers, so you lose the information about which full-length transcripts come from which cultivar/genotype. I am thinking that approach (2) is better for avoiding "fake" fusion transcripts, since it starts from long transcripts that are known to come from just one cultivar/genotype.
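To make the k-mer concern concrete, here is a minimal Python sketch. The two toy "alleles", the value of k, and the helper names are all made up for illustration; this is not how any particular assembler is implemented. It only shows that once reads from both cultivars are pooled into one k-mer set, a chimeric sequence can be spelled entirely from pooled k-mers even though neither cultivar alone supports it.

```python
def kmers(seq, k):
    """Return the set of all k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_spellable(seq, kmer_pool, k):
    """True if every k-mer of seq is present in kmer_pool."""
    return kmers(seq, k) <= kmer_pool


K = 5  # toy value; real assemblies use much larger k

# Hypothetical alleles of one gene, differing at two positions that
# are further apart than K.
cultivar_a = "ATGGCTAAGTCCGATTACGGATGA"
cultivar_b = "ATGGCAAAGTCCGATTACGCATGA"

# A "fake" transcript: first half from cultivar A, second half from B.
chimera = cultivar_a[:12] + cultivar_b[12:]

pooled = kmers(cultivar_a, K) | kmers(cultivar_b, K)

print(is_spellable(chimera, kmers(cultivar_a, K), K))  # A alone
print(is_spellable(chimera, kmers(cultivar_b, K), K))  # B alone
print(is_spellable(chimera, pooled, K))                # pooled k-mers
```

Whether a real assembler actually follows such a chimeric path depends on coverage, read pairing, and its own heuristics, so this only illustrates why the pooled approach makes the error possible.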

PS. Did I mention it's a horrible polyploid and there's no genome?

Does anyone have an opinion or similar experience?
Tags: cap3, denovo, rnaseq, transcriptome, trinity
dongilbert

Junior Member

Join Date: Jun 2012

Posts: 9
#2

03-13-2014, 07:45 PM

You don't say how you did the single-cultivar assemblies that came out short, but if it was Trinity, then add Velvet/Oases and SOAPdenovo-Trans and/or Trans-ABySS, all of which give you more complete assemblies if your input is paired-end reads. Use multiple k-mer sizes, up to the read length, as that gives a more complete assembly of the highly expressed genes.
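As a rough illustration of the multi-k-mer advice, the following Python sketch generates a series of k values to run an assembler over. The function name, the starting k of 21, and the step of 10 are arbitrary assumptions, not recommendations from any assembler's documentation; the only constraints encoded are that k stays odd (which de Bruijn assemblers such as Velvet expect) and below the read length.

```python
def kmer_series(read_length, k_min=21, step=10):
    """Return odd k values from k_min up to (but below) read_length.

    k_min and step are hypothetical defaults chosen for illustration.
    """
    if k_min % 2 == 0:
        k_min += 1   # keep k odd
    if step % 2 == 1:
        step += 1    # an even step keeps every later k odd as well
    return list(range(k_min, read_length, step))


# Example: 100 bp paired-end reads
for k in kmer_series(100):
    print(k)
```

Each k in the series would be one separate assembly run; the resulting transcript sets are then pooled and reduced to the best gene subset.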

See here for software that picks your best gene subset of several transcript assemblies of the same data:

EvidentialGene

http://arthropods.eugenes.org/EvidentialGene/

See about/EvidentialGene_trassembly_pipe.html for the software.

This paper is an independent comparison of methods with essentially the same conclusion: combining the best of several assemblers, using CDS-size metrics, gives you the most complete genes:

PLOS One

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0091776

The artifact gene joins (fake fusions) are exacerbated by post-assembly mergers such as CAP3 and Velvet/Oases -merge (maybe also MIRA; I've not tested that, though). In general, the post-assembly mergers don't use all the read-pair info and make more mistakes by joining things that don't belong together.

You can use your mixed-cultivar read set for another assembly if the re-assembly with other assemblers above doesn't help enough. The CDS-selection pipeline I've built throws out those mistakes, as the CDS never spans gene joins (too many stop codons).
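The stop-codon argument can be sketched in a few lines of Python. This is not EvidentialGene's code, just a toy longest-ORF finder (forward strand only, ATG to first in-frame stop) applied to a made-up fusion of two hypothetical genes; all sequences and names here are assumptions for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward
    strand. end is exclusive and includes the stop codon."""
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best


# Two hypothetical genes joined by a short spacer, as in a fake fusion.
gene_a = "ATG" + "GCT" * 10 + "TAA"   # 36 nt
gene_b = "ATG" + "AAA" * 6 + "TGA"    # 24 nt
fusion = gene_a + "CC" + gene_b

start, end = longest_orf(fusion)
print(start, end, len(fusion))
```

In this toy case the longest ORF stops at the end of the first gene, so a selection step that keeps only the CDS would retain one gene's coding region and drop the joined-on part. A real pipeline would also scan the reverse strand and handle partial ORFs.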

There are several tips here that work to improve mRNA assemblies:

http://arthropods.eugenes.org/EvidentialGene/evigene/docs/perfect-mrna-assembly-2013jan.txt

For matching your two cultivars, I suggest matching CDS as well or instead, as much of the assembly difference (artifacts, shortness) will be in the UTRs. You may also want to measure expression differences only on the CDS (or CDS + 100 bp) to avoid those assembly artifacts.
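One way to picture the "CDS + 100 bp" idea is as a coordinate window on each transcript, clipped to the transcript ends. The Python helper below is a hypothetical sketch (0-based, end-exclusive coordinates assumed); how the window is then passed to a read counter depends on the tools you use.

```python
def cds_window(transcript_len, cds_start, cds_end, flank=100):
    """Return (start, end) of the CDS extended by `flank` bases on each
    side, clipped to the transcript. Coordinates are 0-based, end-exclusive."""
    start = max(0, cds_start - flank)
    end = min(transcript_len, cds_end + flank)
    return start, end


# Long UTRs: the window trims them to 100 bp on each side.
print(cds_window(1000, 250, 700))
# Short UTRs: the window is clipped to the whole transcript.
print(cds_window(500, 40, 480))
```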