
**westerman** · 09-05-2013, 09:08 AM

The best way is to go back to square one. Combine the reads from both populations and run Trinity on the combination.
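A minimal sketch of what "combine the reads" could look like in practice. It only builds the command line and does not run anything. The file names are made up, and the flag names (`--seqType`, `--left`, `--right`, `--CPU`, `--max_memory`, `--output`) are the ones used by recent Trinity releases as far as I know; older releases named some of them differently, so check `Trinity --help` for your version. It also assumes Trinity accepts comma-separated file lists, so the populations do not have to be concatenated by hand.

```python
def trinity_command(left_files, right_files, out_dir, cpu=8, max_memory="50G"):
    """Build (but do not run) a Trinity command for pooled paired-end FASTQ reads.

    left_files and right_files must list the populations in the same order,
    so that each left file lines up with its matching right file.
    """
    if len(left_files) != len(right_files):
        raise ValueError("need the same number of left and right files")
    return [
        "Trinity",
        "--seqType", "fq",
        "--left", ",".join(left_files),    # assumed: comma-separated list is accepted
        "--right", ",".join(right_files),
        "--CPU", str(cpu),
        "--max_memory", max_memory,
        "--output", out_dir,
    ]


# Hypothetical file names for the two populations.
cmd = trinity_command(
    ["popA_1.fq", "popB_1.fq"],
    ["popA_2.fq", "popB_2.fq"],
    "trinity_pooled",
)
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the paths are real.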

**FroggyFlox** · 09-05-2013, 09:59 AM

That would be my best advice too.

As the assembly is based on the reads you're feeding to Trinity, it is probably more appropriate to combine all your reads and make a new assembly.

Also, if you're working on Drosophila, you might want to have a look at the genome-guided pipeline offered by Trinity. I'm not working on Drosophila, but it seems to me that you could easily use a well-annotated Drosophila genome to improve your assembly.
Here's the link: http://trinityrnaseq.sourceforge.net...d_trinity.html

**Jeremy** · 09-06-2013, 12:16 AM

Yup, I would also combine the reads and redo the Trinity assembly.
If that isn't an option for some reason, then you can try cd-hit-est to group transcripts that are the same between both assemblies.
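A rough sketch of the cd-hit-est route, assuming two Trinity FASTA files. Two independent Trinity runs can produce the same contig names, so the headers are tagged with a population label first (the labels and file names here are hypothetical). The command is only built, not run. `-i`, `-o`, `-c` and `-T` are standard cd-hit-est options; the 0.95 identity threshold is just a starting guess and not a recommendation.

```python
def tag_fasta(lines, tag):
    """Prefix every FASTA header with a tag so IDs from two assemblies cannot collide."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + tag + "_" + line[1:]
        else:
            yield line


def cdhit_est_command(fasta_in, fasta_out, identity=0.95, threads=4):
    """Build (but do not run) a cd-hit-est command that clusters similar transcripts."""
    return [
        "cd-hit-est",
        "-i", fasta_in,        # concatenated, tagged transcripts
        "-o", fasta_out,       # representative sequence of each cluster
        "-c", str(identity),   # sequence identity threshold
        "-T", str(threads),
    ]


# Toy records standing in for two Trinity.fasta files.
pop_a = [">comp0_c0_seq1 len=8\n", "ACGTTGCA\n"]
pop_b = [">comp0_c0_seq1 len=8\n", "ACGTTGCA\n"]
merged = list(tag_fasta(pop_a, "popA")) + list(tag_fasta(pop_b, "popB"))
print("".join(merged), end="")
print(" ".join(cdhit_est_command("merged.fasta", "merged.cdhit.fasta")))
```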

**rskr** · 09-06-2013, 05:12 AM

You might get lucky, but depending on how close your strains are you might also get an assembly that is worse than either of the two individual assemblies, if you use the pooled approach.

**Dampor** · 09-06-2013, 05:51 AM

Thanks guys!
The last comment from rskr concerns me, actually.

Wouldn't it be better if I could combine transcripts from the two pools by matching their translations, i.e. by using protein sequences?
What tools would allow me to do that?
Thanks again

**Jeremy** · 09-08-2013, 09:44 PM

While a combined assembly might be slightly worse at a few divergent loci, you will at least get an assembly that will allow you to do some comparisons. Merging two independent assemblies will run into most of the same problems as producing a single assembly, plus some extra ones.

If I were doing the same on a species with a poorly annotated genome, I would try a merged assembly and also try grouping the two assemblies using cd-hit-est (as for comparing translations of the two, you could use tblastx), then I would pick the method that gives the 'best' results.

But Drosophila is a fantastically annotated species, why not map against the genome using a gff file for gene locations?

**Dampor** · 09-09-2013, 12:14 AM

Hi Jeremy

Unfortunately, there is no sequenced genome for this species yet.
The closest sequenced genome is D. pseudoobscura, which is already 9% divergent (just at coding sequences).
Thanks.
I will try both ways then.

**keithforest** · 09-20-2013, 06:30 AM

I've tried assembling combined reads vs. assembling separate reads for several species, and combining reads has always given a more contiguous and accurate assembly.

If you have > 200 million combined reads, I strongly recommend using Trinity In-Silico Normalization, which will give a good assembly in much shorter run time when compared to non-normalized assembly.
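To give an idea of why normalization shortens the run, here is a toy sketch of the general idea: reads whose k-mers are already well covered get dropped, so highly expressed transcripts stop dominating the input. This is a simplified streaming variant and not Trinity's actual implementation; the k-mer size and the coverage cutoff are arbitrary values picked for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, max_cov=3):
    """Keep a read only while the median count of its k-mers is below max_cov."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < max_cov:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the coverage
    return kept


# Ten copies of an abundant read plus one rare read.
reads = ["ACGTTGCA"] * 10 + ["GGATCCTA"]
kept = normalize(reads)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The abundant read is capped at the cutoff while the rare read survives, which is the behaviour you want before assembly.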

**Blahah404** · 09-21-2013, 01:17 PM

If you have two completed assemblies, the easiest thing is to merge them with an overlap-layout-consensus assembly rather than go back to square one as others have suggested.

I would use cap3 after concatenating the two FASTA files together, and then use GapFiller with the full set of reads to see if you can improve contiguity any further. Redoing the whole assembly is more likely to lead to more assembly artefacts.
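A small sketch of the cap3 step, assuming the two assemblies have already been concatenated into one FASTA file (the file name is hypothetical). It only builds the command and lists the output files cap3 is expected to write next to the input. `-o` (minimum overlap length) and `-p` (minimum overlap percent identity) are shown with what I believe are cap3's defaults; check them against your copy. The GapFiller step is left out because its options depend on the library setup.

```python
def cap3_plan(merged_fasta, min_overlap=40, min_identity=90):
    """Describe a cap3 run: the command to execute and the files it should produce."""
    return {
        "command": [
            "cap3", merged_fasta,
            "-o", str(min_overlap),    # overlap length cutoff
            "-p", str(min_identity),   # overlap percent identity cutoff
        ],
        # cap3 writes its results alongside the input file.
        "contigs": merged_fasta + ".cap.contigs",    # sequences that were merged
        "singlets": merged_fasta + ".cap.singlets",  # sequences left unmerged
    }


plan = cap3_plan("both_assemblies.fasta")
print(" ".join(plan["command"]))
print("merged set = " + plan["contigs"] + " + " + plan["singlets"])
```

The final merged transcript set is the contigs file plus the singlets file; dropping the singlets would throw away every transcript found in only one assembly.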

**westerman** · 09-23-2013, 08:45 AM

@Blahah404. I am not sure if your point applies to RNAseq (aka transcriptome) projects. For genome projects, sure, combine assemblies in order to increase contig length; ideally you would end up with chromosome-size contigs. But for RNAseq projects we generally have enough read depth to make full-length contigs. What we lack are the rare transcripts and alternative splicing. For that we need as many reads as possible so that the data is not lost in the noise. Thus combining read sets is a good idea.

To put it in very simple terms: assume that there is a rare transcript that is expressed once in sample A and once in sample B. The assembly process might very well throw away that rare transcript because it is indistinguishable from noise (i.e., spurious machine-error reads that are found only once). However, combine the two data sets and that rare transcript will be found twice, thus bringing it out of the noise.
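The same argument as a toy calculation. This is only an illustration: the "assembler" here is nothing more than a k-mer count filter, and the minimum count of 2 is a hypothetical threshold, not a Trinity setting. Each sample holds one read from the rare transcript plus one unrelated noise read.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def solid_kmers(reads, k=5, min_count=2):
    """k-mers seen at least min_count times, i.e. those that rise above the noise."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, n in counts.items() if n >= min_count}


rare = "ACGTTGCAAG"                  # the rare transcript, seen once per sample
sample_a = [rare, "GGGATCCTAT"]      # second read is unrelated noise
sample_b = [rare, "TTTCAGGACC"]      # a different noise read

print(len(solid_kmers(sample_a)))             # nothing survives in A alone
print(len(solid_kmers(sample_b)))             # nothing survives in B alone
print(len(solid_kmers(sample_a + sample_b)))  # the rare transcript's k-mers survive
```

Pooled, only the rare transcript's k-mers reach the threshold; the two noise reads still occur once each and are filtered out.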

@Dampor. You could use the protein translations to combine data sets. It is probably superior to combining nucleotide assemblies. But as above you may lose the 'power' to resolve low-expression transcripts and may not be able to determine alternative splicing.

**Blahah404** · 09-23-2013, 11:26 AM

@westerman yes, I'm talking about de-novo transcriptome assemblies. In most cases assemblers produce full-length contigs for only a fraction (~40-60%) of transcripts that are represented, at least in our tests with plant species (we've got ~960 species sequenced for 1KP). Fragmentation is a problem, and post-assembly OLC and gapfilling improves quantification accuracy as well as the ability to analyse UTRs.

I agree with you that alternative splicing and low-abundance transcript information could be lost with the strategy I suggested - whether that matters depends on the purpose of the assembly. But I disagree that a crude pooling of the reads is the best strategy. By pooling a larger set of reads you also pool the errors - doubling sequencing depth increases the number of true positive assemblies up to a point, and increases false positives too (e.g. false chimeras and false-bubble isoforms). You can't distinguish novel, low abundance isoforms from high-abundance errors.

An intermediate strategy with the benefits of both would be to do the OLC + gapfiller merge as I suggested, then to pool the reads and filter out pairs mapping concordantly to the merged contigs. That leaves you with the set of reads that was not included in the original assembly. You could then do a second merged assembly of the contigs with the unused reads, preserving the contigs you've already assembled and harnessing any pooled gain in abundance for transcripts that were too low abundance in either sample to assemble the first time round.
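A minimal sketch of the filtering step, assuming the pooled reads have been mapped to the merged contigs and the alignments are available as SAM text. It relies on the standard SAM flag bit 0x2 ("properly paired"), which aligners set for concordant pairs, and returns the names of pairs that never mapped concordantly. The read and contig names are made up.

```python
PROPER_PAIR = 0x2  # SAM flag bit: both mates aligned as a proper (concordant) pair


def unused_read_names(sam_lines):
    """Names of read pairs with no concordant alignment to the merged contigs."""
    seen = set()
    concordant = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if flag & PROPER_PAIR:
            concordant.add(name)
    return seen - concordant


# Toy SAM records: r1 maps concordantly, r2 is unmapped, r3 maps discordantly.
sam = [
    "@SQ\tSN:contig1\tLN:1000",
    "r1\t99\tcontig1\t10\t60\t4M\t=\t200\t194\tACGT\tIIII",
    "r1\t147\tcontig1\t200\t60\t4M\t=\t10\t-194\tTTGC\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tGGAT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tCCTA\tIIII",
    "r3\t65\tcontig1\t50\t60\t4M\tcontig2\t5\t0\tAAAC\tIIII",
    "r3\t129\tcontig2\t5\t60\t4M\tcontig1\t50\t0\tCCCA\tIIII",
]
print(sorted(unused_read_names(sam)))
```

The returned names identify the reads to pull from the original FASTQ files for the second assembly round.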

edit: of course, if you're doing a reference-guided assembly, or a de-novo assembly for a species with a reference genome, you have the luxury of not worrying too much about artefacts because you can identify them using the genome. My comments only apply for de-novo assembly of a species with no available genome sequence.

**westerman** · 09-24-2013, 07:55 AM

@Blahah404: 960 species is an order of magnitude more than I have seen come through our lab, so I will defer to your experience. I'll give your cap3/GapFiller method a try sometime and see how it compares to a full de novo assembly.

**Dampor** · 09-24-2013, 08:03 AM

Thank you guys,

I have actually decided to try both approaches.

My results will follow in this thread.

@westerman, also, what tool will allow me to combine data sets using protein translations?

Cheers

**westerman** · 09-24-2013, 08:05 AM

Originally posted by Dampor:
"@westerman, also, what tool will allow me to combine data sets using protein translations?"

Not sure. I've never tried it. But it seemed like an interesting approach.

merge two Trinity transcriptome assemblies into one
