SEQanswers > Applications Forums > RNA Sequencing
Old 12-20-2015, 09:33 PM   #1
huan
Member
 
Location: China

Join Date: Oct 2010
Posts: 55
Default De novo RNA-Seq Assembly using Trinity gives too many unigenes.

Hi,
I have performed a de novo RNA-Seq assembly with Trinity on my fish data, but I get 333,843 unigenes with an N50 length of only 396 bp. I think that is too many unigenes. Which parameters should I set to reduce the number of unigenes and increase the N50 length?
My Trinity version is trinityrnaseq_r20131110, and all parameters are at their defaults.
Thanks a lot for any suggestions!
__________________
happy
Old 12-21-2015, 08:25 AM   #2
arthurmelo
Member
 
Location: Durham, NH, US

Join Date: Jul 2012
Posts: 19
Default

Hi, assuming you have already filtered your data on quality (Phred score Q > 20), you could certainly set the minimum acceptable contig length to 500 bp, for example with --min_contig_length 500. If you have not filtered your reads on quality, I encourage you to use a tool such as Trimmomatic to eliminate sequencing errors, which can degrade the resolution of the de Bruijn graphs and create a large number of unresolved bubbles.
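The two suggestions above might look roughly like this on the command line (a sketch only: the read file names, adapter file, and CPU/memory settings are placeholders to adapt to your data, and option names can vary between tool versions):

```shell
# 1) Quality/adapter trim the raw paired-end reads with Trimmomatic.
#    (adapters.fa and the read file names are placeholders)
java -jar trimmomatic-0.39.jar PE \
    reads_1.fq.gz reads_2.fq.gz \
    reads_1P.fq.gz reads_1U.fq.gz reads_2P.fq.gz reads_2U.fq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36

# 2) Assemble the surviving read pairs, keeping only contigs >= 500 bp.
#    (--JM is the Jellyfish memory option used by Trinity releases of that era)
Trinity --seqType fq \
    --left reads_1P.fq.gz --right reads_2P.fq.gz \
    --min_contig_length 500 \
    --CPU 8 --JM 50G --output trinity_out
```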
Old 12-21-2015, 05:04 PM   #3
huan
Member
 
Location: China

Join Date: Oct 2010
Posts: 55
Default

Thanks a lot @arthurmelo.
We filtered out low-quality reads by deleting any read in which fewer than 80% of bases have a Phred score Q > 20. As you suggest, we should filter the reads again with Trimmomatic; I'll give it a try. Thanks a lot!
Since setting the minimum acceptable contig length to 500 bp would lose too many contigs, I'll only try that if I have no other choice.
BTW, is there any other way to reduce the number of unigenes, or any other way to filter the data?

Last edited by huan; 12-22-2015 at 04:31 PM.
Old 12-22-2015, 11:06 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,142
Default

Quote:
Originally Posted by huan
Thanks a lot @arthurmelo.
We filtered out low-quality reads by deleting any read in which fewer than 80% of bases have a Phred score Q > 20. As you suggest, we should filter the reads again with Trimmomatic; I'll give it a try. Thanks a lot!
Since setting the minimum acceptable contig length to 500 bp would lose too many contigs, I'll only try that if I have no other choice.
BTW, is there any other way to reduce the number of unigenes, or any other way to filter the data?
Having ~300K contigs as the initial, raw output of Trinity isn't terribly surprising. Trinity includes scripts for filtering this initial contig file; the most useful filtering rule is to remove contigs supported by only a very small number of reads. To do this, first use the analysis tools in the Trinity package to align your read data back to the assembled contigs and then RSEM to calculate the relative read abundance for each contig, via the script align_and_estimate_abundance.pl (found in the Trinity util/ folder). Finally, run the script filter_fasta_by_rsem_values.pl (also in util/). Filtering out contigs with very low read support can have a dramatic impact on the total number.
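In command form, the two post-assembly steps above might look something like this (a sketch: the install path and read file names are placeholders, and the exact option names can differ between Trinity releases, so check each script's built-in help first):

```shell
TRINITY_HOME=/path/to/trinityrnaseq   # placeholder install location

# 1) Align the reads back to the assembly and estimate abundance with RSEM.
$TRINITY_HOME/util/align_and_estimate_abundance.pl \
    --transcripts Trinity.fasta --seqType fq \
    --left reads_1P.fq.gz --right reads_2P.fq.gz \
    --est_method RSEM --aln_method bowtie \
    --prep_reference --output_dir rsem_out

# 2) Drop contigs whose FPKM falls below the chosen cutoff.
$TRINITY_HOME/util/filter_fasta_by_rsem_values.pl \
    --rsem_output rsem_out/RSEM.isoforms.results \
    --fasta Trinity.fasta \
    --fpkm_cutoff 1 \
    --output Trinity_filtered.fasta
```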

Also, in my opinion, 500 bp is much too high a minimum contig length; I routinely use 200 bp.
Old 12-22-2015, 10:16 PM   #5
huan
Member
 
Location: China

Join Date: Oct 2010
Posts: 55
Default

Thanks a lot @kmcarr.
That's quite a good idea, but I am not sure whether "removing contigs" means removing the Inchworm-stage contigs with --no_run_butterfly --no_run_quantifygraph or something else.
Thanks a lot again.
Old 12-23-2015, 08:29 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,142
Default

Quote:
Originally Posted by huan
Thanks a lot @kmcarr.
That's quite a good idea, but I am not sure whether "removing contigs" means removing the Inchworm-stage contigs with --no_run_butterfly --no_run_quantifygraph or something else.
Thanks a lot again.
The contig filtering I am describing takes place after the Trinity assembly is complete, not during the assembly phase.

The basic process is to use the align_and_estimate_abundance.pl script, found in the Trinity util/ folder, to align your input reads to the 333K contigs in your final Trinity.fasta file. It also produces abundance estimates (FPKM) using RSEM.

Following this step, you use the filter_fasta_by_rsem_values.pl script (also in util/) to filter out contigs with low FPKM (the threshold value is adjustable). The contigs you are removing come from the final set in the Trinity.fasta file; the output is a new, smaller file (e.g. Trinity_filtered.fasta).
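To make the filtering step concrete, here is a tiny, self-contained illustration (with made-up contig names and FPKM values, not Trinity's actual script) of what an FPKM-based filter does: keep only the FASTA records whose FPKM in an RSEM-style results table meets the cutoff.

```shell
# Toy RSEM-style abundance table (made-up values).
cat > rsem.isoforms.results <<'EOF'
transcript_id FPKM
c1 0.5
c2 12.3
c3 7.8
EOF

# Toy assembly with one record per contig.
cat > Trinity.fasta <<'EOF'
>c1
ACGT
>c2
GGCC
>c3
TTAA
EOF

# Collect the ids whose FPKM is at least 1 (skipping the header line),
# then emit only those records from the FASTA.
awk 'NR > 1 && $2 >= 1 {print $1}' rsem.isoforms.results > keep.ids
awk 'NR == FNR {keep[$1]; next}
     /^>/      {p = (substr($1, 2) in keep)}
     p' keep.ids Trinity.fasta > Trinity_filtered.fasta

cat Trinity_filtered.fasta   # c1 (FPKM 0.5) is gone; c2 and c3 remain
```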
Old 12-23-2015, 04:59 PM   #7
huan
Member
 
Location: China

Join Date: Oct 2010
Posts: 55
Default

I got it. Thanks kmcarr.
But I wonder whether I will lose a lot of lowly expressed transcripts if I filter after the Trinity assembly is complete. Will that affect my analysis? BTW, if I filter before assembly, the lowly expressed transcripts can still remain. Would that give a better result?
Old 12-24-2015, 11:40 AM   #8
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,142
Default

Quote:
Originally Posted by huan
I got it. Thanks kmcarr.
But I wonder whether I will lose a lot of lowly expressed transcripts if I filter after the Trinity assembly is complete. Will that affect my analysis? BTW, if I filter before assembly, the lowly expressed transcripts can still remain. Would that give a better result?
When you say "filter before assembly", do you mean performing quality/adapter trimming and filtering of your raw reads? You should pretty much always do that, for any data and any downstream analysis.

Even so, it is still a good idea to perform the abundance filtering I described post-assembly. What we are talking about is removing contigs with virtually no support in the underlying data, only 1-2 FPKM. Think about that for a second: these "contigs" are supported by roughly 1 out of 1,000,000 reads. Is that a genuine, meaningful transcript? Probably not. More likely it is sequencing error masquerading as a unique transcript.

Researchers are often far too worried about removing what is essentially noise in their data, demanding that the bioinformatician keep everything for fear of missing some big, important discovery. In reality it just confounds the analysis.
Old 12-27-2015, 10:00 PM   #9
huan
Member
 
Location: China

Join Date: Oct 2010
Posts: 55
Default

Thanks a lot kmcarr. Your answer really works great.

Tags
assembly, de novo, short contigs, trinity
