SEQanswers

Old 04-11-2014, 03:20 AM   #1
likebiology
Member
 
Location: haifa, Israel

Join Date: Aug 2013
Posts: 14
How to minimize the number of transcripts in a de novo transcriptome assembly

Hi all,

I am using Trinity for de novo assembly of mRNA sequenced on an Illumina HiSeq 2000 (100 bp paired-end). We have more than 11M clean reads per sample and 139M clean reads in total (about 28G of clean data). Read quality is good: according to FastQC, all bases have quality scores above 30.

I ran Trinity with default parameters and got 920k transcripts (201 bp to 24477 bp; counting only transcripts from 500 bp to 24477 bp leaves 350k). After that, I filtered them with CD-HIT-EST, which left 789k clusters. That is still too many transcripts.
The clusters range from 201 bp to 24477 bp, and the N50 is 2549 bp.

I then divided the data into two groups of 57.5M and 81M reads. Assembling the 57.5M reads alone gave 518387 transcripts in 358886 components. The longest is 17339 bp and the N50 is 2900 bp. After filtering with CD-HIT-EST, there are 443765 transcripts in 358695 components, and the N50 is 1618 bp, which is shorter than before.

Does anybody have suggestions on how to reduce the number of transcripts?

Is it OK that the N50 becomes shorter after filtering (2900 bp to 1618 bp)?
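On the N50 question, a toy example may help. N50 is the contig length L such that contigs of length L or longer hold at least half of all assembled bases. The sketch below is plain Python with made-up lengths (not taken from this dataset). It shows how removing redundant copies of long isoforms can lower N50 without the assembly getting worse.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy data: three near-identical 3000 bp isoforms of one gene
# plus a few shorter, unique transcripts.
before = [3000, 3000, 3000, 1500, 1200, 800, 400, 300]
# After clustering, only one representative of the 3000 bp isoforms is left.
after = [3000, 1500, 1200, 800, 400, 300]

print("N50 before clustering:", n50(before))
print("N50 after clustering: ", n50(after))
```

In this toy case the N50 falls from 3000 to 1500 only because duplicated long sequences no longer inflate the total, so a lower N50 after CD-HIT-EST is not by itself a sign of a problem.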
Old 04-11-2014, 04:47 AM   #2
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329

We have filtered transcripts by the length of their longest ORF. PASA also tries to reduce the number of transcripts.
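A minimal sketch of what filtering by longest ORF could look like, in plain Python. It is not the poster's actual pipeline: the function names and the 300 nt cutoff are illustrative assumptions, and it only counts complete ORFs (ATG to the first in-frame stop), whereas dedicated ORF predictors also handle partial ORFs, which are common in de novo transcripts.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_length(seq):
    """Length (nt) of the longest ATG..stop ORF over all six reading frames."""
    best = 0
    for strand in (seq.upper(), reverse_complement(seq).upper()):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    best = max(best, i + 3 - start)  # include the stop codon
                    start = None
    return best


def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_orf(records, min_orf_nt=300):
    """Keep records whose longest complete ORF is at least min_orf_nt long."""
    return [(n, s) for n, s in records if longest_orf_length(s) >= min_orf_nt]
```

Usage would be along the lines of `filter_by_orf(read_fasta(open("Trinity.fasta")))`, where the file name is only a placeholder.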
Old 04-11-2014, 07:37 AM   #3
yueluo
Member
 
Location: Guangzhou China

Join Date: Aug 2013
Posts: 82

Here are some steps I use:
1. Perform a digital-normalization
2. Set --min_kmer_cov to >= 2
3. Use --triplet_lock or --extended_lock for Butterfly reconstruction

Then use CD-HIT-EST or Tgicl to cluster transcripts. Or as TiborNagy mentioned, do a CDS/ORF prediction and pick the longest.

There could be some other methods, but I only have experience with the ones above.
Old 04-11-2014, 10:02 AM   #4
likebiology
Member
 
Location: haifa, Israel

Join Date: Aug 2013
Posts: 14

Quote:
Originally Posted by TiborNagy View Post
We have filtered transcripts by the length of their longest ORF. PASA also tries to reduce the number of transcripts.
Thank you, TiborNagy! As I am a newbie in NGS data analysis, could you explain in detail how to filter transcripts by the length of the longest ORF? I will read the PASA documentation and follow the instructions.

Thank you!
Old 04-11-2014, 10:16 AM   #5
likebiology
Member
 
Location: haifa, Israel

Join Date: Aug 2013
Posts: 14

Quote:
Originally Posted by yueluo View Post
Here are some steps I use:
1. Perform a digital-normalization
2. Set --min_kmer_cov to >= 2
3. Use --triplet_lock or --extended_lock for Butterfly reconstruction

Then use CD-HIT-EST or Tgicl to cluster transcripts. Or as TiborNagy mentioned, do a CDS/ORF prediction and pick the longest.

There could be some other methods, but I only have experience with the ones above.
Thank you very much, yueluo, for sharing your experience!

As for --triplet_lock and --extended_lock, which is better? Or do you recommend using both parameters at the same time?

I will set --min_kmer_cov to >= 2 and run Trinity again.

As for "Perform a digital-normalization", could you show me how in detail? Thanks!

Best,
Old 04-11-2014, 10:26 AM   #6
yueluo
Member
 
Location: Guangzhou China

Join Date: Aug 2013
Posts: 82

--triplet_lock and --extended_lock are two stringency levels in Butterfly reconstruction. --extended_lock is the stricter of the two, so it produces fewer transcripts. You can use either one.

Digital normalization is included in the Trinity package; this is from its website:
http://trinityrnaseq.sourceforge.net...alization.html
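To give a feel for the idea behind digital normalization, here is a simplified sketch in plain Python. It implements the generic streaming form of the technique (drop a read once the median count of its k-mers has already reached a coverage cap). It is not Trinity's implementation, which counts k-mers with Jellyfish and handles read pairs together. The function names and the default values of `k` and `max_cov` are illustrative assumptions, and the sketch ignores reverse complements and memory limits.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a read, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=25, max_cov=30):
    """Keep a read only while its median k-mer count is below max_cov."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Estimate how well this read is already covered by kept reads.
        if median(counts[km] for km in read_kmers) < max_cov:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy demonstration: ten copies of an abundant read and one rare read.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = normalize(reads, k=4, max_cov=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Highly expressed transcripts lose most of their redundant reads while rare ones keep theirs, which is why normalization tends to cut memory use and run time.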
Old 04-11-2014, 10:30 AM   #7
likebiology
Member
 
Location: haifa, Israel

Join Date: Aug 2013
Posts: 14

Quote:
Originally Posted by yueluo View Post
--triplet_lock and --extended_lock are two stringency levels in Butterfly reconstruction. --extended_lock is the stricter of the two, so it produces fewer transcripts. You can use either one.

Digital normalization is included in the Trinity package; this is from its website:
http://trinityrnaseq.sourceforge.net...alization.html
Many thanks indeed, yueluo! I will try it and hope that this time I get fewer transcripts.

Best,
Old 04-13-2014, 11:03 PM   #8
likebiology
Member
 
Location: haifa, Israel

Join Date: Aug 2013
Posts: 14

I assembled the transcriptome again, this time following yueluo's suggestions.

First, I ran Trinity's digital normalization with this command:
util/normalize_by_kmer_coverage.pl --seqType fq --JM 80G --max_cov 30 --min_kmer_cov 2 --left_list R1.list --right_list R2.list --pairs_together --PARALLEL_STATS --JELLY_CPU 10

Then I ran the assembly with this command:
Trinity.pl --seqType fq --JM 80G --min_kmer_cov 2 --extended_lock --left R1.list.normalized_K25_C30_pctSD200.fq --right R2.list.normalized_K25_C30_pctSD200.fq --CPU 10
This gave 467208 transcripts in 324216 components, with an N50 of 1578.

After this, I clustered the transcripts with CD-HIT-EST. The results are below:
Total trinity transcripts: 413673
Total trinity components: 324161
Percent GC: 48.94

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3891
Contig N20: 2679
Contig N30: 1930
Contig N40: 1371
Contig N50: 970

Median contig length: 349
Average contig: 638.18
Total assembled bases: 263995964
That is still too many transcripts. Does anybody have suggestions for reducing the number further? Thank you!
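For context on the transcript and component counts above, here is a rough Python sketch of how they relate. It assumes transcript IDs of the form `comp<N>_c<M>_seq<K>`, as written by Trinity releases of that period. The regular expression, the function names, and the idea of keeping the longest isoform per component are illustrative assumptions, not part of Trinity itself.

```python
import re
from collections import defaultdict

# Assumed ID layout: component prefix, then an isoform number.
COMPONENT_RE = re.compile(r"^(comp\d+_c\d+)_seq\d+$")


def component_of(transcript_id):
    """Return the component prefix of a transcript ID."""
    match = COMPONENT_RE.match(transcript_id)
    if match is None:
        raise ValueError(f"unexpected transcript ID: {transcript_id!r}")
    return match.group(1)


def summarize(lengths_by_id):
    """Count transcripts and components; pick the longest isoform of each.

    lengths_by_id maps transcript ID -> length in bp.
    """
    by_component = defaultdict(list)
    for tid, length in lengths_by_id.items():
        by_component[component_of(tid)].append((length, tid))
    longest = {comp: max(members)[1] for comp, members in by_component.items()}
    return len(lengths_by_id), len(by_component), longest


# Hypothetical toy input.
toy = {"comp1_c0_seq1": 500, "comp1_c0_seq2": 900, "comp2_c0_seq1": 300}
n_transcripts, n_components, longest = summarize(toy)
print(n_transcripts, "transcripts in", n_components, "components")
```

Keeping one isoform per component would bring the count down to the number of components (about 324k here), at the cost of discarding isoform information.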
Old 04-14-2014, 12:17 AM   #9
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

What is it you’re trying to do with them that makes this too many? Trinity has good tools for differential expression that will work at the gene level with these numbers of transcripts.

To try to answer your question though, about the only other thought I had was to try to expand the gene level classification through Trinity’s Trinotate pipeline (http://trinotate.sourceforge.net). Maybe things that hit the same orthologue, you can collapse further.... If you only care about protein coding genes, you could also just get rid of anything that doesn’t hit an orthologue or doesn’t hit a vertebrate protein (assuming you’re working on vertebrates).

But de novo transcriptome assembly is just kinda messy, and this is what everyone deals with when doing it. I don’t think there are many other options than what’s listed here that will work for you unless you have a reference genome to align the transcripts back to.
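The "collapse things that hit the same orthologue" idea could be sketched roughly as follows in Python. It assumes BLAST tabular output (`-outfmt 6`) with the default twelve columns, where column 1 is the query, column 2 the subject, and column 12 the bit score. The function names and the rule of keeping the longest transcript per subject are illustrative assumptions, not something Trinotate does for you.

```python
def best_hits(blast_lines):
    """Map each query to (subject, bitscore) of its highest-scoring hit."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def collapse_by_subject(best, lengths):
    """Keep the longest transcript among those sharing a best-hit subject.

    Transcripts without any hit are absent from `best` and are dropped.
    """
    representative = {}
    for query, (subject, _score) in best.items():
        current = representative.get(subject)
        if current is None or lengths[query] > lengths[current]:
            representative[subject] = query
    return representative


# Hypothetical toy input: t1 and t2 both hit protein P1 best; t3 hits P3.
rows = [
    ["t1", "P1", "90", "100", "0", "0", "1", "100", "1", "100", "1e-50", "200"],
    ["t2", "P1", "85", "100", "0", "0", "1", "100", "1", "100", "1e-40", "150"],
    ["t2", "P2", "60", "100", "0", "0", "1", "100", "1", "100", "1e-10", "50"],
    ["t3", "P3", "70", "100", "0", "0", "1", "100", "1", "100", "1e-20", "80"],
]
lines = ["\t".join(row) for row in rows]
lengths = {"t1": 500, "t2": 900, "t3": 300, "t4": 250}
print(collapse_by_subject(best_hits(lines), lengths))
```

Here t4 has no hit and disappears, and t1 is folded into the longer t2, so this kind of filter is only appropriate if protein-coding genes are all you care about.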
Old 04-15-2014, 11:00 PM   #10
likebiology
Member
 
Location: haifa, Israel

Join Date: Aug 2013
Posts: 14

Quote:
Originally Posted by Wallysb01 View Post
What is it you’re trying to do with them that makes this too many? Trinity has good tools for differential expression that will work at the gene level with these numbers of transcripts.

To try to answer your question though, about the only other thought I had was to try to expand the gene level classification through Trinity’s Trinotate pipeline (http://trinotate.sourceforge.net). Maybe things that hit the same orthologue, you can collapse further.... If you only care about protein coding genes, you could also just get rid of anything that doesn’t hit an orthologue or doesn’t hit a vertebrate protein (assuming you’re working on vertebrates).

But de novo transcriptome assembly is just kinda messy, and this is what everyone deals with when doing it. I don’t think there are many other options than what’s listed here that will work for you unless you have a reference genome to align the transcripts back to.
Hi Wallysb01, my original plan was to download the nr, Swiss-Prot, KEGG, and COG databases and BLAST my transcripts against them. The Trinotate pipeline seems to recommend only Swiss-Prot and Pfam. Do you think that is enough? I am confused about the next steps. Thanks!
Tags
contigs, transcriptome assembly, trinity
