**TiborNagy** · 04-11-2014, 03:47 AM

We filtered transcripts by the length of their longest ORF. PASA also tries to reduce the number of transcripts.
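A minimal sketch of filtering by longest ORF length, for readers who want to see the idea in code. It is not TiborNagy's actual script: the function names and the 300 nt cutoff are assumptions chosen for illustration, and an ORF is taken here to be an ATG-to-stop stretch in any of the six reading frames.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_length(seq):
    """Length in nt of the longest ATG..stop ORF over all six frames.

    The stop codon is counted; ORFs that run off the end of the
    transcript without a stop codon are ignored in this sketch.
    """
    best = 0
    for strand in (seq.upper(), reverse_complement(seq).upper()):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_orf(records, min_orf_nt=300):
    """Keep records whose longest ORF is at least min_orf_nt (assumed cutoff)."""
    return [(name, seq) for name, seq in records
            if longest_orf_length(seq) >= min_orf_nt]
```

To use it on an assembly, pass an open file handle to `read_fasta` and the resulting records to `filter_by_orf`. A dedicated ORF predictor will handle partial ORFs and coding potential far better than this sketch.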

**yueluo** · 04-11-2014, 06:37 AM

Here are some steps I use:
1. Perform digital normalization
2. Set --min_kmer_cov to >= 2
3. Use --triplet_lock or --extended_lock for Butterfly reconstruction

Then use CD-HIT-EST or TGICL to cluster transcripts. Or, as TiborNagy mentioned, do a CDS/ORF prediction and pick the longest.

There could be some other methods, but I only have experience with the ones above.
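As a rough illustration of "pick the longest", the sketch below keeps one transcript per Trinity component. It uses transcript length as a crude stand-in for ORF length, and it assumes transcript IDs of the form `comp<N>_c<N>_seq<N>`; both are assumptions for this example, and the function names are hypothetical.

```python
import re

# Assumed ID layout: comp<N>_c<N>_seq<N>; adjust the pattern if your IDs differ.
COMPONENT_RE = re.compile(r"^(comp\d+_c\d+)_seq\d+")


def component_of(transcript_id):
    """Return the component part of an ID, or the ID itself if it does not match."""
    match = COMPONENT_RE.match(transcript_id)
    return match.group(1) if match else transcript_id


def longest_per_component(records):
    """Keep the longest (name, sequence) record for each component.

    Ties keep the first record seen; output follows first-seen component order.
    """
    best = {}
    for name, seq in records:
        comp = component_of(name)
        if comp not in best or len(seq) > len(best[comp][1]):
            best[comp] = (name, seq)
    return list(best.values())
```

Note that this discards all alternative isoforms, so it only makes sense when one representative sequence per component is enough for the downstream analysis.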

**likebiology** · 04-11-2014, 09:02 AM

Originally posted by TiborNagy:

> We filtered transcripts by the length of their longest ORF. PASA also tries to reduce the number of transcripts.

Thank you, TiborNagy! I am a newbie in NGS data analysis, so could you explain in detail how to filter transcripts by the length of the longest ORF? I will read up on PASA and follow its instructions.

Thank you!

**likebiology** · 04-11-2014, 09:16 AM

Originally posted by yueluo:

> Here are some steps I use:
> 1. Perform digital normalization
> 2. Set --min_kmer_cov to >= 2
> 3. Use --triplet_lock or --extended_lock for Butterfly reconstruction
>
> Then use CD-HIT-EST or TGICL to cluster transcripts. Or, as TiborNagy mentioned, do a CDS/ORF prediction and pick the longest.
>
> There could be some other methods, but I only have experience with the ones above.

Thank you very much, yueluo, for sharing your experience!

As for --triplet_lock and --extended_lock, which is better? Or do you recommend using both parameters at the same time?

I will set --min_kmer_cov to >= 2 and run Trinity again.

As for "Perform a digital-normalization", could you show me how in detail? Thanks!

Best,

**yueluo** · 04-11-2014, 09:26 AM

--triplet_lock and --extended_lock are two stringency levels in Butterfly reconstruction. --extended_lock is the more stringent, so it produces fewer transcripts. You can use either one.

Digital normalization is included in the Trinity package; this is from its website:

http://trinityrnaseq.sourceforge.net/trinity_insilico_normalization.html
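To give a feel for what digital normalization does, here is a toy sketch of the streaming median k-mer coverage idea: a read is kept only while the median count of its k-mers, among reads kept so far, is below a coverage cap. This is not the code of Trinity's normalization script, which works differently in detail; the function names and the defaults `k=25` and `max_cov=30` are assumptions for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_reads(reads, k=25, max_cov=30):
    """Toy digital normalization over an iterable of read sequences.

    Keeps a read if the median abundance of its k-mers (counted over
    the reads kept so far) is below max_cov, then counts its k-mers.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < max_cov:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

The effect is that highly expressed transcripts stop contributing reads once they are well covered, while reads from rare transcripts are all retained. An in-memory `Counter` will not scale to a real data set; the sketch is only meant to show the logic.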

**likebiology** · 04-11-2014, 09:30 AM

Originally posted by yueluo:

> --triplet_lock and --extended_lock are two stringency levels in Butterfly reconstruction. --extended_lock is the more stringent, so it produces fewer transcripts. You can use either one.
>
> Digital normalization is included in the Trinity package; this is from its website:
> http://trinityrnaseq.sourceforge.net...alization.html

Many thanks indeed, yueluo! I will try this and hope that this time I get fewer transcripts.

Best,

**likebiology** · 04-13-2014, 10:03 PM

I assembled the transcriptome again, this time following yueluo's suggestions.

First, I ran Trinity's digital normalization with this command:
util/normalize_by_kmer_coverage.pl --seqType fq --JM 80G --max_cov 30 --min_kmer_cov 2 --left_list R1.list --right_list R2.list --pairs_together --PARALLEL_STATS --JELLY_CPU 10

Then I ran the assembly with this command:
Trinity.pl --seqType fq --JM 80G --min_kmer_cov 2 --extended_lock --left R1.list.normalized_K25_C30_pctSD200.fq --right R2.list.normalized_K25_C30_pctSD200.fq --CPU 10
This gave 467208 transcripts in 324216 components, with an N50 of 1578.

After this, I clustered the transcripts with CD-HIT-EST. The results are below:
Total trinity transcripts: 413673
Total trinity components: 324161
Percent GC: 48.94

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3891
Contig N20: 2679
Contig N30: 1930
Contig N40: 1371
Contig N50: 970

Median contig length: 349
Average contig: 638.18
Total assembled bases: 263995964
There are still too many transcripts. Does anybody have suggestions for reducing the number further? Thank you!
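For readers unfamiliar with the N10 to N50 figures in the stats above, the sketch below shows the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches the given fraction of all assembled bases. The function name `n_stat` is hypothetical and this is not the code of Trinity's stats script.

```python
def n_stat(lengths, fraction=0.5):
    """Return the Nx contig length (N50 when fraction=0.5).

    Contigs are visited longest first; the result is the length of the
    contig at which the cumulative sum reaches fraction * total bases.
    Returns 0 for an empty input.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return 0
```

With `fraction=0.1` the same function gives N10, and so on. Because the statistic is weighted by bases, a large number of short contigs pulls N50 down even when many long transcripts are present.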

**Wallysb01** · 04-13-2014, 11:17 PM

What is it you’re trying to do with them that makes this too many? Trinity has good tools for differential expression that will work at the gene level with this number of transcripts.

To try to answer your question, though, the only other thought I had was to expand the gene-level classification through Trinity’s Trinotate pipeline (http://trinotate.sourceforge.net). Maybe you can further collapse transcripts that hit the same orthologue. If you only care about protein-coding genes, you could also just get rid of anything that doesn’t hit an orthologue or doesn’t hit a vertebrate protein (assuming you’re working on vertebrates).

But de novo transcriptome assembly is just kinda messy, and this is what everyone deals with when doing it. I don’t think there are many options beyond what’s listed here that will work for you, unless you have a reference genome to align the transcripts back to.
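One way to act on the "get rid of anything that doesn’t hit" suggestion is sketched below. It assumes you already have BLAST results in the standard 12-column tabular format (query ID in column 1, e-value in column 11); the function names and the 1e-5 e-value cutoff are assumptions for illustration, not something recommended in this thread.

```python
def ids_with_hits(blast_lines, max_evalue=1e-5):
    """Collect query IDs with at least one hit at or below max_evalue.

    Expects tab-separated lines in the 12-column BLAST tabular layout;
    comment lines and malformed lines are skipped.
    """
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def keep_transcripts_with_hits(records, hit_ids):
    """Keep only (name, sequence) records whose name is in hit_ids."""
    return [(name, seq) for name, seq in records if name in hit_ids]
```

Keep in mind that this throws away non-coding and lineage-specific transcripts along with the noise, so it only suits analyses restricted to conserved protein-coding genes.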

**likebiology** · 04-15-2014, 10:00 PM

Originally posted by Wallysb01:

> What is it you’re trying to do with them that makes this too many? Trinity has good tools for differential expression that will work at the gene level with this number of transcripts.
>
> To try to answer your question, though, the only other thought I had was to expand the gene-level classification through Trinity’s Trinotate pipeline (http://trinotate.sourceforge.net). Maybe you can further collapse transcripts that hit the same orthologue. If you only care about protein-coding genes, you could also just get rid of anything that doesn’t hit an orthologue or doesn’t hit a vertebrate protein (assuming you’re working on vertebrates).
>
> But de novo transcriptome assembly is just kinda messy, and this is what everyone deals with when doing it. I don’t think there are many options beyond what’s listed here that will work for you, unless you have a reference genome to align the transcripts back to.

Hi Wallysb01, my original plan was to download the nr, Swiss-Prot, KEGG, and COG databases and BLAST my transcripts against them. The Trinotate pipeline seems to recommend Swiss-Prot and Pfam. Do you think this is enough? I am confused about the following steps. Thanks!

how to minimize transcripts while de novo assembly transcriptome