RNA de novo assembly - blasts - KEGG - GO

polaxgr


03-18-2020, 02:58 AM

Hello,

I am a PhD candidate in bioinformatics with (almost) zero guidance, so I am seeking help here. I was asked to do a de novo transcriptome assembly from total RNA sequencing data. After running FastQC I trimmed my original FASTQ files and then ran Trinity, which gave me trinity_trimmed.fasta. Some of the things I was asked to do are:

1) Fill out a table like this one:

        | total number | total length (nt) | mean length (nt) | N50 | total consensus sequences | distinct clusters | distinct singletons
Contig  |
Unigene |

I used TrinityStats.pl and got this:

## Counts of transcripts, etc.
################################
Total trinity 'genes': 87177
Total trinity transcripts: 169974
Percent GC: 40.18

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3290
Contig N20: 2503
Contig N30: 2049
Contig N40: 1713
Contig N50: 1413

Median contig length: 529
Average contig: 869.67
Total assembled bases: 147821426

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 3087
Contig N20: 2301
Contig N30: 1816
Contig N40: 1414
Contig N50: 1029

Median contig length: 348
Average contig: 632.11
Total assembled bases: 55105774

My question has two parts: a) Can I fill out this table with this information? b) Some people use the CAP3 assembly tool. I have already run it too, in case I need it. Is that the way to go? Do I need to check the quality of trinity_trimmed.fasta?
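For reference, the first four columns of the table (total number, total length, mean length, N50) can also be computed directly from any FASTA file, whatever the sequence names look like. The following is only a minimal sketch: the function names are made up for illustration, and N50 is taken here as the length of the contig at which the running sum of lengths, sorted longest first, reaches half of the total assembled bases.

```python
def read_fasta_lengths(lines):
    """Return one sequence length per FASTA record from an iterable of lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the cumulative sum (longest first)
    reaches at least half of the total number of bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Collect the table columns that depend only on sequence lengths."""
    total = sum(lengths)
    return {
        "total_number": len(lengths),
        "total_length_nt": total,
        "mean_length_nt": total / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Tiny in-memory FASTA so the sketch runs without any input file.
    demo = [">seq1", "ACGTAC", ">seq2", "GG", "A"]
    print(summarize(read_fasta_lengths(demo)))
```

To use it on a real assembly, pass an open file handle instead of the demo list, e.g. `with open("trinity_trimmed.fasta") as fh: summarize(read_fasta_lengths(fh))`.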

For the CAP3 output I also used TrinityStats.pl and got this:

For contigs:

Total trinity 'genes': 23017
Total trinity transcripts: 23017
Percent GC: 40.42

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3885
Contig N20: 3082
Contig N30: 2598
Contig N40: 2254
Contig N50: 1971

Median contig length: 1318
Average contig: 1522.23
Total assembled bases: 35037102

- note: not reporting gene-based longest isoform info since couldn't parse Trinity accession info.
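The note above most likely appears because the CAP3 contigs no longer carry Trinity-style names, so there is no gene/isoform information left to group by. As an illustration only, the sketch below assumes the common Trinity naming pattern `TRINITY_DN<n>_c<n>_g<n>_i<n>`, where everything before `_i<n>` identifies the 'gene', and assumes CAP3 names such as `Contig1`. The function names are hypothetical and this is not how TrinityStats.pl itself is implemented.

```python
import re

# Assumed Trinity accession pattern: <gene part>_i<isoform number>
_TRINITY_ID = re.compile(r"^(.+_c\d+_g\d+)_i\d+$")


def trinity_gene_id(header):
    """Return the gene part of a Trinity accession, or None if the
    name does not follow the assumed pattern."""
    stripped = header.strip().lstrip(">")
    if not stripped:
        return None
    accession = stripped.split()[0]  # drop any description after the ID
    match = _TRINITY_ID.match(accession)
    return match.group(1) if match else None


def longest_isoform_lengths(records):
    """Map each gene to the length of its longest isoform.

    `records` is an iterable of (accession, length) pairs. Names that
    cannot be parsed are kept as their own single-sequence 'gene'.
    """
    longest = {}
    for accession, length in records:
        gene = trinity_gene_id(accession) or accession
        if length > longest.get(gene, 0):
            longest[gene] = length
    return longest


if __name__ == "__main__":
    demo = [
        ("TRINITY_DN1_c0_g1_i1", 300),
        ("TRINITY_DN1_c0_g1_i2", 500),
        ("Contig1", 1200),  # CAP3-style name, no gene/isoform info
    ]
    print(longest_isoform_lengths(demo))
```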

For singletons:

## Counts of transcripts, etc.
################################
Total trinity 'genes': 67695
Total trinity transcripts: 81478
Percent GC: 38.77

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 1906
Contig N20: 1347
Contig N30: 1007
Contig N40: 751
Contig N50: 572

Median contig length: 333
Average contig: 490.70
Total assembled bases: 39981353

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 1853
Contig N20: 1284
Contig N30: 917
Contig N40: 671
Contig N50: 508

Median contig length: 317
Average contig: 461.01
Total assembled bases: 31207973

2) Provide the blastp/blastx results as Excel files.

Should I use -outfmt 16?

(Is hmmscan/Pfam also needed for the KEGG/GO terms?)
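On the Excel part: in BLAST+, -outfmt 16 is the single-file XML2 format, while -outfmt 6 is tab-separated text with twelve default columns, which spreadsheets open more directly. A minimal sketch of turning such a table into a CSV with a header row follows. It assumes the default -outfmt 6 column set was used; the function name and the example hit are made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast6_to_csv(tab_lines, out_handle):
    """Write BLAST tabular lines to out_handle as CSV with a header row.

    Returns the number of hits written. Blank lines and comment lines
    (as produced by -outfmt 7) are skipped.
    """
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(BLAST6_COLUMNS)
    written = 0
    for line in tab_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(BLAST6_COLUMNS):
            raise ValueError(
                f"expected {len(BLAST6_COLUMNS)} columns, got {len(fields)}"
            )
        writer.writerow(fields)
        written += 1
    return written


if __name__ == "__main__":
    # Made-up example hit, only to show the shape of the data.
    demo = ["q1\ts1\t98.5\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200\n"]
    buffer = io.StringIO()
    blast6_to_csv(demo, buffer)
    print(buffer.getvalue())
```

With real files, `tab_lines` would be an open handle on the BLAST output and `out_handle` a file opened with `newline=""`. If custom columns were requested on the blastx/blastp command line, `BLAST6_COLUMNS` has to be changed to match.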

3) Do a KEGG and GO analysis. Should I annotate the assembly with Trinotate (and which one: trinity_trimmed.fasta or the CAP3 one?) and then use GOseq for GO? Or could I use Blast2GO (7-day trial version) with the blastx/blastp files produced with -outfmt 16? Should KEGG also be done in Blast2GO, or could I use something like https://www.kegg.jp/blastkoala/ ?

I know this was long, sorry about that.