03-18-2020, 02:58 AM   #1
polaxgr
RNA de novo assembly - blasts - KEGG - GO

Hello,

I am a PhD candidate in bioinformatics with (almost) zero guidance, so I am seeking help here. I was asked to do a de novo transcriptome assembly from total RNA sequencing data. After FastQC I trimmed my original FASTQ files and then ran Trinity, which gave me trinity_trimmed.fasta. Some of the things I was asked to do are:

1) Fill out a table like this one:

        | total number | total length (nt) | mean length (nt) | N50 | total consensus sequences | distinct clusters | distinct singletons
Contig  |              |                   |                  |     |                           |                   |
Unigene |              |                   |                  |     |                           |                   |

I ran TrinityStats.pl on trinity_trimmed.fasta and got this:

## Counts of transcripts, etc.
################################
Total trinity 'genes': 87177
Total trinity transcripts: 169974
Percent GC: 40.18

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3290
Contig N20: 2503
Contig N30: 2049
Contig N40: 1713
Contig N50: 1413

Median contig length: 529
Average contig: 869.67
Total assembled bases: 147821426

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 3087
Contig N20: 2301
Contig N30: 1816
Contig N40: 1414
Contig N50: 1029

Median contig length: 348
Average contig: 632.11
Total assembled bases: 55105774

My question has two parts: a) Can I fill out this table with this information? b) Some people use the CAP3 assembly tool, and I have already run it too in case I need it. Is that the way to go? Do I need to check the quality of trinity_trimmed.fasta?
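For the first four columns of the table (total number, total length, mean length, N50), the values can also be recomputed directly from a FASTA file, which makes it possible to cross-check them against the TrinityStats.pl output. What follows is a minimal sketch, not the TrinityStats.pl implementation: the function names are hypothetical, and N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembled bases.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Collect the table columns that can be derived from sequence lengths alone."""
    total = sum(lengths)
    return {
        "total_number": len(lengths),
        "total_length_nt": total,
        "mean_length_nt": round(total / len(lengths), 2) if lengths else 0.0,
        "N50": n50(lengths),
    }
```

A call such as `summarize(read_fasta_lengths("trinity_trimmed.fasta"))` would give one row of the table. The remaining columns (consensus sequences, clusters, singletons) depend on a clustering step and cannot be derived from lengths alone.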

For the CAP3 output I also ran TrinityStats.pl and got this:

For contigs:

Total trinity 'genes': 23017
Total trinity transcripts: 23017
Percent GC: 40.42

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3885
Contig N20: 3082
Contig N30: 2598
Contig N40: 2254
Contig N50: 1971

Median contig length: 1318
Average contig: 1522.23
Total assembled bases: 35037102

- note: not reporting gene-based longest isoform info since couldn't parse Trinity accession info.
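The note above is most likely explained by naming: CAP3 renames merged sequences to Contig1, Contig2, and so on, so the gene/isoform information carried in Trinity accessions is lost for contigs, while singletons keep their original names. The sketch below illustrates the idea under an assumption: it treats accessions of the form TRINITY_DN1000_c115_g5_i1 as "gene prefix plus _i<isoform>". The regular expression and function names are hypothetical and are not taken from TrinityStats.pl.

```python
import re

# Assumed accession layout: <anything>_c<N>_g<N>_i<N>, e.g. TRINITY_DN1000_c115_g5_i1.
TRINITY_ID = re.compile(r"^(?P<gene>.+_c\d+_g\d+)_i\d+$")


def gene_id(accession):
    """Return the gene part of a Trinity-style accession, or None if it does not match."""
    match = TRINITY_ID.match(accession)
    return match.group("gene") if match else None


def longest_isoform_per_gene(lengths_by_accession):
    """Map each gene to the (accession, length) of its longest isoform.

    Accessions that do not match the assumed pattern (such as CAP3 'Contig1')
    are treated as their own gene.
    """
    best = {}
    for accession, length in lengths_by_accession.items():
        gene = gene_id(accession) or accession
        if gene not in best or length > best[gene][1]:
            best[gene] = (accession, length)
    return best
```

With names like Contig1, every sequence ends up as its own "gene", which would be consistent with the identical gene and transcript counts reported for the CAP3 contigs.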

For singletons:

## Counts of transcripts, etc.
################################
Total trinity 'genes': 67695
Total trinity transcripts: 81478
Percent GC: 38.77

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 1906
Contig N20: 1347
Contig N30: 1007
Contig N40: 751
Contig N50: 572

Median contig length: 333
Average contig: 490.70
Total assembled bases: 39981353

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 1853
Contig N20: 1284
Contig N30: 917
Contig N40: 671
Contig N50: 508

Median contig length: 317
Average contig: 461.01
Total assembled bases: 31207973


2) Provide the blastp/blastx results as Excel files.

Should I use -outfmt 16?
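For context on the format question: in BLAST+, -outfmt 16 is single-file BLAST XML2, whereas -outfmt 6 is plain tab-separated text, which spreadsheet programs open more readily. The sketch below assumes -outfmt 6 with its default 12 columns and adds a header row while converting to CSV. The function name is hypothetical, and a custom column list passed to -outfmt 6 would need a matching header.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_TABULAR_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast_tabular_to_csv(tabular_text):
    """Convert default -outfmt 6 text to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(BLAST_TABULAR_COLUMNS)
    for line in tabular_text.splitlines():
        # Skip blank lines and the '#' comment lines that -outfmt 7 would add.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(BLAST_TABULAR_COLUMNS):
            raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
        writer.writerow(fields)
    return out.getvalue()
```

Reading a results file and writing `blast_tabular_to_csv(text)` to a `.csv` file gives something Excel can open directly, with one hit per row.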

(Is hmmscan/Pfam also needed for KEGG/GO terms?)

3) Do a KEGG and GO analysis. Should I annotate the assembly (but which one, trinity_trimmed.fasta or the CAP3 one?) using Trinotate and then go with GOseq for GO? Or could I use Blast2GO (7-day trial version) with the blastx/blastp files produced with -outfmt 16? For KEGG, should I also use Blast2GO, or could I use something like this: https://www.kegg.jp/blastkoala/ ?
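To make the GO step more concrete, the core of a term enrichment test can be written as a one-sided hypergeometric test: given how many annotated genes carry a term, how surprising is the number of hits in a gene list of interest? This is only an illustrative sketch with hypothetical names. It is a plain hypergeometric test and not what GOseq does, since GOseq additionally corrects for transcript length bias, and it applies no multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided p-value P(X >= hits) for a hypergeometric draw.

    population: number of genes with any annotation (background)
    annotated:  number of background genes carrying the term
    sample:     number of genes in the list of interest
    hits:       number of genes in the list carrying the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie between 0 and population")
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for every outcome at least as extreme as 'hits'.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(max(hits, 0), upper + 1)
    )
    return tail / total
```

For example, `hypergeom_enrichment_p(20000, 150, 400, 12)` would ask how likely it is to see 12 or more genes with a term among 400 selected genes when 150 of 20000 background genes carry it (all numbers made up for illustration).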

I know this was long, sorry about that.