Hello,
I am a PhD candidate in bioinformatics with (almost) zero guidance, so I am seeking help here. I was asked to do a de novo transcriptome assembly from total RNA sequencing data. After FastQC I trimmed my original FASTQ files and then ran Trinity, which gave me trinity_trimmed.fasta. Some of the things I was asked to do are:
1) Fill out a table like this one:
| | total number | total length (nt) | mean length (nt) | N50 | total consensus sequences | distinct clusters | distinct singletons |
|---|---|---|---|---|---|---|---|
| Contig | | | | | | | |
| Unigene | | | | | | | |
I ran TrinityStats.pl on trinity_trimmed.fasta and got this:
## Counts of transcripts, etc.
################################
Total trinity 'genes': 87177
Total trinity transcripts: 169974
Percent GC: 40.18
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3290
Contig N20: 2503
Contig N30: 2049
Contig N40: 1713
Contig N50: 1413
Median contig length: 529
Average contig: 869.67
Total assembled bases: 147821426
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3087
Contig N20: 2301
Contig N30: 1816
Contig N40: 1414
Contig N50: 1029
Median contig length: 348
Average contig: 632.11
Total assembled bases: 55105774
My question has two parts: a) Can I fill out this table with this information? b) Some people use the CAP3 assembly tool. I have already run it as well, in case I need it. Is that the way to go? Do I need to check the quality of trinity_trimmed.fasta?
For the CAP3 output I also ran TrinityStats.pl and got this:
For contigs:
Total trinity 'genes': 23017
Total trinity transcripts: 23017
Percent GC: 40.42
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3885
Contig N20: 3082
Contig N30: 2598
Contig N40: 2254
Contig N50: 1971
Median contig length: 1318
Average contig: 1522.23
Total assembled bases: 35037102
- note: not reporting gene-based longest isoform info since couldn't parse Trinity accession info.
For singletons:
## Counts of transcripts, etc.
################################
Total trinity 'genes': 67695
Total trinity transcripts: 81478
Percent GC: 38.77
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 1906
Contig N20: 1347
Contig N30: 1007
Contig N40: 751
Contig N50: 572
Median contig length: 333
Average contig: 490.70
Total assembled bases: 39981353
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 1853
Contig N20: 1284
Contig N30: 917
Contig N40: 671
Contig N50: 508
Median contig length: 317
Average contig: 461.01
Total assembled bases: 31207973
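For the table in 1), the length-based columns (total number, total length, mean length, N50) could also be recomputed directly from any of the FASTA files (Trinity output, CAP3 contigs, CAP3 singletons). Below is a minimal Python sketch of how I understand these values; the function names are my own (hypothetical), and I am assuming N50 means the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembled bases. It does not cover the "consensus sequences", "clusters" or "singletons" columns.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence in a FASTA stream."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0  # empty input


def assembly_stats(lengths: List[int]) -> Dict[str, float]:
    """Collect the length-based columns of the table."""
    count = len(lengths)
    total = sum(lengths)
    return {
        "total_number": count,
        "total_length_nt": total,
        "mean_length_nt": round(total / count, 2) if count else 0.0,
        "n50": n50(lengths),
    }
```

Usage would be something like opening trinity_trimmed.fasta with `open(...)` and passing the file handle to `read_fasta_lengths`, then passing the result to `assembly_stats`. As a sanity check against the output above, 147821426 / 169974 gives the reported average of 869.67.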
2) Provide the blastp/blastx results as Excel files.
Should I use -outfmt 16?
(Is hmmscan/Pfam also needed for KEGG/GO terms?)
3) Do a KEGG and GO analysis. Should I annotate the assembly (but which one, trinity_trimmed.fasta or the CAP3 one?) using Trinotate and then use GOseq for GO? Or could I use Blast2GO (7-day trial version) with the blastx/blastp files in -outfmt 16? For KEGG, should I also use Blast2GO, or could I use something like this: https://www.kegg.jp/blastkoala/ ?
I know this was long, sorry about that.
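To make the Excel part concrete: as far as I understand, -outfmt 16 produces a single XML2 file, while -outfmt 6 produces tab-separated text with 12 default columns. Here is a minimal Python sketch of what I have in mind, assuming the BLAST results are written with -outfmt 6 and the default columns; it adds a header row and writes a CSV that Excel can open. The function name `blast6_to_csv` is hypothetical, not part of BLAST.

```python
import csv
from typing import TextIO

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast6_to_csv(src: TextIO, dst: TextIO) -> int:
    """Copy tab-separated BLAST hits to CSV with a header; return the hit count."""
    writer = csv.writer(dst)
    writer.writerow(BLAST6_COLUMNS)
    hits = 0
    for line in src:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (present in -outfmt 7 style output).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(BLAST6_COLUMNS):
            raise ValueError(f"expected 12 columns, got {len(fields)}: {line!r}")
        writer.writerow(fields)
        hits += 1
    return hits
```

In practice the two arguments would be file handles, for example the blastx output opened for reading and a .csv file opened for writing with `newline=""`.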