  • RNA de novo assembly - blasts - KEGG - GO

    Hello,

    I am a PhD candidate in bioinformatics with (almost) zero guidance, so I am seeking help here. I was asked to do a de novo transcriptome assembly from total RNA sequencing data. After FastQC I trimmed my original FASTQ files and then ran Trinity, which gave me trinity_trimmed.fasta. Some of the things I was asked to do are:

    1) Fill out a table like this one:

    |         | total number | total length (nt) | mean length (nt) | N50 | total consensus sequences | distinct clusters | distinct singletons |
    | Contig  |              |                   |                  |     |                           |                   |                     |
    | Unigene |              |                   |                  |     |                           |                   |                     |

    I ran TrinityStats.pl and got this:

    ## Counts of transcripts, etc.
    ################################
    Total trinity 'genes': 87177
    Total trinity transcripts: 169974
    Percent GC: 40.18

    ########################################
    Stats based on ALL transcript contigs:
    ########################################

    Contig N10: 3290
    Contig N20: 2503
    Contig N30: 2049
    Contig N40: 1713
    Contig N50: 1413

    Median contig length: 529
    Average contig: 869.67
    Total assembled bases: 147821426

    #####################################################
    ## Stats based on ONLY LONGEST ISOFORM per 'GENE':
    #####################################################

    Contig N10: 3087
    Contig N20: 2301
    Contig N30: 1816
    Contig N40: 1414
    Contig N50: 1029

    Median contig length: 348
    Average contig: 632.11
    Total assembled bases: 55105774

    My question has two parts: a) Can I fill out this table with this information? b) Some people use the CAP3 assembly tool; I have already run it too in case I need it. Is that the way to go? And do I need to check the quality of trinity_trimmed.fasta?
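For the first four columns of the table (total number, total length, mean length, N50), the values can also be recomputed directly from any FASTA file, which helps when cross-checking the TrinityStats.pl numbers. The following is only a minimal sketch: the function names are made up for illustration, and N50 is taken here as the length of the contig at which the running sum over contigs sorted longest-first reaches half of the total bases.

```python
from statistics import mean


def read_fasta_lengths(path):
    """Return a list of sequence lengths, one per FASTA record."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total, longest first,
    reaches at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_summary(lengths):
    """Collect the basic table columns for one set of sequences."""
    return {
        "total_number": len(lengths),
        "total_length_nt": sum(lengths),
        "mean_length_nt": round(mean(lengths), 2) if lengths else 0.0,
        "N50": n50(lengths),
    }
```

Calling `assembly_summary(read_fasta_lengths("trinity_trimmed.fasta"))` would give one row of the table. The "distinct clusters" and "distinct singletons" columns are not covered by this sketch, since they depend on a clustering step rather than on sequence lengths alone.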

    For the CAP3 output I also ran TrinityStats.pl and got this:

    for contigs:

    Total trinity 'genes': 23017
    Total trinity transcripts: 23017
    Percent GC: 40.42

    ########################################
    Stats based on ALL transcript contigs:
    ########################################

    Contig N10: 3885
    Contig N20: 3082
    Contig N30: 2598
    Contig N40: 2254
    Contig N50: 1971

    Median contig length: 1318
    Average contig: 1522.23
    Total assembled bases: 35037102

    - note: not reporting gene-based longest isoform info since couldn't parse Trinity accession info.

    for singletons:

    ## Counts of transcripts, etc.
    ################################
    Total trinity 'genes': 67695
    Total trinity transcripts: 81478
    Percent GC: 38.77

    ########################################
    Stats based on ALL transcript contigs:
    ########################################

    Contig N10: 1906
    Contig N20: 1347
    Contig N30: 1007
    Contig N40: 751
    Contig N50: 572

    Median contig length: 333
    Average contig: 490.70
    Total assembled bases: 39981353

    #####################################################
    ## Stats based on ONLY LONGEST ISOFORM per 'GENE':
    #####################################################

    Contig N10: 1853
    Contig N20: 1284
    Contig N30: 917
    Contig N40: 671
    Contig N50: 508

    Median contig length: 317
    Average contig: 461.01
    Total assembled bases: 31207973


    2) Provide the blastp/blastx results as Excel files.

    Should I use -outfmt 16?

    (Is hmmscan/Pfam also needed for KEGG/GO terms?)
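One possible route to a spreadsheet, sketched here under an assumption that differs from the question: in BLAST+, -outfmt 16 is single-file XML2, which Excel will not open as a flat table, whereas the tabular format (-outfmt 6) has one hit per line. Assuming the searches were run with the default -outfmt 6 columns, the sketch below adds a header row and writes a CSV that Excel can open. The function name and file paths are hypothetical.

```python
import csv

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast_tab_to_csv(tab_path, csv_path):
    """Copy a tab-separated BLAST table to CSV with a header row.

    Returns the number of hit rows written. Blank lines and comment
    lines starting with '#' (as produced by -outfmt 7) are skipped.
    """
    written = 0
    with open(tab_path, newline="") as src, open(csv_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        writer.writerow(OUTFMT6_COLUMNS)
        for row in reader:
            if not row or row[0].startswith("#"):
                continue
            writer.writerow(row)
            written += 1
    return written
```

For example, `blast_tab_to_csv("blastx.outfmt6", "blastx.csv")` would produce a file that opens in Excel with one column per field. If a custom column list was given to -outfmt 6, the header list above would need to be changed to match it.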

    3) Do a KEGG and GO analysis. Should I annotate the assembly (but which one, trinity_trimmed.fasta or the CAP3 one?) using Trinotate and then use GOseq for GO? Or could I use Blast2GO (7-day trial version) with the blastx/blastp files produced with -outfmt 16? Should KEGG also be done in Blast2GO, or could I use something like this: https://www.kegg.jp/blastkoala/ ?

    I know this was long, sorry about that.
