  • Trinity de novo Assemblies Only 50% Accurate-Help please!

    Hi,

    I am assembling the transcriptome of a desert rodent, the lesser Egyptian jerboa, using Trinity. I have two de novo Trinity assemblies, but in each only about 50% of the paired-end (PE) reads used for assembly map back with both mates on the same transcript. I am writing to seek help on how to improve the accuracy of my assembly.

    Here are the details of my library preps and Trinity assembly:

    1. Libraries were prepared with the Illumina strand-specific mRNA protocol; RNA RINs were 8-9. We ran PE100 on 3 different tissue types in duplicate (6 indexed libraries) pooled in a single lane; this yielded a total of 372 million reads (~50-70 million reads per sample). The raw data were of good quality (average Phred >32).


    2. Trimmomatic was used to clip Illumina adapters/indexes from the reads and to quality-trim them. I used the paired-end mode of Trimmomatic with these parameters:

    ILLUMINACLIP:/opt/biotools/trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:12:1:true LEADING:30 TRAILING:20 SLIDINGWINDOW:10:25 MINLEN:75

    After clipping and quality checks I had 322.6 million PE reads with average Phred of 39-40, sequence lengths 75-150bp and 51% GC content across reads.
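As a reading aid for the Trimmomatic step string in item 2, here is a minimal Python sketch that splits it into named fields. The field names follow my understanding of the Trimmomatic manual; the `STEP_FIELDS` table and `parse_steps` helper are hypothetical names invented for this illustration and are not part of Trimmomatic.

```python
# Field names per trimming step, as I understand them from the Trimmomatic manual.
# STEP_FIELDS and parse_steps are illustrative helpers, not a Trimmomatic API.
STEP_FIELDS = {
    "ILLUMINACLIP": [
        "adapter_fasta",
        "seed_mismatches",
        "palindrome_clip_threshold",
        "simple_clip_threshold",
        "min_adapter_length",
        "keep_both_reads",
    ],
    "LEADING": ["quality"],
    "TRAILING": ["quality"],
    "SLIDINGWINDOW": ["window_size", "required_quality"],
    "MINLEN": ["length"],
}


def parse_steps(step_string):
    """Split a whitespace-separated Trimmomatic step string into (name, fields) pairs."""
    parsed = []
    for token in step_string.split():
        name, *values = token.split(":")
        parsed.append((name, dict(zip(STEP_FIELDS[name], values))))
    return parsed


steps = parse_steps(
    "ILLUMINACLIP:/opt/biotools/trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:12:1:true "
    "LEADING:30 TRAILING:20 SLIDINGWINDOW:10:25 MINLEN:75"
)
for name, params in steps:
    print(name, params)
```

Read this way, the settings trim leading bases below quality 30 and trailing bases below quality 20, cut once the average quality in a 10-base window drops below 25, and then discard reads shorter than 75 bases.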


    3. I then performed Trinity assembly with Trinity version trinityrnaseq_r20140717 like so:

    --seqType fq --SS_lib_type FR --left JacPE100.R1.fq --right JacPE100.R2.fq --CPU 16 --JM 350G


    I used the default Trinity settings:

    Inchworm step:
    Kmer length set to: 25
    Min assembly length set to: 25
    Monitor turned on, set to: 1

    Chrysalis step:
    -min_contig_length 200
    -min_glue 2
    -glue_factor 0.05
    -min_iso_ratio 0.05
    -t 16
    -k 24
    -kk 48
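For readers unfamiliar with the term, the k-mer length of 25 listed under the Inchworm step refers to every 25-base substring of every read. The toy Python sketch below shows what counting "unique k-mers" means; it is an illustration only and assumes nothing about how Trinity counts k-mers internally. The function name `unique_kmers` and the two short reads are made up for the example, and `k=4` is used so the result can be checked by hand.

```python
def unique_kmers(reads, k=25):
    """Return the set of distinct length-k substrings found in the reads."""
    seen = set()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return seen


# Hypothetical toy reads; real reads here would be 75-100+ bases long.
reads = ["ACGTACGTAC", "CGTACGTACG"]
kmers = unique_kmers(reads, k=4)
print(len(kmers), sorted(kmers))
```

The two reads contain 14 k-mer windows between them but only 4 distinct 4-mers, which is the sense in which highly redundant reads add depth without adding new k-mers.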

    During this assembly, Trinity identified ~964.5 million unique k-mers. After assembly, I mapped all the PE reads back to my assembled transcriptome as described here: http://tinyurl.com/zearrod. The assembly stats are as follows:

    ################################
    ## Counts of transcripts, etc.
    ################################
    Total trinity 'genes': 306290
    Total trinity transcripts: 369174
    Percent GC: 48.47

    ########################################
    Stats based on ALL transcript contigs:
    ########################################

    Contig N10: 5172
    Contig N20: 3581
    Contig N30: 2517
    Contig N40: 1717
    Contig N50: 1095

    Median contig length: 333
    Average contig: 659.77
    Total assembled bases: 243568232

    #####################################################
    ## Stats based on ONLY LONGEST ISOFORM per 'GENE':
    #####################################################

    Contig N10: 4276
    Contig N20: 2665
    Contig N30: 1655
    Contig N40: 1006
    Contig N50: 664

    Median contig length: 313
    Average contig: 540.34
    Total assembled bases: 165499422
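For reference, the contig Nx values in these stats follow the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches x% of all assembled bases. Below is a minimal Python sketch of that definition; the function name `nx` and the toy lengths are assumptions made for illustration, not code taken from Trinity.

```python
def nx(lengths, x=50):
    """Length of the contig at which the cumulative sum (longest first) reaches x% of total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical toy assembly of five contigs (20 bases in total).
toy_lengths = [2, 3, 4, 5, 6]
print("N50:", nx(toy_lengths, 50))
print("N10:", nx(toy_lengths, 10))

# The reported average contig length is just total bases / transcript count.
average_all = 243568232 / 369174
print("average contig (ALL):", round(average_all, 2))
```

Because many short contigs pull N50 and the median down, a low N50 combined with a median near 300 bases is consistent with a fragmented assembly rather than with mis-assembled sequence.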

    Bowtie_Alignment Stats:

    [adi@comet-ln3 adi]$ cat slurm-1363147.out
    Thu Dec 10 20:51:34 PST 2015
    [1,385,700,000] lines read

    #read_type count pct
    proper_pairs 318231920 52.52
    improper_pairs 279823240 46.18
    right_only 4403056 0.73
    left_only 3523357 0.58

    Total aligned reads: 605981573
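The percentage column can be reproduced from the counts. The Python sketch below reads the proper_pairs count as 318231920, which is the value that makes the four categories sum to the reported total of 605981573 aligned reads. The variable names are illustrative only.

```python
# Counts copied from the first Bowtie alignment report above.
counts = {
    "proper_pairs": 318231920,
    "improper_pairs": 279823240,
    "right_only": 4403056,
    "left_only": 3523357,
}

total_aligned = sum(counts.values())
percentages = {name: 100 * n / total_aligned for name, n in counts.items()}

print("total aligned reads:", total_aligned)
for name, pct in percentages.items():
    print(f"{name}\t{counts[name]}\t{pct:.2f}")
```

Note that each percentage is a share of aligned reads, not of all input reads, so the table says how aligned reads are distributed across categories rather than how many reads failed to align.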


    4. I normalised my PE reads (max coverage 100x) to test whether this improves the assembly, as follows:

    /opt/biotools/trinity/util/insilico_read_normalization.pl --seqType fq --SS_lib_type FR --left JacPE100.R1.fq --right JacPE100.R2.fq --CPU 24 --JM 400G --max_cov 100 --pairs_together --PARALLEL_STATS

    This reduced the number of PE reads from 322 million to 28 million. All were still of high quality, as before. I ran the Trinity assembly again on this normalised data set using the same default parameters. Trinity identified 373.72 million unique k-mers this time. The stats for this assembly are here:

    ################################
    ## Counts of transcripts, etc.
    ################################
    Total trinity 'genes': 260331
    Total trinity transcripts: 337680
    Percent GC: 48.94

    ########################################
    Stats based on ALL transcript contigs:
    ########################################

    Contig N10: 5605
    Contig N20: 4089
    Contig N30: 3133
    Contig N40: 2384
    Contig N50: 1731

    Median contig length: 363
    Average contig: 812.09
    Total assembled bases: 274224929


    #####################################################
    ## Stats based on ONLY LONGEST ISOFORM per 'GENE':
    #####################################################

    Contig N10: 4655
    Contig N20: 3047
    Contig N30: 1994
    Contig N40: 1209
    Contig N50: 757

    Median contig length: 318
    Average contig: 572.85
    Total assembled bases: 149130461

    Bowtie_Alignment Stats:

    [adi@comet-ln3 adi]$ cat slurm-1402391.out
    Wed Dec 23 15:48:53 PST 2015
    [124,600,000] lines read

    #read_type count pct
    proper_pairs 24156938 49.22
    improper_pairs 23886180 48.67
    right_only 589379 1.20
    left_only 448786 0.91

    Total aligned reads: 49081283
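A quick way to see how much the normalisation removed is to compare the read and unique k-mer counts quoted for the two runs. The Python sketch below uses only the rounded figures given in this post (322 million and 28 million PE reads, 964.5 million and 373.72 million unique k-mers); the variable names are illustrative.

```python
# Figures as quoted in the post, in millions.
reads_before, reads_after = 322.0, 28.0
kmers_before, kmers_after = 964.5, 373.72

read_fraction = reads_after / reads_before
kmer_fraction = kmers_after / kmers_before

print(f"reads retained after normalisation: {100 * read_fraction:.1f}%")
print(f"unique k-mers retained:             {100 * kmer_fraction:.1f}%")
```

On these figures roughly 9% of the reads remain while roughly 39% of the unique k-mers do, so the reads were cut far more sharply than the distinct sequence content.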


    It appears that ~50% of my assembly is incorrect, and I understand that with Trinity typically 70-80% of reads map back to the assembly as proper pairs. I wonder whether we need to tweak Trinity parameters to improve the assemblies? I shall be most grateful for any help and advice you could provide!

    Wish you a very happy 2016!


  • #2
    FYI.

    Aditya also posted the question to the Trinity mailing list. The reply from Brian Haas (one of Trinity's authors) was as follows:
    The improper pairs count just indicates that the two reads of a pair are mapping to two different contigs and this is just because the corresponding transcripts are fragmented. We should probably use a different term to describe this.

    It appears you have a lot of redundancy in your sequencing given the high amount of normalization. You might take a look at what those extremely highly expressed transcripts are. If it's rRNA, then you might question your library construction approach. The only thing I'd recommend to do here given these stats is to sequence much deeper to improve the assembly, and absolutely use the normalization.

    If you think you can still do better given your existing reads, I'd suggest running some alternative assembly tools and then contrasting them using the approaches we outline on our wiki, including the use of DETONATE.

    I also suspect the assembly you have right now is still plenty good enough to accomplish many of your research goals.
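To make the distinction in the reply concrete, the Python sketch below sorts aligned reads into the four report categories using the standard SAM FLAG bits. This is an illustration under stated assumptions, not the script Trinity uses to produce its report: the category names are borrowed from the output above, the `classify` function is hypothetical, and mapping "first in pair" to `left_only` is my assumption.

```python
# Standard SAM FLAG bits (from the SAM format specification).
PROPER_PAIR = 0x2      # aligner considers the pair properly aligned
UNMAPPED = 0x4         # this read is unmapped
MATE_UNMAPPED = 0x8    # the mate is unmapped
FIRST_IN_PAIR = 0x40   # read 1 of the pair (assumed here to be the "left" read)


def classify(flag):
    """Assign a paired-end SAM record to a report category based on its FLAG."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        # Only one mate aligned: report which side it was.
        return "left_only" if flag & FIRST_IN_PAIR else "right_only"
    if flag & PROPER_PAIR:
        return "proper_pairs"
    # Both mates aligned, but not as a proper pair (e.g. on different contigs).
    return "improper_pairs"


for flag in (99, 65, 73, 137):
    print(flag, classify(flag))
```

Under this scheme a pair whose mates land on two different contigs is counted as improper even though both reads aligned, which matches the reply's point that the improper-pair share reflects fragmented transcripts.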
