Trinity de novo Assemblies Only 50% Accurate-Help please!

adisaxena

Junior Member

Join Date: Aug 2015

Posts: 6
#1


01-05-2016, 12:09 PM

Hi,

I am assembling the transcriptome of a desert rodent, the lesser Egyptian jerboa, using Trinity. I have 2 de novo Trinity assemblies, but only about 50% of the PE reads used for assembly map back as proper pairs, i.e. with both mates on the same transcript. I am writing to seek help on how to improve the accuracy of my assembly.

Here are the details of my library preps and Trinity assembly:

1. Libraries were prepared with the Illumina strand-specific mRNA protocol, and RNA RINs were 8-9. We ran PE100 runs on 3 different tissue types in duplicate (6 differently indexed libraries) pooled in a single lane; this yielded a total of 372 million reads (~50-70 million reads per sample). The raw data was of quite good quality (average Phred >32).

2. Trimmomatic was used to clip Illumina adapters/indexes from the reads and to filter on read quality. I used the PAIRED-END mode of Trimmomatic with these parameters:
ILLUMINACLIP:/opt/biotools/trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:12:1:true LEADING:30 TRAILING:20 SLIDINGWINDOW:10:25 MINLEN:75 
After clipping and quality checks I had 322.6 million PE reads with average Phred of 39-40, sequence lengths 75-150bp and 51% GC content across reads. 

3. I then performed Trinity assembly with Trinity version trinityrnaseq_r20140717 like so: 
--seqType fq --SS_lib_type FR --left JacPE100.R1.fq --right JacPE100.R2.fq --CPU 16 --JM 350G 

I used the default Trinity settings:
Inchworm step:
Kmer length set to: 25
Min assembly length set to: 25
Monitor turned on, set to: 1

Chrysalis step:
-min_contig_length 200
-min_glue 2
-glue_factor 0.05
-min_iso_ratio 0.05
-t 16
-k 24
-kk 48

During this assembly, Trinity identified ~964.5 million unique k-mers. After assembly, I mapped all the PE reads back to my assembled transcriptome as described here:
http://tinyurl.com/zearrod. The assembly stats are as follows:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 306290
Total trinity transcripts: 369174
Percent GC: 48.47

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 5172
Contig N20: 3581
Contig N30: 2517
Contig N40: 1717
Contig N50: 1095

Median contig length: 333
Average contig: 659.77
Total assembled bases: 243568232

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 4276
Contig N20: 2665
Contig N30: 1655
Contig N40: 1006
Contig N50: 664

Median contig length: 313
Average contig: 540.34
Total assembled bases: 165499422
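
For anyone reading the Nx lines above: the sketch below shows how an Nx value is usually derived from a list of contig lengths. It is a minimal illustration, not the code Trinity's stats script runs; the function name `nx` and the toy lengths are made up for the example.

```python
def nx(lengths, x):
    """Return the Nx length: the contig length L such that contigs of
    length >= L together contain at least x percent of all assembled bases."""
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    raise ValueError("no contig lengths given")


# Toy example (hypothetical lengths, 1500 bases in total).
toy_lengths = [100, 200, 300, 400, 500]
n50 = nx(toy_lengths, 50)
n10 = nx(toy_lengths, 10)
print("N50:", n50, "N10:", n10)
```

Because many short contigs barely move the cumulative sum, a large number of fragments in the 200-400 bp range pulls the median down much more than it pulls N50 down, which fits the gap between median and N50 in the stats above.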

Bowtie_Alignment Stats:

[adi@comet-ln3 adi]$ cat slurm-1363147.out
Thu Dec 10 20:51:34 PST 2015
[1,385,700,000] lines read

#read_type count pct
proper_pairs 318231920 52.52
improper_pairs 279823240 46.18
right_only 4403056 0.73
left_only 3523357 0.58

Total aligned reads: 605981573
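
As a sanity check on the table above, the percentages can be recomputed from the counts. The sketch below takes the proper_pairs count as 318231920, which is the value consistent with the reported total of 605981573 aligned reads; the dictionary and function names are just for illustration.

```python
# Counts copied from the alignment summary above (first assembly).
counts = {
    "proper_pairs": 318231920,
    "improper_pairs": 279823240,
    "right_only": 4403056,
    "left_only": 3523357,
}


def percentages(read_counts):
    """Return (total, {category: percent of total}) for a dict of counts."""
    total = sum(read_counts.values())
    pct = {name: 100.0 * n / total for name, n in read_counts.items()}
    return total, pct


total, pct = percentages(counts)
print("Total aligned reads:", total)
for name, value in pct.items():
    print(f"{name}\t{counts[name]}\t{value:.2f}")
```

Note that the percentages are relative to aligned reads only, so reads that did not align at all are not reflected in these numbers.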

4. I normalised my PE reads (to 100x maximum coverage) to test whether that improves my assembly, like so:
/opt/biotools/trinity/util/insilico_read_normalization.pl --seqType fq --SS_lib_type FR --left JacPE100.R1.fq --right JacPE100.R2.fq --CPU 24 --JM 400G --max_cov 100 --pairs_together --PARALLEL_STATS
This reduced the number of PE reads from 322 million to 28 million. All were still of great quality, as before. I ran the Trinity assembly again on this normalised data set using the default parameters, as before. Trinity identified 373.72 million unique k-mers this time. The stats for this assembly are here:
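
To give a feel for why coverage normalisation can discard so many reads, here is a toy sketch of the general idea, written in the style of digital normalisation (keep a read only while the median count of its k-mers is still below a cutoff). This is an assumption-laden illustration and does not reproduce Trinity's own script; the function names, the tiny k, the cutoff, and the reads are all hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=5, max_cov=2):
    """Keep a read only if the median count of its k-mers, among reads
    kept so far, is still below max_cov (diginorm-style, single pass)."""
    seen = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(seen[km] for km in read_kmers) < max_cov:
            kept.append(read)
            seen.update(read_kmers)
    return kept


# Ten copies of a "highly expressed" read plus one rare read (toy data).
abundant = "ACGGTCATTG"
rare = "TTGCAAGCTA"
reads = [abundant] * 10 + [rare]
kept = normalize(reads, k=5, max_cov=2)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy run the redundant copies are dropped once the cutoff is reached, while the rare read is kept, so a large reduction mostly says the library is dominated by a few very abundant sequences.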

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 260331
Total trinity transcripts: 337680
Percent GC: 48.94

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 5605
Contig N20: 4089
Contig N30: 3133
Contig N40: 2384
Contig N50: 1731

Median contig length: 363
Average contig: 812.09
Total assembled bases: 274224929

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 4655
Contig N20: 3047
Contig N30: 1994
Contig N40: 1209
Contig N50: 757

Median contig length: 318
Average contig: 572.85
Total assembled bases: 149130461

Bowtie_Alignment Stats:

[adi@comet-ln3 adi]$ cat slurm-1402391.out
Wed Dec 23 15:48:53 PST 2015
[124,600,000] lines read

#read_type count pct
proper_pairs 24156938 49.22
improper_pairs 23886180 48.67
right_only 589379 1.20
left_only 448786 0.91

Total aligned reads: 49081283

It appears that ~50% of my assembly is incorrect, and I understand that Trinity assemblies typically have 70-80% of reads mapping back as proper pairs. I wonder whether we need to tweak Trinity parameters to improve the assemblies? I shall be most grateful for any help and advice you could provide!
Wish you a very happy 2016! 
westerman

Rick Westerman

Join Date: Jun 2008

Posts: 1103
#2

01-06-2016, 09:52 AM

FYI.

Aditya also posted the question to the Trinity mailing list. The reply from Brian Haas (one of Trinity's authors) was as follows:

The improper pairs count just indicates that the two reads of a pair are mapping to two different contigs and this is just because the corresponding transcripts are fragmented. We should probably use a different term to describe this.

It appears you have a lot of redundancy in your sequencing given the high amount of normalization. You might take a look at what those extremely highly expressed transcripts are. If it's rRNA, then you might question your library construction approach. The only thing I'd recommend to do here given these stats is to sequence much deeper to improve the assembly, and absolutely use the normalization.

If you think you can still do better given your existing reads, I'd suggest running some alternative assembly tools and then contrast them using the approaches we outline on our wiki, including the use of detonate.

I also suspect the assembly you have right now is still plenty good enough to accomplish many of your research goals.
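
To make the point about improper pairs in the reply above concrete, here is a minimal sketch of the contig-based rule it describes: a pair counts as improper simply because its two mates land on different contigs. The function name and contig IDs are hypothetical, and a real alignment summary would also consider orientation and distance, which this sketch ignores.

```python
from collections import Counter


def classify_pair(left_contig, right_contig):
    """Classify one read pair by the contig each mate aligned to.
    None means that mate did not align."""
    if left_contig is None and right_contig is None:
        return "unaligned"
    if right_contig is None:
        return "left_only"
    if left_contig is None:
        return "right_only"
    if left_contig == right_contig:
        return "proper_pairs"
    # Both mates aligned, but to different contigs: typical when one
    # transcript was assembled as several fragments.
    return "improper_pairs"


# Toy alignments: (contig of left mate, contig of right mate).
pairs = [
    ("contig_A", "contig_A"),
    ("contig_A", "contig_B"),  # transcript split across two contigs
    ("contig_C", None),
    (None, "contig_C"),
]
summary = Counter(classify_pair(left, right) for left, right in pairs)
print(dict(summary))
```

Under this rule a fragmented but otherwise correct assembly produces many improper pairs, so the ~50% figure measures fragmentation rather than the fraction of wrong contigs.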