SEQanswers

Old 03-21-2013, 01:02 PM   #1
AmyEllison
Member
 
Location: Ithaca, NY

Join Date: Nov 2012
Posts: 16
In silico read normalization prior to de novo assembly

I am about to use Trinity for de novo transcriptome assembly prior to differential expression analyses.

I have 11 individuals (5 control, 6 treated) × 3 tissue types = 33 samples, each with ~20 million ~80 bp single-end reads after trimming and QC... so that's about 660 million single-end reads in total!

To cut down what is likely to be a LONG Trinity run, would you suggest using Trinity's normalization script (or something similar, e.g. khmer) prior to assembly?
Or should I just assemble from a small subset of samples?

I don't know how much individual genetic variability there is, so I'm worried that using a subset for assembly will miss rarer transcripts.

Does anyone here have experience with normalization? Are there any downsides to it compared with using a subset of samples?

Any advice or experiences much appreciated!
Old 03-27-2013, 05:30 PM   #2
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
It seems Trinity's in silico read normalization hasn't been published yet.
Old 03-27-2013, 05:46 PM   #3
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
The following links may be helpful:

DigiNorm on Paired-end samples
http://seqanswers.com/forums/showthread.php?t=23612

What is digital normalization, anyway?
http://ivory.idyll.org/blog/what-is-diginorm.html

Digital normalization of short-read shotgun data
http://ivory.idyll.org/blog/diginorm-paper-posted.html

Basic Digital Normalization
https://wiki.hpcc.msu.edu/display/Bi...+Normalization

A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
http://arxiv.org/abs/1203.4802

What does Trinity's In Silico normalization do?
http://ivory.idyll.org/blog/trinity-...normalize.html

Old 03-28-2013, 05:46 AM   #4
AmyEllison
Member
 
Location: Ithaca, NY

Join Date: Nov 2012
Posts: 16
Thanks pengchy!

I have already read all of those and was wondering whether anyone has experience with their own data.

It has also been suggested that I take all the reads from one individual (all tissue types) and assemble those, since using multiple individuals may introduce too much ambiguity.

The samples are clutch mates (frogs) but not inbred lines, so there will be some variability between them, plus heterozygosity. With that option, though, the question is: which individual? Control or treated?

Any thoughts from you knowledgeable lot on seqanswers much appreciated!!
Old 03-28-2013, 05:56 AM   #5
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Hi Amy,

I am preparing to do the same kind of analysis and will be glad to exchange experiences with you here when I finish my test.

best,
pch
Old 03-28-2013, 06:05 AM   #6
AmyEllison
Member
 
Location: Ithaca, NY

Join Date: Nov 2012
Posts: 16
Great, thanks!

I am running Trinity's method at the moment (I would have liked to use Titus Brown's more efficient version of Trinity's method, but I'm waiting for that to be installed); it has been running for 2 days now.

I gave it 100 GB RAM and 10 CPUs, which seems to have been fine for the Jellyfish step (counting k-mer occurrences). It has now been writing the .stats file for a looooong time, but it isn't maxing out the memory and is only using 1 CPU.
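
For reference, the command I'm running looks roughly like this (just a sketch: the script is called normalize_by_kmer_coverage.pl in my Trinity release, newer releases may rename it, and the file name is a placeholder):

# normalize single-end reads to a maximum k-mer coverage of 30,
# giving the Jellyfish step 100 GB of memory and 10 threads
$TRINITY_HOME/util/normalize_by_kmer_coverage.pl --seqType fq \
    --single all_samples.trimmed.fq --JM 100G --max_cov 30 --CPU 10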
Old 03-28-2013, 12:56 PM   #7
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
I find that Trinity's normalization takes a long time to run -- long enough that it almost defeats the purpose of normalization in the first place. Days of run time -- yep. We need to fix this some day.
Old 04-01-2013, 12:30 PM   #8
AmyEllison
Member
 
Location: Ithaca, NY

Join Date: Nov 2012
Posts: 16
If anyone is interested:

Trinity normalization on ~854 million reads took about 2 days on a high-memory machine (I gave it 300 GB of memory and 40 cores).

It got the read count down from 854 million to just 66 million!
Old 04-02-2013, 01:39 AM   #9
2seq
Junior Member
 
Location: Israel

Join Date: Oct 2012
Posts: 9
That's great that you reduced the number of reads, but how much is that reduction expected to improve Trinity's performance? Will it cut the run time down considerably -- enough to justify the normalization step?

Best regards and great to find others working on similar projects!
Old 04-02-2013, 05:41 AM   #10
AmyEllison
Member
 
Location: Ithaca, NY

Join Date: Nov 2012
Posts: 16
Well, that's the part I'd like others' experiences with! This is my first de novo assembly.

Assembling the normalized reads (with the same number of cores, memory, etc.) took less than a day. I'm running the full 854 million now to see how the assemblies compare; it has been going for 2 days already.

It was also suggested that I try assembling all tissues from one individual (being careful in downstream analyses, since that individual's reads will map back to the assembly better than the others'), as variability between individuals could create ambiguity during assembly.

I tried this: the normalized all-samples assembly has N50 = 1596, the single-individual assembly N50 = 2029. Bowtie mapping rates back to the assemblies: normalized = 80.65%, single individual = 79.02%. I'm currently running BLAST to see which annotates better.
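
In case it's useful, here is roughly how I got those numbers (a sketch: the index and file names are mine, and your Trinity/bowtie versions may differ):

# contig stats, including N50, from Trinity's bundled script
$TRINITY_HOME/util/TrinityStats.pl Trinity.fasta

# mapping rate: bowtie prints "reads with at least one reported
# alignment" (the percentage I quoted) on stderr
bowtie-build Trinity.fasta trinity_idx
bowtie -p 8 trinity_idx all_reads.fq > /dev/null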

Does anyone have any other thoughts on how to test which is the "best" assembly?
Old 04-02-2013, 07:08 AM   #11
2seq
Junior Member
 
Location: Israel

Join Date: Oct 2012
Posts: 9
I'm sorry, I don't have any answers for you. I've got ~270 million reads, so I'm not doing the normalization step for this run, but I will continue to watch this thread to see how your experiment comes out in the end. I'll be posting a question about installing Trinity with regard to jellyfish... feel free to take a peek; maybe it's something you've encountered?
Old 04-04-2013, 07:29 AM   #12
AmyEllison
Member
 
Location: Ithaca, NY

Join Date: Nov 2012
Posts: 16
In the end, the assembly of the full read set took only about 3 days -- so 2 days of normalizing plus 1 day of assembly amounts to little or no time saved.

The full-read assembly gave rise to only marginally more contigs (~455,000 vs. ~447,000 from normalized reads) and a lower N50 (1227 vs. 1596).

I think Titus Brown's version of Trinity's method (which, unfortunately, I have not yet been able to get installed on our machines) probably does make normalizing worth it for a read set of my size.
Old 04-04-2013, 11:21 PM   #13
2seq
Junior Member
 
Location: Israel

Join Date: Oct 2012
Posts: 9
FYI for those with access to more computing power: I used 24 cores and 119 GB on my 270 million reads without normalization and finished in 1 day.
It may also have gone a bit faster because I ran it with the --min_kmer_cov 2 parameter.
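
For reference, my command looked something like this (a sketch: file names are placeholders and flags may vary between Trinity releases):

# single-end assembly, counting only k-mers seen at least twice
Trinity.pl --seqType fq --single my_reads.fq \
    --JM 100G --CPU 24 --min_kmer_cov 2 --output trinity_out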
Old 04-11-2013, 07:19 AM   #14
eppi
Junior Member
 
Location: Massachusetts

Join Date: Sep 2011
Posts: 3
Map back to the assembly

Hello everybody, interesting discussion.
We used Trinity on 10 samples: 5 tissues from each of 2 animals, one sick and one not. Total reads were >600 million, and on a 'big machine' (sorry, I'm not sure of the RAM and core count) it took <3 days.

The problem is that when I map the reads back to the contigs as suggested, only 30% map!! Any clue? Does anyone else have this problem? Is this a real issue, or is it normal given the tissue diversity?

Thanks for your help!
eppi
Old 04-11-2013, 07:33 AM   #15
AmyEllison
Member
 
Location: Ithaca, NY

Join Date: Nov 2012
Posts: 16
Hi eppi,

I'm afraid I don't have that problem, but out of interest, how many transcripts and components did you get?

I have produced ~447,000 transcripts (~350,000 components) -- this seems far too many. I'm worried it's from pooling tissues and individuals together for assembly. Has anyone else got such a large number?? Any suggestions on how to reduce redundancy?
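
One generic option I'm considering (not something from the Trinity docs, so treat this as a sketch) is collapsing near-identical transcripts with CD-HIT-EST:

# cluster transcripts that are >=95% identical; -n 10 is the
# word size recommended for this identity threshold
cd-hit-est -i Trinity.fasta -o Trinity_nr95.fasta -c 0.95 -n 10 -T 8 -M 16000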
Old 04-11-2013, 07:56 AM   #16
eppi
Junior Member
 
Location: Massachusetts

Join Date: Sep 2011
Posts: 3
Hi AmyEllison,

We got 'only' >1.5 million contigs... so I stopped there and thought something was wrong. I should say that I mapped only one of the samples back to the assembly, so maybe it is a sample-specific issue. Still, I am inclined to think it could come from pooling samples. (By the way, the machine had about 244 GB of memory.) Thanks!
Old 07-19-2014, 05:57 PM   #17
crusoe
Programmer & Bioinformatician
 
Location: Lansing, MI

Join Date: Oct 2012
Posts: 10
It is our experience (in the GED Lab, home of the khmer project) that digital normalization prior to running Trinity works just as well as Trinity's built-in implementation, and is faster.

Our mRNAseq protocol can guide you through the entire process: http://khmer-protocols.readthedocs.o...seq/index.html

We have also greatly improved the installation procedure: http://khmer.readthedocs.org/en/latest/install.html
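
The core diginorm step for single-end reads looks roughly like this (a minimal sketch with illustrative parameters; the protocol above gives the settings we actually recommend):

# keep reads whose median 20-mer coverage is still below 20;
# the retained reads are written to reads.fq.keep
normalize-by-median.py -k 20 -C 20 -N 4 -x 2e9 reads.fq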

Cheers,
Tags
de novo assembly, read normalization, rna-seq
