SEQanswers

Old 02-22-2015, 09:24 PM   #1
Alun3.1
Is it possible to estimate mRNA-seq depth/coverage just with genome size?

Hi,

Being a newbie in NGS, I have a very basic question.

I sequenced tissue mRNAs using a paired-end strategy.
Is it possible to calculate the depth of an overall mRNA-seq experiment when no reference genome or transcriptome data are available (but knowing only the genome size)?

Can we use the following formula, or is it correct only for calculating genome depth?
coverage = (average read length) * (number of raw forward + reverse reads) / (haploid genome size)
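
For illustration, here is that formula as a minimal Python sketch (the read counts and genome size below are hypothetical):

Code:
# coverage = (average read length * total reads) / (haploid genome size)
avg_read_len = 100            # bp, average read length
n_reads = 2 * 20_000_000      # raw forward + reverse reads (20 M pairs)
genome_size = 3_000_000_000   # bp, haploid genome size (human-sized)

coverage = avg_read_len * n_reads / genome_size
print(f"~{coverage:.1f}x genome coverage")   # ~1.3x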

I also read (UCSC - ENCODE Project: http://genome.ucsc.edu/ENCODE/protoc...dards_V1.0.pdf) that we can estimate the depth using this formula:
(number of NT sequenced / number of mRNA molecules per cell) / (average mRNA length)
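
As a sketch under the same caveat (the molecules-per-cell and mRNA-length figures are assumed ballpark values, not measurements):

Code:
# depth = (NT sequenced / mRNA molecules per cell) / (average mRNA length)
nt_sequenced = 100 * 40_000_000   # total nucleotides sequenced
mrna_per_cell = 300_000           # assumed mRNA molecules per cell
avg_mrna_len = 2_000              # bp, assumed average mRNA length

depth = (nt_sequenced / mrna_per_cell) / avg_mrna_len
print(f"~{depth:.1f}x average depth per transcript copy")   # ~6.7x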

Am I wrong if I say that it seems very approximate to me?
Because of the different levels of expression of every single transcript, does it make any sense trying to know the depth of an RNA-seq experiment?


Thanks for your help !

Old 02-23-2015, 12:24 AM   #2
sarvidsson

Quote:
Originally Posted by Alun3.1 View Post
Hi,
Being a newbie in NGS, I have a very basic question.

I sequenced tissue mRNAs using a paired-end strategy.
Is it possible to calculate the depth of an overall mRNA-seq experiment when no reference genome or transcriptome data are available (but knowing only the genome size)?
Not in any sensible way.

Quote:
Originally Posted by Alun3.1 View Post
Can we use the following formula, or is it correct only for calculating genome depth?
coverage = (average read length) * (number of raw forward + reverse reads) / (haploid genome size)
That formula only makes sense for whole genome sequencing.

Quote:
Originally Posted by Alun3.1 View Post
I also read (UCSC - ENCODE Project: http://genome.ucsc.edu/ENCODE/protoc...dards_V1.0.pdf) that we can estimate the depth using this formula:
(number of NT sequenced / number of mRNA molecules per cell) / (average mRNA length)

Am I wrong if I say that it seems very approximate to me?
No, it is very approximate, and it would only give you the average coverage across transcripts. That average will be incorrect for most transcripts, as expression levels aren't normally distributed.
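
A toy simulation makes this concrete (the log-normal expression distribution and its parameters are arbitrary assumptions):

Code:
import random

# Under a skewed expression distribution, most transcripts receive far
# fewer reads than the naive "average coverage" suggests.
random.seed(1)
n_transcripts = 10_000
total_reads = 30_000_000

expr = [random.lognormvariate(0, 2) for _ in range(n_transcripts)]
total_expr = sum(expr)
reads_per_tx = [total_reads * e / total_expr for e in expr]

mean_reads = total_reads / n_transcripts
below = sum(r < mean_reads for r in reads_per_tx)
print(f"{below / n_transcripts:.0%} of transcripts get fewer reads "
      f"than the {mean_reads:.0f}-read average")   # ~84%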

Quote:
Originally Posted by Alun3.1 View Post
Because of the different levels of expression of every single transcript, does it make any sense trying to know the depth of a RNA-seq experiment?

Thanks for your help !
The average depth makes little sense.

For RNA-Seq with differential expression analysis in mind, you usually select sequencing depth based on previous experience or some rule of thumb, as the exact numbers are unknown for your experiment (which is why you carry it out in the first place).

A place to start would be the following: for a "standard" DE experiment, with a typical "higher eukaryote" species, 10-50 million reads per sample are usually "enough". If you have many replicates, the lower end is usually fine; if you have few replicates, and/or are interested in genes with generally very low expression (e.g. transcription factors), or in fine-tuned gene regulation (small differences in expression between samples), the upper end is recommended.
For a prokaryote, 5-20 million reads are "enough" - if your rRNA depletion protocol works well.

For RNA-Seq with transcriptome assembly as a primary goal, things change a bit as you can choose between different strategies. But I suppose you are interested in DE.
Old 02-23-2015, 06:21 AM   #3
pmiguel

Strange. It seems like people use the term "depth of coverage" more often for RNA-seq experiments, where it really doesn't make sense, than for DNA-seq, where it does.

--
Phillip
Old 02-23-2015, 04:48 PM   #4
Alun3.1

Thanks sarvidsson !

Quote:
for a "standard" DE experiment, with a typical "higher eukaryote" species, usually 10-50 million reads per sample are "enough"
So I assume it also depends on the species you study, the complexity of the transcriptome, the length of the reads (and the cost of the sequencing).
What if you focus only on mRNAs and get the same number of reads (10-50 million)? As they are a (small) fraction of the total RNA, one could think that having 10-50 million reads from mRNA is more complete than 10-50 million reads from total RNA, right? Then you could potentially detect rare transcripts without needing 100-200 million reads?
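
A back-of-the-envelope sketch of that reasoning (both mRNA fractions below are assumed ballpark values):

Code:
# Reads landing on mRNA, with vs. without poly(A) selection.
total_reads = 30_000_000
mrna_frac_total_rna = 0.03   # assumed mRNA share of total-RNA reads
mrna_frac_polya = 0.95       # assumed mRNA share after poly(A) selection

print(f"total RNA-seq: ~{total_reads * mrna_frac_total_rna / 1e6:.1f} M mRNA reads")
print(f"poly(A)-seq:   ~{total_reads * mrna_frac_polya / 1e6:.1f} M mRNA reads")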

Quote:
For RNA-Seq with transcriptome assembly as a primary goal, things change a bit as you can choose between different strategies. But I suppose you are interested in DE.
Yes, I am more into DE. But if you want to assemble a transcriptome, I assume (depending on whether it is reference-guided or de novo) you would definitely need more reads, as long as possible?

Old 02-23-2015, 11:15 PM   #5
sarvidsson

Quote:
Originally Posted by Alun3.1 View Post
So I assume it also depends on the species you study, the complexity of the transcriptome, the length of the reads (and the cost of the sequencing).
Life is full of compromises

Quote:
Originally Posted by Alun3.1 View Post
What if you focus only on mRNAs and get the same number of reads (10-50 million)? As they are a (small) fraction of the total RNA, one could think that having 10-50 million reads from mRNA is more complete than 10-50 million reads from total RNA, right? Then you could potentially detect rare transcripts without needing 100-200 million reads?
With undegraded RNA and a well-trained technician we typically get ~93-98% mRNA, and with 30-50 million reads we typically see most known transcripts for the specific tissue (the numbers depend on the complexity of the tissue and the species).

Some recommendations to read on the subject:
http://bioinformatics.oxfordjournals...ent/27/13/i383
http://www.biomedcentral.com/1471-2105/12/S10/S5

Quote:
Originally Posted by Alun3.1 View Post
Yes, I am more into DE. But if you want to assemble a transcriptome, I assume (depending on whether it is reference-guided or de novo) you would definitely need more reads, as long as possible?
IMO both are necessary - I'd recommend a wet-lab normalized cDNA library on half to one MiSeq V3 (2x300 bp) run (or possibly PacBio; we don't have one, however) + whatever samples you would like to study the expression of, on as many HiSeq lanes as you need. Then in silico normalize the HiSeq reads and assemble the whole thing.
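
For the in silico normalization step, tools such as khmer or Trinity's insilico_read_normalization.pl are commonly used; the toy Python sketch below only illustrates the core digital-normalization idea (the k-mer size and coverage cutoff are arbitrary), not a production implementation:

Code:
from collections import Counter
from statistics import median

def diginorm(reads, k=20, cutoff=30):
    """Toy digital normalization: keep a read only while the median
    count of its k-mers is still below the coverage cutoff."""
    counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        if median(counts[km] for km in kmers) < cutoff:
            counts.update(kmers)   # only kept reads add to the counts
            yield read

# Usage with made-up reads: the highly redundant read is down-sampled.
reads = ["ACGT" * 10] * 100 + ["GATTACA" * 6]
kept = list(diginorm(reads))
print(f"kept {len(kept)} of {len(reads)} reads")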
Old 02-24-2015, 05:11 AM   #6
pmiguel

Quote:
Originally Posted by sarvidsson View Post
With undegraded RNA and a well-trained technician we typically get ~93-98% mRNA, and with 30-50 million reads we typically see most known transcripts for the specific tissue (the numbers depend on the complexity of the tissue and the species).
If your main method for determining which genes are expressed in a given tissue is sequencing 30-50 million reads from its transcriptome, then what you see when you sequence 30-50 million reads from a tissue will be "most known transcripts for the specific tissue".

Which is fine. But the 30-50 million reads figure is just what is fashionable at the moment. Should it become possible to obtain 300-500 million reads per sample for around $500/€ 440, that will probably become the new standard.

--
Phillip
Old 02-24-2015, 05:23 AM   #7
sarvidsson

Quote:
Originally Posted by pmiguel View Post
If your main method for determining which genes are expressed in a given tissue is sequencing 30-50 million reads from its transcriptome, then what you see when you sequence 30-50 million reads from a tissue will be "most known transcripts for the specific tissue".

Which is fine. But the 30-50 million reads figure is just what is fashionable at the moment. Should it become possible to obtain 300-500 million reads per sample for around $500/€ 440, that will probably become the new standard.
Point taken. But if 300-500 million reads per sample were that cheap, for most research questions I'd rather analyze 5 times more samples at 60-100 million reads per sample, provided that library costs follow the same trend.
Old 02-24-2015, 07:13 PM   #8
Alun3.1

Thanks guys for your replies !
Old 02-25-2015, 10:15 AM   #9
pmiguel

Quote:
Originally Posted by sarvidsson View Post
Point taken. But if 300-500 million reads per sample were that cheap, for most research questions I'd rather analyze 5 times more samples at 60-100 million reads per sample, provided that library costs follow the same trend.
And yet, there were DE experiments done on 1/4 PTP 454 runs that typically generated less than 200K reads split among lots of samples. If a DE experiment that generated 40,000 reads per sample--still with 3 replicates--was considered reasonable back then, why don't people do 15 replicates now?

--
Phillip
Old 02-25-2015, 11:03 PM   #10
sarvidsson

Quote:
Originally Posted by pmiguel View Post
And yet, there were DE experiments done on 1/4 PTP 454 runs that typically generated less than 200K reads split among lots of samples. If a DE experiment that generated 40,000 reads per sample--still with 3 replicates--was considered reasonable back then, why don't people do 15 replicates now?
The library costs tend to be prohibitive for the academic customers we have - from 454 to Illumina, these costs haven't dropped nearly as much as the sequencing costs have. So a "screen a few samples with RNA-Seq, then validate on many samples by RT-qPCR" mentality is quite common. I could speculate on other reasons as well - e.g. statistical training is seldom attractive to biology PhD students here. The commercial customers we have are generally more interested in speedy results, so they tend to spend more money on RNA-Seq libraries... but this is just my current experience.