SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
RNA-Seq: The sensitivity of massively parallel sequencing for detecting candidate inf Newsbot! Literature Watch 0 05-24-2011 02:00 AM
RNA-Seq: Composite Transcriptome Assembly of RNA-seq data in a Sheep Model for Delaye Newsbot! Literature Watch 0 03-26-2011 02:02 AM
RNA-Seq: Accurate Estimation of Expression Levels of Homologous Genes in RNA-seq Expe Newsbot! Literature Watch 0 03-10-2011 03:00 AM
RNA-Seq: A survey of statistical software for analysing RNA-seq data. Newsbot! Literature Watch 12 12-20-2010 09:10 PM
RNA-Seq: A two-parameter generalized Poisson model to improve the analysis of RNA-seq Newsbot! Literature Watch 0 07-31-2010 02:40 AM

Reply
 
Thread Tools
Old 02-14-2012, 07:50 AM   #1
schelhorn
Member
 
Location: Germany

Join Date: Sep 2010
Posts: 10
Default Statistical model for RNA-Seq sensitivity estimation

Dear All,

I apologize if an existing answer to the following, basic question is somewhere buried in the forum - if yes, then a quick search did not reveal it.

I'm looking for the right statistical model to compute the required sequencing depth for detecting a rare isoform with a certain probability in RNA-Seq data. Or, in other words, I would like to compute the sensitivity of an RNA-Seq experiment for finding minority isoforms at a given sequencing depth and isoform characteristic.

The problem is closely related to differential expression analysis but I have serious problems combining the right models (poisson, betabin) at the right positions. Perhaps one of the statistically minded people working on RNA-Seq has an idea. Of course, partial solutions or caveats pointed out are also very welcome.

Here is a contrived example with rough numbers: Let's assume that I want to look for a rare isoform that only occurs as n (=10) of the N (=100,000) overal mRNA transcripts per cell. How many reads of length r (=100bp) do I need to sequence from my library derived from the total mRNA of k (=1,000,000) cells so that I will sequence at least m (=3) reads from my rare isoform at a probability of P >=p (=0.999)?

Bonus points: of course, a useful estimate may also depend on how easily I can distinguish the rare isoform from its more abundant brethren originating from the same gene. After all, I may receive reads from my rare isoform with probability P but only ones that are indistinguishable from the other isoforms since the isoforms are identical for most of the sequence. For simplicity, let's assume that all isoforms of the gene are L=(1000bp) long and can be differentiated from each other by one single stretch of length l=(200bp) which encodes an alternatively spliced exon und uniquely tags an isoform.

I realize this is a complex example, but perhaps it's not without merit. Also, who better to ask it than you guys. Anyways, thanks for any insights!

Cheers, Sven

--
Sven-Eric Schelhorn - http://mpi-inf.mpg.de/~sven
Max Planck Institute for Informatics, Saarbrücken
D3 - Computational Biology & Applied Algorithmics
schelhorn is offline   Reply With Quote
Old 02-14-2012, 10:56 AM   #2
amitm
Member
 
Location: Manchester, UK

Join Date: Feb 2011
Posts: 52
Thumbs up

very interesting question..
any replies please... :-)
amitm is offline   Reply With Quote
Old 02-24-2012, 06:32 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991
Default

You are halfway there, and a few extra preparation make things easier. First, how many cDNA fragments do we get from your N transcript molecules? Let's say, we fragment to lf=200 bp pieces, and the average length of the genes is L=1000 bp. Then, we should get roughly N' = N*L/lf = 500,000 cDNA molecules out of that. How many of these tell us about the isoform that you want to detect? If only a single stretch of l=200 bp is useful to ascertain that it is the isoform of interest and not another one, and if your gene has length 1000 bp, then it fragments, on average, into 5 pieces, only one of which is useful, i.e., we get n'=10 useful cDNA molecules which we have to fish out from N'=500,000 molecules. So, taking a random read, the probability that it is from the transcript stretch you are looking for is p = n'/N' = 1/50,000.

Note that we did not need the number of cells; we just assume that we have enough so that we can ignore the possibility that the few cells we are looking at happen to not contain the rare isoform, or that we lose it during sample prep because there are so few.

Hence, the answer is simple now: If you get a total of, say NR = 2,000,000 reads from your sequencing run, the probability that none of these contains the stretch of sequence that proves existence of your isoform, is given by (1-p)^NR=4e-18, i.e., it is nearly certain that you find it, using the numbers you suggested. This is because the expected number of reads from the stretch, p*NR, is 40, which is a lot. If n' were smaller, say, 1, this will look differently.

Finally, if you want to quantify the abundance n/N, the precision of your quantification is roughly 1/sqrt(p*NR), due to Poisson noise. Here 1/sqrt(40)=16%.
Simon Anders is offline   Reply With Quote
Old 02-24-2012, 07:53 PM   #4
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

I think it's time to write up a book of Simon's replies. I'm constantly stunned by the things I still do not understand but greatly appreciate being educated.
Jon_Keats is offline   Reply With Quote
Old 02-27-2012, 02:25 AM   #5
schelhorn
Member
 
Location: Germany

Join Date: Sep 2010
Posts: 10
Default

Thanks Simon for setting this straight for me. And I agree with Jon's suggestion. Your answers usually are both precise and comprehensive, which make you a great asset for this forum.
schelhorn is offline   Reply With Quote
Old 08-27-2013, 02:17 PM   #6
tantramarseq
Junior Member
 
Location: New Brunswick, Canada

Join Date: Aug 2013
Posts: 1
Default Thanks

Hi - an outdated thanks for this informative thread.
I am setting up some course notes and the arithmetic was helpful.
I think there is a term switch between the OP and Simon, using 'p' for two different factors (OP - desired statistical power; Simon - probability of a random read coming from the target transcript).
cheers, Doug
tantramarseq is offline   Reply With Quote
Reply

Tags
depth, isoform, rna-seq, sensitivity, statistics

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:31 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO