SEQanswers Statistical model for RNA-Seq sensitivity estimation
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post Newsbot! Literature Watch 0 05-24-2011 02:00 AM Newsbot! Literature Watch 0 03-26-2011 02:02 AM Newsbot! Literature Watch 0 03-10-2011 03:00 AM Newsbot! Literature Watch 12 12-20-2010 09:10 PM Newsbot! Literature Watch 0 07-31-2010 02:40 AM

 02-14-2012, 07:50 AM #1 schelhorn Member   Location: Germany Join Date: Sep 2010 Posts: 10 Statistical model for RNA-Seq sensitivity estimation Dear All, I apologize if an existing answer to the following, basic question is somewhere buried in the forum - if yes, then a quick search did not reveal it. I'm looking for the right statistical model to compute the required sequencing depth for detecting a rare isoform with a certain probability in RNA-Seq data. Or, in other words, I would like to compute the sensitivity of an RNA-Seq experiment for finding minority isoforms at a given sequencing depth and isoform characteristic. The problem is closely related to differential expression analysis but I have serious problems combining the right models (poisson, betabin) at the right positions. Perhaps one of the statistically minded people working on RNA-Seq has an idea. Of course, partial solutions or caveats pointed out are also very welcome. Here is a contrived example with rough numbers: Let's assume that I want to look for a rare isoform that only occurs as n (=10) of the N (=100,000) overal mRNA transcripts per cell. How many reads of length r (=100bp) do I need to sequence from my library derived from the total mRNA of k (=1,000,000) cells so that I will sequence at least m (=3) reads from my rare isoform at a probability of P >=p (=0.999)? Bonus points: of course, a useful estimate may also depend on how easily I can distinguish the rare isoform from its more abundant brethren originating from the same gene. After all, I may receive reads from my rare isoform with probability P but only ones that are indistinguishable from the other isoforms since the isoforms are identical for most of the sequence. For simplicity, let's assume that all isoforms of the gene are L=(1000bp) long and can be differentiated from each other by one single stretch of length l=(200bp) which encodes an alternatively spliced exon und uniquely tags an isoform. I realize this is a complex example, but perhaps it's not without merit. Also, who better to ask it than you guys. Anyways, thanks for any insights! Cheers, Sven -- Sven-Eric Schelhorn - http://mpi-inf.mpg.de/~sven Max Planck Institute for Informatics, Saarbrücken D3 - Computational Biology & Applied Algorithmics
 02-14-2012, 10:56 AM #2 amitm Member   Location: Manchester, UK Join Date: Feb 2011 Posts: 52 very interesting question.. any replies please... :-)
 02-24-2012, 06:32 AM #3 Simon Anders Senior Member   Location: Heidelberg, Germany Join Date: Feb 2010 Posts: 991 You are halfway there, and a few extra preparation make things easier. First, how many cDNA fragments do we get from your N transcript molecules? Let's say, we fragment to lf=200 bp pieces, and the average length of the genes is L=1000 bp. Then, we should get roughly N' = N*L/lf = 500,000 cDNA molecules out of that. How many of these tell us about the isoform that you want to detect? If only a single stretch of l=200 bp is useful to ascertain that it is the isoform of interest and not another one, and if your gene has length 1000 bp, then it fragments, on average, into 5 pieces, only one of which is useful, i.e., we get n'=10 useful cDNA molecules which we have to fish out from N'=500,000 molecules. So, taking a random read, the probability that it is from the transcript stretch you are looking for is p = n'/N' = 1/50,000. Note that we did not need the number of cells; we just assume that we have enough so that we can ignore the possibility that the few cells we are looking at happen to not contain the rare isoform, or that we lose it during sample prep because there are so few. Hence, the answer is simple now: If you get a total of, say NR = 2,000,000 reads from your sequencing run, the probability that none of these contains the stretch of sequence that proves existence of your isoform, is given by (1-p)^NR=4e-18, i.e., it is nearly certain that you find it, using the numbers you suggested. This is because the expected number of reads from the stretch, p*NR, is 40, which is a lot. If n' were smaller, say, 1, this will look differently. Finally, if you want to quantify the abundance n/N, the precision of your quantification is roughly 1/sqrt(p*NR), due to Poisson noise. Here 1/sqrt(40)=16%.
 02-24-2012, 07:53 PM #4 Jon_Keats Senior Member   Location: Phoenix, AZ Join Date: Mar 2010 Posts: 279 I think it's time to write up a book of Simon's replies. I'm constantly stunned by the things I still do not understand but greatly appreciate being educated.
 02-27-2012, 02:25 AM #5 schelhorn Member   Location: Germany Join Date: Sep 2010 Posts: 10 Thanks Simon for setting this straight for me. And I agree with Jon's suggestion. Your answers usually are both precise and comprehensive, which make you a great asset for this forum.
 08-27-2013, 02:17 PM #6 tantramarseq Junior Member   Location: New Brunswick, Canada Join Date: Aug 2013 Posts: 1 Thanks Hi - an outdated thanks for this informative thread. I am setting up some course notes and the arithmetic was helpful. I think there is a term switch between the OP and Simon, using 'p' for two different factors (OP - desired statistical power; Simon - probability of a random read coming from the target transcript). cheers, Doug

 Tags depth, isoform, rna-seq, sensitivity, statistics