
02-14-2012, 08:50 AM  #1 
Member
Location: Germany Join Date: Sep 2010
Posts: 10

Statistical model for RNA-Seq sensitivity estimation
Dear All,
I apologize if an answer to this basic question is already buried somewhere in the forum; if so, a quick search did not reveal it.
I'm looking for the right statistical model to compute the sequencing depth required to detect a rare isoform with a given probability in RNA-Seq data. In other words, I would like to compute the sensitivity of an RNA-Seq experiment for finding minority isoforms at a given sequencing depth and for given isoform characteristics. The problem is closely related to differential expression analysis, but I am having serious trouble combining the right models (Poisson, beta-binomial) in the right places. Perhaps one of the statistically minded people working on RNA-Seq has an idea. Partial solutions and caveats are also very welcome.
Here is a contrived example with rough numbers: let's assume that I want to look for a rare isoform that accounts for only n (=10) of the N (=100,000) overall mRNA transcripts per cell. How many reads of length r (=100 bp) do I need to sequence from a library derived from the total mRNA of k (=1,000,000) cells so that I obtain at least m (=3) reads from my rare isoform with probability P >= p (=0.999)?
Bonus points: a useful estimate may also depend on how easily I can distinguish the rare isoform from its more abundant brethren originating from the same gene. After all, I may receive reads from my rare isoform with probability P, but only ones that are indistinguishable from the other isoforms, since the isoforms are identical over most of their sequence. For simplicity, let's assume that all isoforms of the gene are L (=1000 bp) long and can be differentiated from each other by one single stretch of length l (=200 bp), which encodes an alternatively spliced exon and uniquely tags an isoform.
I realize this is a complex example, but perhaps it's not without merit. Also, who better to ask than you guys. Anyway, thanks for any insights! 
Cheers, Sven  SvenEric Schelhorn  http://mpiinf.mpg.de/~sven Max Planck Institute for Informatics, Saarbrücken D3  Computational Biology & Applied Algorithmics 
02-14-2012, 11:56 AM  #2 
Member
Location: Manchester, UK Join Date: Feb 2011
Posts: 52

very interesting question..
any replies please... :) 
02-24-2012, 07:32 AM  #3 
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 993

You are halfway there, and a few extra preparatory steps make things easier. First, how many cDNA fragments do we get from your N transcript molecules? Let's say we fragment to pieces of lf=200 bp, and the average transcript length is L=1000 bp. Then we should get roughly N' = N*L/lf = 500,000 cDNA molecules out of that. How many of these tell us about the isoform that you want to detect? If only a single stretch of l=200 bp is useful to ascertain that it is the isoform of interest and not another one, and if your gene has length 1000 bp, then each transcript fragments, on average, into 5 pieces, only one of which is useful. The n=10 copies of the rare isoform therefore yield about 50 fragments, of which n'=10 are useful cDNA molecules that we have to fish out from N'=500,000 molecules. So, taking a random read, the probability that it is from the transcript stretch you are looking for is p = n'/N' = 1/50,000.
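The bookkeeping above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the post: the function name `per_read_probability` is made up here, and treating the number of informative fragments per transcript as l/lf is a simplifying assumption (it ignores fragments that only partly overlap the isoform-specific stretch).

```python
from fractions import Fraction


def per_read_probability(n, N, L, lf, l):
    """Chance that one random read comes from the isoform-specific stretch.

    n  : copies of the rare isoform per cell
    N  : total transcripts per cell
    L  : transcript length in bp (assumed equal for all transcripts)
    lf : fragment length in bp
    l  : length of the stretch that uniquely tags the isoform, in bp
    """
    fragments_per_transcript = Fraction(L, lf)        # 1000 / 200 = 5
    total_fragments = N * fragments_per_transcript    # N' in the post
    # Assumption: l / lf of a transcript's fragments are informative.
    useful_fragments = n * Fraction(l, lf)            # n' in the post
    return useful_fragments / total_fragments


p = per_read_probability(n=10, N=100_000, L=1000, lf=200, l=200)
print(p)  # 1/50000
```

Using `Fraction` keeps the result exact, so it can be compared directly with the 1/50,000 quoted above.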
Note that we did not need the number of cells; we just assume that we have enough so that we can ignore the possibility that the few cells we are looking at happen not to contain the rare isoform, or that we lose it during sample prep because there are so few. Hence, the answer is simple now: if you get a total of, say, NR = 2,000,000 reads from your sequencing run, the probability that none of these contains the stretch of sequence that proves the existence of your isoform is given by (1-p)^NR = 4e-18, i.e., it is nearly certain that you find it, using the numbers you suggested. This is because the expected number of reads from the stretch, p*NR, is 40, which is a lot. If n' were smaller, say 1, this would look different. Finally, if you want to quantify the abundance n/N, the relative precision of your quantification is roughly 1/sqrt(p*NR), due to Poisson noise. Here 1/sqrt(40) = 16%. 
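The numbers above, and the original "at least m=3 reads with probability 0.999" question, can be checked with a short Python sketch. It assumes the count of informative reads is Poisson with mean p*NR, which is an approximation to the binomial that the (1-p)^NR formula implies; the helper names `prob_at_least` and `min_reads` are hypothetical and not from the thread.

```python
import math

p = 1 / 50_000          # per-read probability from the post
num_reads = 2_000_000   # NR in the post

# Probability that no read hits the informative stretch: (1 - p)^NR.
# log1p keeps the computation accurate for tiny p.
p_zero = math.exp(num_reads * math.log1p(-p))

# Relative precision from Poisson noise: 1 / sqrt(expected count).
rel_precision = 1 / math.sqrt(p * num_reads)


def prob_at_least(m, p, num_reads):
    """P(X >= m) for X ~ Poisson(p * num_reads) (assumed model)."""
    lam = p * num_reads
    below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(m))
    return 1.0 - below


def min_reads(m, p, target):
    """Smallest read count with P(X >= m) >= target (doubling, then bisection)."""
    hi = 1
    while prob_at_least(m, p, hi) < target:
        hi *= 2
    lo = hi // 2  # prob_at_least(lo) < target <= prob_at_least(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if prob_at_least(m, p, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


needed = min_reads(m=3, p=p, target=0.999)
print(f"P(no informative read) = {p_zero:.2e}")
print(f"relative precision     = {rel_precision:.1%}")
print(f"reads needed for m=3   = {needed}")
```

Under these assumptions the required depth for m=3 at 0.999 comes out at a little over half a million reads, i.e. about 11 expected informative reads, so the 2,000,000 reads in the example leave a comfortable margin.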
02-24-2012, 08:53 PM  #4 
Senior Member
Location: Phoenix, AZ Join Date: Mar 2010
Posts: 279

I think it's time to write up a book of Simon's replies. I'm constantly stunned by the things I still do not understand but greatly appreciate being educated.

02-27-2012, 03:25 AM  #5 
Member
Location: Germany Join Date: Sep 2010
Posts: 10

Thanks, Simon, for setting this straight for me. And I agree with Jon's suggestion. Your answers are usually both precise and comprehensive, which makes you a great asset for this forum.

08-27-2013, 03:17 PM  #6 
Junior Member
Location: New Brunswick, Canada Join Date: Aug 2013
Posts: 1

Thanks
Hi, a belated thanks for this informative thread.
I am setting up some course notes and the arithmetic was helpful. I think there is a term switch between the OP and Simon, who use 'p' for two different quantities (OP: desired statistical power, i.e. the target detection probability; Simon: probability of a random read coming from the target transcript).
cheers, Doug 
Tags 
depth, isoform, rnaseq, sensitivity, statistics 

