SEQanswers Cuffdiff FPKM and test statistic calculations
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post madsaan Bioinformatics 3 12-12-2012 04:14 PM mrfox Bioinformatics 5 10-17-2012 01:10 PM peromhc Bioinformatics 6 10-17-2012 01:07 PM combiochem Bioinformatics 12 10-13-2012 11:37 PM tdm Bioinformatics 0 01-31-2011 12:58 PM

 11-14-2011, 02:04 PM #1 PRingler Junior Member   Location: Nevada Join Date: Mar 2011 Posts: 1 Cuffdiff FPKM and test statistic calculations Hello, I have been lurking here a bit, but haven't posted yet, so bear with me. I am trying to recreate the calculations of FPKM and the test statistic performed by uffdiff on two sample genes in order to better understand how my RNA-seq data has been tested (and thus better explain my data). I have been reading through the supplemental information from the paper linked on the Cufflinks website (Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation; Nature Biotechnology 28,511–515(2010)doi:10.1038/nbt.1621), and I have been able to get through part of the calculations, but am still stumped. To calculate FPKM, I am using the following formula: (10^9)*Xg*(gamma t) ~l(t)*M Where: Xg is the number of fragments aligned to the gene locus (g), gamma t is the maximum likelihood estimate of the probability of selecting a fragment from a transcript (t) coming from that gene locus ~l(t) is the adjusted transcript length Σ from i=1 to t(i)) [F(i)(l(t)-i+1)] F(i) = probability that the fragment has a length i l(t) = the transcript length M is the total number of fragments mapped in that sample My sample is run with single-end reads 50 bp long, so I simplified the ~l(t) to the transcript length - 49, and we ran cuffdiff using a GTF file that contains only the loci for one transcript for each gene, so I think gamma should be 1. One of my lab members wrote a program that assigns raw reads to individual genes, so I am using the reads from this program for the raw reads in the FPKM calculation. Using this data I get pretty close to the FPKM calculated: Code: ```gene Reads A Reads B Cuff A Cuff B Calc A Calc B A 2 81 0.021956 0.823028 0.02116 0.795515 B 1 40 0.007541 0.279201 0.007374 0.273778``` The difference is most likely from my assumption that the ~l(t) is only looking at the 50 bp read length, or the assumption that gamma is 1. From this, I get into much more complicated math when trying to recreate the test statistic. The formula for the test statistic, from the supplemental data, is: [log(Xa)+log(gamma a)+ log(Mb) - log(Xb) - log(gamma b) - log (Ma)] SQRT[ (psi a*(1+Xa)*(gamma a)^2)/(Xa*(gamma a)^2)+ (psi b*(1+Xb)*(gamma b)^2)/(Xb*(gamma b)^2)] Where psi is a variance-covariance matrix that estimates the covariance between gamma (tk) and gamma (tl). As far as I can tell, tk and tl come from Tophat, which splits a read of length l into two smaller reads of length k in order to align across splice junctions. (This may be completely off base) When I run the equations using these assumptions, though, I get a value far off from the test statistic calculated by Cuffdiff: Code: ```Gene T stat Cufflinks t-stat A 0.99128963 -3.42295 B 0.898803344 -3.07139``` So, I suppose my question is, is there a way for me to calculate (or even estimate) the psi function? Am I making faulty assumptions? I apologize if this question has been addressed in a previous thread. I did try to find the answer in the archives before asking here. Thank you
 07-03-2012, 10:29 AM #2 Portah Member   Location: Cincinnati Join Date: Jan 2012 Posts: 14 Does some one has an answer for the second part of the question ?
 10-16-2012, 02:47 AM #3 IBseq Member   Location: uk Join Date: Jul 2012 Posts: 56 Hi PRinlgler, did you come up with a solution? I'm also quite new and tried to understand what is the mathematical process from cufflinks to cuffdiff...why are the fpkm in cufflinks differnt from those given in cuffdiff?? thanks, ib