Oof! I'm on the last leg of my first project here at UCLA in the Pellegrini Lab. I'm supposed to figure out which genes are differentially expressed (say, with a p-value < 0.001). I ran Fisher's Exact Test (FET) directly on the FPKM values (Fragments Per Kilobase of gene per Million reads); the calculation took about 5 seconds on my laptop and gave very significant results. It's a little disturbing that it was so easy.
So I started getting this inkling that maaaaybe I can't do the FET on FPKMs, as each gene is normalized by its own length, so the totals of the FPKMs fall out of proportion.
Question 1: Am I correct in thinking this?
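To make the worry concrete, here is a small Python sketch of the FPKM arithmetic. The gene names, lengths, and counts are made-up toy numbers, and `fpkm` is just a helper I wrote for illustration, not anything from a library. Both samples have the same total read count, yet the FPKM totals come out different, so FPKMs don't behave like the counts a 2x2 table expects.

```python
def fpkm(count, length_bp, total_reads):
    """Fragments per kilobase of gene per million reads."""
    return count * 1e9 / (length_bp * total_reads)

# Hypothetical toy data: gene -> (length in bp, count in sample A, count in sample B)
genes = {
    "short_gene": (500, 100, 300),
    "long_gene": (5000, 900, 700),
}

# Raw read totals are identical in the two samples.
total_a = sum(a for _, a, _ in genes.values())
total_b = sum(b for _, _, b in genes.values())

# Per-gene FPKM, normalized by each gene's own length.
fpkm_a = {g: fpkm(a, length, total_a) for g, (length, a, _) in genes.items()}
fpkm_b = {g: fpkm(b, length, total_b) for g, (length, _, b) in genes.items()}

# The FPKM totals no longer match, even though the read totals do.
sum_a = sum(fpkm_a.values())
sum_b = sum(fpkm_b.values())

print(total_a, total_b)  # 1000 1000
print(sum_a, sum_b)      # 380000.0 740000.0
```

Shifting reads from the long gene to the short one inflates the FPKM total, which is the "out of proportion" effect I'm suspicious of.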
If answer(Question 1) == yes,
I tried running the FET on the actual counts of each gene per million reads, and overnight, my computer calculated about 60 of them. The problem is, I have 27,644 genes I need to do the test for. I doubt even on a big cluster that I'll be able to get results in a reasonable amount of time.
Question 2: Does anybody have any suggestions or alternatives?
Also, as a side note, I'm running the p-value calculations in MATLAB using DCFisherextest.m, which handles 2x2 contingency tables using the identity log10(x!) = gammaln(x+1)/log(10). That is significantly faster than computing factorials directly and avoids overflow (MATLAB's factorial returns Inf for anything above 170, and my data is several orders of magnitude higher).
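For anyone who wants to see the log-gamma trick spelled out, here is a minimal Python sketch of a two-sided Fisher's Exact Test on a 2x2 table. This is my own illustration, not the code in DCFisherextest.m; the function names are hypothetical and `math.lgamma` stands in for MATLAB's `gammaln`. It sums the probabilities of all tables (with the same margins) that are no more likely than the observed one.

```python
import math

def log10_factorial(x):
    """log10(x!) via the log-gamma function; no overflow for large x."""
    return math.lgamma(x + 1) / math.log(10)

def log_table_prob(a, b, c, d):
    """Natural log of the hypergeometric probability of [[a, b], [c, d]]."""
    return (math.lgamma(a + b + 1) + math.lgamma(c + d + 1)
            + math.lgamma(a + c + 1) + math.lgamma(b + d + 1)
            - math.lgamma(a + 1) - math.lgamma(b + 1)
            - math.lgamma(c + 1) - math.lgamma(d + 1)
            - math.lgamma(a + b + c + d + 1))

def fisher_two_sided(a, b, c, d):
    """Two-sided p-value: sum over tables no more probable than the observed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo = max(0, row1 + col1 - n)   # smallest feasible top-left cell
    hi = min(row1, col1)           # largest feasible top-left cell
    log_obs = log_table_prob(a, b, c, d)
    p = 0.0
    for x in range(lo, hi + 1):
        lp = log_table_prob(x, row1 - x, col1 - x, n - row1 - col1 + x)
        # Small tolerance so ties with the observed table are counted.
        if lp <= log_obs + 1e-7:
            p += math.exp(lp)
    return min(p, 1.0)

print(fisher_two_sided(3, 1, 1, 3))
```

Note that the loop runs over every feasible value of the top-left cell, so the cost per gene grows with the smaller margin of the table. With very large counts, subtracting large log-gamma values can also lose some precision, so treat this as a sketch rather than production code.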
Thank you in advance!