Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
 Similar Threads Thread Thread Starter Forum Replies Last Post madsaan Bioinformatics 3 12-12-2012 04:14 PM mrfox Bioinformatics 5 10-17-2012 01:10 PM combiochem Bioinformatics 12 10-13-2012 11:37 PM john_nl Bioinformatics 5 02-15-2012 11:28 PM repinementer Bioinformatics 2 11-18-2010 09:51 PM

 10-01-2011, 10:07 AM #1 Artur Jaroszewicz Member   Location: Los Angeles Join Date: Sep 2011 Posts: 45 Calculating p-values from FPKM? Oof! I'm on the last leg of my first project here at UCLA in the Pellegrini Lab. I'm supposed to figure out which genes are differentially expressed (say, with a p-value < 0.001). I used Fisher's Exact Test to calculate directly from the FPKM (Fragments Per Kilobase of gene per Million reads), and the calculation took about 5 seconds on my laptop, and I got very significant results. Very disturbing that it was so easy. So I started getting this inkling that maaaaybe I can't do the FET on FPKMs, as each gene is normalized by its own length, so the totals of the FPKMs fall out of proportion. Question 1: Am I correct in thinking this? If answer(Question 1) == yes, I tried running the FET on the actual counts of each gene per million reads, and overnight, my computer calculated about 60 of them. The problem is, I have 27,644 genes I need to do the test for. I doubt even on a big cluster that I'll be able to get results in a reasonable amount of time. Question 2: Does anybody have any suggestions or alternatives? Also, as a side note, I'm running the p-value calculations in MATLAB using DCFisherextest.m, which runs 2x2 contingency tables using the approximation log10(x!) =~ gammaln(x+1)/log(10), which is significantly faster and more accurate than direct factorials (MATLAB does not take anything higher than 170, and my data is several orders of magnitude higher). Thank you in advance!
 10-01-2011, 10:20 AM #2 Artur Jaroszewicz Member   Location: Los Angeles Join Date: Sep 2011 Posts: 45 Not exactly an update, but I've posted the script to our public server. You can retrieve it at http://pellegrini.mcdb.ucla.edu/Public/DCFisherextest.m
 10-01-2011, 11:43 AM #3 stoker Member   Location: Poland Join Date: Oct 2010 Posts: 17 It would be nice if you could write the number of samples you have - then it would be easier to say if the test you apply is proper. Differential fpkm analysis does not take much time - similarly as microarray differential analysis. Do you calculate FPKM values with cufflinks? My suggestion would be to apply Student's ttest, with some kind of correction - taking into account both - "0" FPKMs for some genes and fold changes of particular genes. This would be ok if you have, lets say 8 versus 8 samples at least. Maybe someone has better ideas? I would gladly take into consideration some suggestions on this topic. __________________ Tomasz Stokowy www.sequencing.io.gliwice.pl
 10-02-2011, 12:14 AM #4 Artur Jaroszewicz Member   Location: Los Angeles Join Date: Sep 2011 Posts: 45 Yes, I calculated the FPKM with cufflinks. It's actually only two samples. I was afraid that I might need a very powerful significance test, but my department head actually just suggested doing a Poisson Test -- taking the lower expression of a single gene, calculating the Poisson parameter, and finding the likelihood of the larger expression of that gene. Student's test, though? What kind of correction are you talking about with the fold changes? I'm curious as to how that can be applied. Thanks for the reply!
 10-02-2011, 01:55 AM #5 stoker Member   Location: Poland Join Date: Oct 2010 Posts: 17 Since you have only 2 samples, I think there is no place for statistics - you can not discuss about any population phenomena just on the base of 2 samples. What you can do best in my opinion is just calculation of fold changes. This will lead you to conclusion "expression of gene in a sample is higher than in the other one". To start any statistics i would suggest to have at least 5vs5 for "small number of samples tests, FET or Dixon's" and 8vs8 after outliers filtering for other statistics. Tomasz __________________ Tomasz Stokowy www.sequencing.io.gliwice.pl
 10-02-2011, 08:37 AM #6 Simon Anders Senior Member   Location: Heidelberg, Germany Join Date: Feb 2010 Posts: 994 Despite the fact that it is done often, Fisher's exact test is unsuitable to test for differential expression, for reasons we've discussed in more than one thread here. However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from. On a more constructive side, here are some references how to do it: Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887 Simon Anders and Wolfgang Huber (2010): Differential expression analysis for sequence count data. Genome Biology 11:R106 Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140 Thomas J Hardcastle and Krystyna A Kelly (2010): baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11:422
10-02-2011, 10:29 AM   #7
Artur Jaroszewicz
Member

Location: Los Angeles

Join Date: Sep 2011
Posts: 45

Quote:
 Since you have only 2 samples, I think there is no place for statistics - you can not discuss about any population phenomena just on the base of 2 samples. What you can do best in my opinion is just calculation of fold changes. This will lead you to conclusion "expression of gene in a sample is higher than in the other one".
Well, I'm tempted to disagree with you on this point. The Illumina does so many reads, it's almost impossible not to derive and statistics from two samples. And yes, "expression of gene in a sample is higher than in the other one"... You're right, we need to be very careful as to what we do here. Differential expression does not exactly say how much of a difference, but I think we can pretty safely assume that it can be taken directly from the fold changes, as long as the expressions of the genes are not so low as to have a high p-value.

Quote:
 Despite the fact that it is done often, Fisher's exact test is unsuitable to test for differential expression, for reasons we've discussed in more than one thread here.
Thank you. I've found the thread and am studying it now.

Quote:
 However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from.
Fisher's Exact Test is what we have been using by default in our lab. Ultimately, I don't really see why it's unsuitable. It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?). I do understand, however, that it while it will give you a sense of whether something is expressed differently, it doesn't say HOW differently. On the other hand, you can -essentially- determine that from the p-value, plus you can easily find the fold change.
I definitely agree with you on the using of raw counts rather than FPKM values. As long as you use a fitting test, it shouldn't matter whether you use raw counts or counts per million.

Thank you both for your help!

Artur

10-02-2011, 10:58 AM   #8
Simon Anders
Senior Member

Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Quote:
 Fisher's Exact Test is what we have been using by default in our lab.
Your lab is not the only one who does this, but I sincerely hope referees and editors will finally learn to reject all these papers. Most of them do, by now, as the news that it is wrong seem to finally spread.

Quote:
 It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?)
But this is exactly the problem. Fisher's test does not bypass this problem. It explicitly assumes that the dispersion (the coefficient of extra-Poisson variation) is zero, and this is inappropriate. It's only because this zero does not appear in the formula that people assume that the test does not need a variance estimation; it does: unless you know the dispersion to be zero, you cannot use the test. The tests mentioned in the papers above become (approximately) equal to the Fisher's test once you set the dispersion to zero there.

See this paper for a more thorough explanation:

Baggerly, K. A., Deng, L., Morris, J. S. & Aldaz, C. M. 2003 Differential expres-
sion in SAGE: accounting for normal between-library variation. Bioinformatics,
19(12), 1477–1483. (doi:10.1093/bioinformatics/btg173)

Quote:
 ... it doesn't say HOW differently. On the other hand, you can -essentially- determine that from the p-value, plus you can easily find the fold change.
A p value should never be used as in indicator of the magnitude of an effect. Significance and effect size are not the same. Use the fold change, or some shrinkage estimate of it.

 10-13-2011, 09:46 AM #9 Artur Jaroszewicz Member   Location: Los Angeles Join Date: Sep 2011 Posts: 45 After speaking to my professor.. Of course you need to calculate p-values with the FPKMs! I am rather new in this field, and did not realize that the RNA strands are 'chopped up' before they're sequence. Makes total sense. Simon, thank you for all your help. You are totally right about the p-values vs effect (fold change). Do you recommend using a Poisson test instead of FET?
 10-13-2011, 09:48 AM #10 Artur Jaroszewicz Member   Location: Los Angeles Join Date: Sep 2011 Posts: 45 Also, I am a little bit at a loss as to what to do after finding differential expression of genes, p-values, and maybe some pathways.. Does anybody have a link to a thread about what to do after those analyses are done? I've been asked to do a heat map. That shouldn't be too hard. What else?
10-14-2011, 12:42 PM   #11
Simon Anders
Senior Member

Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Quote:
 Originally Posted by Artur Jaroszewicz After speaking to my professor.. Of course you need to calculate p-values with the FPKMs! I am rather new in this field, and did not realize that the RNA strands are 'chopped up' before they're sequence. Makes total sense. Simon, thank you for all your help. You are totally right about the p-values vs effect (fold change). Do you recommend using a Poisson test instead of FET?
To repeat:

1. Both Fisher's exact test and the Poisson likelihood ratio test are based on the assumption that they are given raw integer counts. Using them with FPKM values is incorrect.

2. Both these tests neglect overdispersion (sample-to-sample variation not due to counting noise but due to actual differences between replicate sample). Hence, neither of them should be used for differential expression analysis. (In fact, the Poisson LRT can be seen as an approxmiation to the FET.)

For explanations of these points, please consult a stochastics textbooks, or, for the second point, see this paper:

Baggerly, K. A., Deng, L., Morris, J. S. & Aldaz, C. M. (2003):
Differential expression in SAGE: accounting for normal between-library variation.
Bioinformatics, 19(12), 1477–1483. (doi:10.1093/bioinformatics/btg173)

For correct methods, see the papers listed in post #6.

 10-06-2012, 09:01 PM #12 aslihan Member   Location: USA Join Date: Jun 2011 Posts: 23 Hello, I created log2(FPKM tissueA/ FPKM tissueB) table for control 6sample and disease 7samples and would like to see significant differential genes. log2(fpkmA/fpkmB) CTRL1.....6 log2(fpkmA/fpkmB) DISEASE1....7 gene1 gene2 gene3 ...... Which statistical test should I apply for my case? Thank you very much.
 10-06-2012, 09:19 PM #13 Artur Jaroszewicz Member   Location: Los Angeles Join Date: Sep 2011 Posts: 45 Hi aslihan, I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now. What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If you have a pipeline that doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson, et. al's EdgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say EdgeR is better for handling single replicates, but I have not used it.
10-17-2012, 01:00 PM   #14
IBseq
Member

Location: uk

Join Date: Jul 2012
Posts: 56
Calculating p-values from FPKM and more...

Quote:
 Originally Posted by Artur Jaroszewicz Hi aslihan, I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now. What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If you have a pipeline that doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson, et. al's EdgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say EdgeR is better for handling single replicates, but I have not used it.

Hi Artur,
I read with interest your last conversation abt the statistic to be used.
I am new to this filed and very new to the statistic field, thus I am completely lost!

here my issue: I have two samples, treated and untreated. I have done the illumina sequencing and then analysed the data via the tophat/cufflinks/cuffcompare/cuffdiff pipeline.

My question: how to find the de genes?I was suggesrted that in absebce of replicates it does not make sense to do a test statistic, but CUFFDIFF DOES. does it mean that the results are not reliable?

Shall I rank all my genes by fold change and then take top one as most differentially expressed?But then, how can i know if this difference is significant?

also, some genes have fpkm=0....the fold change is sometimes very very high, but this does not reflect the truth...how to deal with this??

Last question: when you run cuffcompare, do you give as input both cufflinks .gtf files from yuur treated and untreated (+/_ a reference genome as you wish) or just one .gtf files?

thanks in advance for any help (specially the statistical part!)

ib

 10-17-2012, 02:36 PM #15 Artur Jaroszewicz Member   Location: Los Angeles Join Date: Sep 2011 Posts: 45 Hi IBseq, Even though CuffDiff may give p-values, these are not really to be trusted. It doesn't make sense to compare simply one replicate of each. Here's an analogy: proving cats run faster than dogs by taking one cat and one dog, and racing them. Yes, it is an overly simplified, but you get the idea. You have no way of knowing if a difference is due to natural biological variation or due to differences between the samples. The best you can do is sort by fold change or by top RPKM, and look at differences there. If you decide that you do want to do differential testing, you should be using (in your pipeline) cufflinks -> cuffcompare -> cuffdiff. Cuffcompare basically creates a single combined gtf file which you can use for cuffdiff. Use this only if you are doing a de novo alignment. If you're using a reference gtf file, use that. If you have an RPKM (or 'FPKM') of zero, it means it's not expressed, so you have an infinite fold change. I recommend using a log base 2 ratio, but adding, e.g., 0.001 to each value, to make sure you don't take the log of zero or dividing by zero. Best of luck, Artur
 10-18-2012, 01:50 AM #16 IBseq Member   Location: uk Join Date: Jul 2012 Posts: 56 Hi Artur, yes, that makes sense. Thus I shall not take in consideration the p/q values and their consequent significance. I did follow the tophat/cufflinks/cuffcompare/cuffdiff pipeline, but I had a doubt and ran it twice with a difference. I shall outline that I have one control sample and 7 treated samples but I am not using them as replicates, thus I ran them separately each of them against the control. 1st run: tophat.cufflinks with refence annotation; then cuffcompare with the cufflinks of all 7 samples and control sample plus reference annotation. I used this output of couffcompare as input for 7 cuffdiffs (one for each of the 7 treated samples). 2nd run: tophat/cufflinks with reference annotation, then cuffcompare with cufflinks of each of the treated samples with reference annotation (thus I have 7 cuffcompare outputs). I used one by one each of these cuffcompare outputs to run 7 cuffdiffs. My question: I have more consistent results when I run a cuffcompare which includes all my cufflinks (control + 7 treated) rather than cifflink control+cufflink treated one by one. Similarly, I do have more interesting results if I run 7 different cuffdiff rather thn one cuffdiff by taking my 7 treated samples as replicates (these 7 samples are 7 people biopsis of the same tumor) which way is best? I can use the first run and take it as reliable? sorry for long question...Hope you can help. thanks, ib

 Tags fisher test, fisher's exact test, fpkm