SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Different fpkm values for cuffdiff and cuffcompare madsaan Bioinformatics 3 12-12-2012 04:14 PM
Different FPKM values of cufflinks and cuffdiff mrfox Bioinformatics 5 10-17-2012 01:10 PM
Cufflinks and cuffdiff FPKM values combiochem Bioinformatics 12 10-13-2012 11:37 PM
Combining FPKM values for a gene john_nl Bioinformatics 5 02-15-2012 11:28 PM
FPKM minimum and maximum values ? repinementer Bioinformatics 2 11-18-2010 09:51 PM

Reply
 
Thread Tools
Old 10-01-2011, 10:07 AM   #1
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Question Calculating p-values from FPKM?

Oof! I'm on the last leg of my first project here at UCLA in the Pellegrini Lab. I'm supposed to figure out which genes are differentially expressed (say, with a p-value < 0.001). I used Fisher's Exact Test to calculate directly from the FPKM (Fragments Per Kilobase of gene per Million reads), and the calculation took about 5 seconds on my laptop, and I got very significant results. Very disturbing that it was so easy.

So I started getting this inkling that maaaaybe I can't do the FET on FPKMs, as each gene is normalized by its own length, so the totals of the FPKMs fall out of proportion.

Question 1: Am I correct in thinking this?

If answer(Question 1) == yes,
I tried running the FET on the actual counts of each gene per million reads, and overnight, my computer calculated about 60 of them. The problem is, I have 27,644 genes I need to do the test for. I doubt even on a big cluster that I'll be able to get results in a reasonable amount of time.

Question 2: Does anybody have any suggestions or alternatives?

Also, as a side note, I'm running the p-value calculations in MATLAB using DCFisherextest.m, which runs 2x2 contingency tables using the approximation log10(x!) =~ gammaln(x+1)/log(10), which is significantly faster and more accurate than direct factorials (MATLAB does not take anything higher than 170, and my data is several orders of magnitude higher).

Thank you in advance!
Artur Jaroszewicz is offline   Reply With Quote
Old 10-01-2011, 10:20 AM   #2
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Not exactly an update, but I've posted the script to our public server. You can retrieve it at http://pellegrini.mcdb.ucla.edu/Public/DCFisherextest.m
Artur Jaroszewicz is offline   Reply With Quote
Old 10-01-2011, 11:43 AM   #3
stoker
Member
 
Location: Poland

Join Date: Oct 2010
Posts: 17
Default

It would be nice if you could write the number of samples you have - then it would be easier to say if the test you apply is proper. Differential fpkm analysis does not take much time - similarly as microarray differential analysis.

Do you calculate FPKM values with cufflinks?
My suggestion would be to apply Student's ttest, with some kind of correction - taking into account both - "0" FPKMs for some genes and fold changes of particular genes.

This would be ok if you have, lets say 8 versus 8 samples at least.

Maybe someone has better ideas? I would gladly take into consideration some suggestions on this topic.
__________________
Tomasz Stokowy
www.sequencing.io.gliwice.pl
stoker is offline   Reply With Quote
Old 10-02-2011, 12:14 AM   #4
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Yes, I calculated the FPKM with cufflinks. It's actually only two samples. I was afraid that I might need a very powerful significance test, but my department head actually just suggested doing a Poisson Test -- taking the lower expression of a single gene, calculating the Poisson parameter, and finding the likelihood of the larger expression of that gene.

Student's test, though? What kind of correction are you talking about with the fold changes? I'm curious as to how that can be applied. Thanks for the reply!
Artur Jaroszewicz is offline   Reply With Quote
Old 10-02-2011, 01:55 AM   #5
stoker
Member
 
Location: Poland

Join Date: Oct 2010
Posts: 17
Default

Since you have only 2 samples, I think there is no place for statistics - you can not discuss about any population phenomena just on the base of 2 samples. What you can do best in my opinion is just calculation of fold changes. This will lead you to conclusion "expression of gene in a sample is higher than in the other one". To start any statistics i would suggest to have at least 5vs5 for "small number of samples tests, FET or Dixon's" and 8vs8 after outliers filtering for other statistics.

Tomasz
__________________
Tomasz Stokowy
www.sequencing.io.gliwice.pl
stoker is offline   Reply With Quote
Old 10-02-2011, 08:37 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Despite the fact that it is done often, Fisher's exact test is unsuitable to test for differential expression, for reasons we've discussed in more than one thread here.

However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from.

On a more constructive side, here are some references how to do it:

Robinson MD and Smyth GK (2007). Moderated statistical tests for
assessing differences in tag abundance. Bioinformatics 23, 2881-2887

Simon Anders and Wolfgang Huber (2010): Differential expression
analysis for sequence count data. Genome Biology 11:R106

Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor
package for differential expression analysis of digital gene
expression data. Bioinformatics 26, 139-140

Thomas J Hardcastle and Krystyna A Kelly (2010): baySeq: Empirical
Bayesian methods for identifying differential expression in sequence
count data. BMC Bioinformatics 11:422
Simon Anders is offline   Reply With Quote
Old 10-02-2011, 10:29 AM   #7
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Quote:
Since you have only 2 samples, I think there is no place for statistics - you can not discuss about any population phenomena just on the base of 2 samples. What you can do best in my opinion is just calculation of fold changes. This will lead you to conclusion "expression of gene in a sample is higher than in the other one".
Well, I'm tempted to disagree with you on this point. The Illumina does so many reads, it's almost impossible not to derive and statistics from two samples. And yes, "expression of gene in a sample is higher than in the other one"... You're right, we need to be very careful as to what we do here. Differential expression does not exactly say how much of a difference, but I think we can pretty safely assume that it can be taken directly from the fold changes, as long as the expressions of the genes are not so low as to have a high p-value.

Quote:
Despite the fact that it is done often, Fisher's exact test is unsuitable to test for differential expression, for reasons we've discussed in more than one thread here.
Thank you. I've found the thread and am studying it now.

Quote:
However, using Fisher's test on FPKM values rather than raw counts is, in a way, doubly wrong, because the whole point of Fisher's test, where appropriate, is to account for counting noise. So, just out of curiosity: where did you get this idea from.
Fisher's Exact Test is what we have been using by default in our lab. Ultimately, I don't really see why it's unsuitable. It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?). I do understand, however, that it while it will give you a sense of whether something is expressed differently, it doesn't say HOW differently. On the other hand, you can -essentially- determine that from the p-value, plus you can easily find the fold change.
I definitely agree with you on the using of raw counts rather than FPKM values. As long as you use a fitting test, it shouldn't matter whether you use raw counts or counts per million.

Thank you both for your help!

Artur
Artur Jaroszewicz is offline   Reply With Quote
Old 10-02-2011, 10:58 AM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Fisher's Exact Test is what we have been using by default in our lab.
Your lab is not the only one who does this, but I sincerely hope referees and editors will finally learn to reject all these papers. Most of them do, by now, as the news that it is wrong seem to finally spread.

Quote:
It bypasses all the worry of having to calculate a variance (how do you get a good value with two samples?)
But this is exactly the problem. Fisher's test does not bypass this problem. It explicitly assumes that the dispersion (the coefficient of extra-Poisson variation) is zero, and this is inappropriate. It's only because this zero does not appear in the formula that people assume that the test does not need a variance estimation; it does: unless you know the dispersion to be zero, you cannot use the test. The tests mentioned in the papers above become (approximately) equal to the Fisher's test once you set the dispersion to zero there.

See this paper for a more thorough explanation:

Baggerly, K. A., Deng, L., Morris, J. S. & Aldaz, C. M. 2003 Differential expres-
sion in SAGE: accounting for normal between-library variation. Bioinformatics,
19(12), 1477–1483. (doi:10.1093/bioinformatics/btg173)

Quote:
... it doesn't say HOW differently. On the other hand, you can -essentially- determine that from the p-value, plus you can easily find the fold change.
A p value should never be used as in indicator of the magnitude of an effect. Significance and effect size are not the same. Use the fold change, or some shrinkage estimate of it.
Simon Anders is offline   Reply With Quote
Old 10-13-2011, 09:46 AM   #9
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

After speaking to my professor..
Of course you need to calculate p-values with the FPKMs! I am rather new in this field, and did not realize that the RNA strands are 'chopped up' before they're sequence. Makes total sense.
Simon, thank you for all your help. You are totally right about the p-values vs effect (fold change). Do you recommend using a Poisson test instead of FET?
Artur Jaroszewicz is offline   Reply With Quote
Old 10-13-2011, 09:48 AM   #10
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Also, I am a little bit at a loss as to what to do after finding differential expression of genes, p-values, and maybe some pathways.. Does anybody have a link to a thread about what to do after those analyses are done?
I've been asked to do a heat map. That shouldn't be too hard. What else?
Artur Jaroszewicz is offline   Reply With Quote
Old 10-14-2011, 12:42 PM   #11
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by Artur Jaroszewicz View Post
After speaking to my professor..
Of course you need to calculate p-values with the FPKMs! I am rather new in this field, and did not realize that the RNA strands are 'chopped up' before they're sequence. Makes total sense.
Simon, thank you for all your help. You are totally right about the p-values vs effect (fold change). Do you recommend using a Poisson test instead of FET?
To repeat:

1. Both Fisher's exact test and the Poisson likelihood ratio test are based on the assumption that they are given raw integer counts. Using them with FPKM values is incorrect.

2. Both these tests neglect overdispersion (sample-to-sample variation not due to counting noise but due to actual differences between replicate sample). Hence, neither of them should be used for differential expression analysis. (In fact, the Poisson LRT can be seen as an approxmiation to the FET.)

For explanations of these points, please consult a stochastics textbooks, or, for the second point, see this paper:

Baggerly, K. A., Deng, L., Morris, J. S. & Aldaz, C. M. (2003):
Differential expression in SAGE: accounting for normal between-library variation.
Bioinformatics, 19(12), 14771483. (doi:10.1093/bioinformatics/btg173)

For correct methods, see the papers listed in post #6.
Simon Anders is offline   Reply With Quote
Old 10-06-2012, 09:01 PM   #12
aslihan
Member
 
Location: USA

Join Date: Jun 2011
Posts: 23
Default

Hello, I created log2(FPKM tissueA/ FPKM tissueB) table for control 6sample and disease 7samples and would like to see significant differential genes.

log2(fpkmA/fpkmB) CTRL1.....6 log2(fpkmA/fpkmB) DISEASE1....7

gene1
gene2
gene3

......

Which statistical test should I apply for my case?

Thank you very much.
aslihan is offline   Reply With Quote
Old 10-06-2012, 09:19 PM   #13
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Hi aslihan,

I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now.

What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If you have a pipeline that doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson, et. al's EdgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say EdgeR is better for handling single replicates, but I have not used it.
Artur Jaroszewicz is offline   Reply With Quote
Old 10-17-2012, 01:00 PM   #14
IBseq
Member
 
Location: uk

Join Date: Jul 2012
Posts: 56
Default Calculating p-values from FPKM and more...

Quote:
Originally Posted by Artur Jaroszewicz View Post
Hi aslihan,

I return to this thread a completely different man. Thank you Simon Anders for helping me earlier. I actually use both HTSeq and DESeq in my analysis now.

What package are you using to calculate your RPKM/FPKM values? If you are using Cufflinks, you'd probably want to use CuffDiff/CuffCompare to calculate the p-values. If you have a pipeline that doesn't employ Cufflinks, you may want to try Simon Anders' (above) DESeq or Robinson, et. al's EdgeR. DESeq is nice if you have replicates, but returns rather conservative p-values when there's only one replicate. People say EdgeR is better for handling single replicates, but I have not used it.

Hi Artur,
I read with interest your last conversation abt the statistic to be used.
I am new to this filed and very new to the statistic field, thus I am completely lost!

here my issue: I have two samples, treated and untreated. I have done the illumina sequencing and then analysed the data via the tophat/cufflinks/cuffcompare/cuffdiff pipeline.

My question: how to find the de genes?I was suggesrted that in absebce of replicates it does not make sense to do a test statistic, but CUFFDIFF DOES. does it mean that the results are not reliable?

Shall I rank all my genes by fold change and then take top one as most differentially expressed?But then, how can i know if this difference is significant?

also, some genes have fpkm=0....the fold change is sometimes very very high, but this does not reflect the truth...how to deal with this??

Last question: when you run cuffcompare, do you give as input both cufflinks .gtf files from yuur treated and untreated (+/_ a reference genome as you wish) or just one .gtf files?

thanks in advance for any help (specially the statistical part!)

ib
IBseq is offline   Reply With Quote
Old 10-17-2012, 02:36 PM   #15
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Hi IBseq,

Even though CuffDiff may give p-values, these are not really to be trusted. It doesn't make sense to compare simply one replicate of each. Here's an analogy: proving cats run faster than dogs by taking one cat and one dog, and racing them. Yes, it is an overly simplified, but you get the idea. You have no way of knowing if a difference is due to natural biological variation or due to differences between the samples. The best you can do is sort by fold change or by top RPKM, and look at differences there.

If you decide that you do want to do differential testing, you should be using (in your pipeline) cufflinks -> cuffcompare -> cuffdiff. Cuffcompare basically creates a single combined gtf file which you can use for cuffdiff. Use this only if you are doing a de novo alignment. If you're using a reference gtf file, use that.

If you have an RPKM (or 'FPKM') of zero, it means it's not expressed, so you have an infinite fold change. I recommend using a log base 2 ratio, but adding, e.g., 0.001 to each value, to make sure you don't take the log of zero or dividing by zero.

Best of luck,

Artur
Artur Jaroszewicz is offline   Reply With Quote
Old 10-18-2012, 01:50 AM   #16
IBseq
Member
 
Location: uk

Join Date: Jul 2012
Posts: 56
Default

Hi Artur,
yes, that makes sense. Thus I shall not take in consideration the p/q values and their consequent significance.

I did follow the tophat/cufflinks/cuffcompare/cuffdiff pipeline, but I had a doubt and ran it twice with a difference. I shall outline that I have one control sample and 7 treated samples but I am not using them as replicates, thus I ran them separately each of them against the control.

1st run: tophat.cufflinks with refence annotation; then cuffcompare with the cufflinks of all 7 samples and control sample plus reference annotation. I used this output of couffcompare as input for 7 cuffdiffs (one for each of the 7 treated samples).

2nd run: tophat/cufflinks with reference annotation, then cuffcompare with cufflinks of each of the treated samples with reference annotation (thus I have 7 cuffcompare outputs). I used one by one each of these cuffcompare outputs to run 7 cuffdiffs.

My question: I have more consistent results when I run a cuffcompare which includes all my cufflinks (control + 7 treated) rather than cifflink control+cufflink treated one by one. Similarly, I do have more interesting results if I run 7 different cuffdiff rather thn one cuffdiff by taking my 7 treated samples as replicates (these 7 samples are 7 people biopsis of the same tumor)

which way is best? I can use the first run and take it as reliable?

sorry for long question...Hope you can help.

thanks,
ib
IBseq is offline   Reply With Quote
Old 10-25-2012, 12:04 PM   #17
tankman
Member
 
Location: usa

Join Date: Sep 2012
Posts: 22
Default interpretation of FPKM tracking files at gene level

hi guys,

quick question for you cuffdiff-output experts:

I'm having trouble understanding why the gene level fpkm tracking file shouldn't be unique in the gene short name. That's what I thought the description entailed (sums over isoforms to get gene level FPKM matrix) from the cufflinks website:

genes.fpkm_tracking Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id


Instead, e.g., I get

users-mac-pro:cuffdiff_SE_plus_PE_trimmed_all_samples user$ grep ADAR genes.fpkm_tracking | cut -f 1-6
XLOC_000826 - - XLOC_000826 ADAR TSS1470
XLOC_002045 - - XLOC_002045 ADAR TSS3725,TSS3726,TSS3727,TSS3728
XLOC_003087 - - XLOC_003087 ADARB2 TSS5153
XLOC_003599 - - XLOC_003599 ADARB2 TSS6095,TSS6096,TSS6097,TSS6098
XLOC_019202 - - XLOC_019202 ADARB1 TSS32453,TSS32454,TSS32455
XLOC_019369 - - XLOC_019369 ADARB1 TSS32746


All the TSS id's correspond to the same physical locus. I am misunderstanding why gene short id is different than gene id. Can you please help me interpret this data?

thanks,
tankman
tankman is offline   Reply With Quote
Reply

Tags
fisher test, fisher's exact test, fpkm

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:40 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO