#1
Member
Location: Barcelona Join Date: Jan 2014
Posts: 16

I have two sets of genes, and I'd like to draw a boxplot and run a t-test to find out whether their expression levels differ significantly.
However, my t-test p-value changes depending on whether I use log10(FPKM+1) values or raw FPKM values. Why? Which should I choose? Thanks.

#2
Senior Member
Location: Marburg, Germany Join Date: Oct 2009
Posts: 110

A t-test depends on the effect size, and that obviously changes if you log-transform the data.
The general rule is to test on the data you measure; in this case, that would be the un-logged reads per million. Either way, you should not be testing on FPKM values: you lose the information about the number of reads behind each value, and more reads give a better estimate. Consider using a testing method designed specifically for RNA-seq data, such as DESeq.
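
To make the point about the transformation concrete, here is a minimal Python sketch (not from the thread). It computes Welch's t statistic on simulated, log-normally distributed "FPKM-like" values before and after log10(x + 1). The data, the names `set_a` and `set_b`, and the use of a normal approximation for the p-value are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and a large-sample two-sided p-value."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Assumption: with hundreds of values per group the t distribution is
    # close to the standard normal, so the normal CDF stands in for it.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical, simulated data: two skewed sets of 500 values each.
random.seed(0)
set_a = [random.lognormvariate(1.0, 1.5) for _ in range(500)]
set_b = [random.lognormvariate(1.3, 1.5) for _ in range(500)]

t_raw, p_raw = welch_t(set_a, set_b)
t_log, p_log = welch_t(
    [math.log10(x + 1) for x in set_a],
    [math.log10(x + 1) for x in set_b],
)

print(f"raw values:   t = {t_raw:.3f}, p = {p_raw:.3g}")
print(f"log10(x + 1): t = {t_log:.3f}, p = {p_log:.3g}")
```

The log changes both the difference in means and the variances, so the two runs test different quantities and there is no reason for their p-values to agree.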

#3
Senior Member
Location: Stanford Join Date: Jun 2009
Posts: 181

FPKM is just an intuitive transformation of fragment counts and is not suitable for use in statistical tests.
Fortunately, the software package that probably gave you the FPKM values, Cufflinks, also includes a program called cuffdiff that will do the test you want in a statistically rigorous way, based on modeling the actual fragment counts. Use that instead; don't try to use statistical tests that are unsuited to your data type on data that are unsuited to statistics.

#4
Member
Location: Barcelona Join Date: Jan 2014
Posts: 16

I do not need RNA-seq-specific normalization for what I want here. Both sets of genes (actually, I have transcripts) come from the same RNA-seq dataset (the same FASTA). One set is made up of coding transcripts and the other of putative lncRNAs. I just want to know which group of transcripts is more highly expressed.
What is your final conclusion?
Last edited by int11ap1; 07-17-2014 at 12:14 PM.

#5
Senior Member
Location: Stanford Join Date: Jun 2009
Posts: 181

My final conclusion is the same as before: you should use a valid hypothesis test on the count data, like cuffdiff, DESeq2, or edgeR, all of which are quite rigorous, commonly used, and well documented. Do not use an invalid hypothesis test on FPKMs. FPKM is a crude normalization and cannot be used in a meaningful statistical test. Asking us again is not going to change the way numbers work.

#6
Member
Location: Barcelona Join Date: Jan 2014
Posts: 16

But the methods you mention (edgeR and DESeq) are for normalization between different samples or RNA-seq datasets...

#7
Senior Member
Location: Stanford Join Date: Jun 2009
Posts: 181

No, you have it backwards: those methods are all for statistical hypothesis testing, and FPKM is a (crude, statistically inappropriate) normalization for comparing different samples.

#8
Member
Location: Barcelona Join Date: Jan 2014
Posts: 16

I do not follow you; sorry for asking again.
For example, I have 1000 FPKM values (from one RNA-seq sample) for 1000 transcripts. If I want to compare the first 500 transcripts with the second 500 (to see which set is more highly expressed), do I need to use edgeR or DESeq? What for?

#9
Senior Member
Location: Stanford Join Date: Jun 2009
Posts: 181

Ah, I see: you're comparing some genes with other genes in the same experiment, not the same gene across different experiments.
You can use FPKM values for this if you use a distribution-free test like Mann-Whitney-Wilcoxon, but that won't be very powerful. Otherwise, you could use a more effective normalization, like the variance-stabilizing transformation or regularized log in DESeq2, and then use a regular t-test.
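
As an illustration of the distribution-free option, below is a minimal standard-library Python sketch of the Mann-Whitney U test (not from the thread). It uses the large-sample normal approximation with a tie correction and no continuity correction. The simulated values and the names `coding` and `lncrna` are hypothetical stand-ins for the two transcript sets. In practice a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice.

```python
import math
import random
from statistics import NormalDist


def rank_with_ties(values):
    """Return 1-based ranks (ties get their average rank) and tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_u(a, b):
    """Return U for sample `a` and a two-sided normal-approximation p-value."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    ranks, tie_sizes = rank_with_ties(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    z = (u1 - n1 * n2 / 2) / sigma
    return u1, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical, simulated FPKM-like values for the two transcript sets.
random.seed(1)
coding = [random.lognormvariate(1.3, 1.5) for _ in range(500)]
lncrna = [random.lognormvariate(1.0, 1.5) for _ in range(500)]

u_raw, p_raw = mann_whitney_u(coding, lncrna)
u_log, p_log = mann_whitney_u(
    [math.log10(x + 1) for x in coding],
    [math.log10(x + 1) for x in lncrna],
)
print(f"raw values:   U = {u_raw}, p = {p_raw:.3g}")
print(f"log10(x + 1): U = {u_log}, p = {p_log:.3g}")
```

Because the test only uses ranks, and log10(x + 1) preserves the ordering of the values, the raw and log-transformed runs give the same U and p-value. The question of which scale to test on does not arise.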

#10
Member
Location: Barcelona Join Date: Jan 2014
Posts: 16

Got it, thanks!
Why not apply the t-test directly? Where can I learn about it?

#11
Senior Member
Location: Stanford Join Date: Jun 2009
Posts: 181

The t-test assumes the populations are normally distributed. FPKMs are not. http://en.wikipedia.org/wiki/Student's_t-test
A log transformation may seem to help, but it is still inappropriate because it fails to account for the heteroskedastic mean-variance dependency of read counts. DOI: 10.1111/j.2041-210X.2010.00021.x
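
A short derivation (not from the thread) of why the log does not remove the mean-variance dependency. It assumes the negative binomial count model commonly used for RNA-seq counts, with mean $\mu$ and dispersion $\alpha$ (both symbols are introduced here for illustration), and applies the first-order delta method to a count $K$:

```latex
\[
\operatorname{Var}(K) = \mu + \alpha\mu^{2}
\quad\Longrightarrow\quad
\operatorname{Var}(\log K) \approx \left(\frac{1}{\mu}\right)^{2}\operatorname{Var}(K)
= \frac{1}{\mu} + \alpha .
\]
```

Under these assumptions the variance on the log scale still depends on the mean: weakly expressed transcripts (small $\mu$) are noisier than strongly expressed ones, and the approximation itself breaks down near zero counts.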

#12
Member
Location: Barcelona Join Date: Jan 2014
Posts: 16

But the arithmetic mean of my FPKM values will be normally distributed according to the central limit theorem. In large samples such as mine, t.test for skewed distributions should be fine: http://stats.stackexchange.com/quest...ormal-when-n50

#13
Senior Member
Location: Stanford Join Date: Jun 2009
Posts: 181

Okay, you could do a normality test to verify that the t-test assumptions are met, but it would be more straightforward and rigorous to just use a better normalization.
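
One way such a normality check could look is sketched below in standard-library Python (not from the thread). It uses the Jarque-Bera test, chosen here only because it is simple to write out; the thread does not name a specific test. The data are simulated, and the p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom, which is itself an approximation.

```python
import math
import random
from statistics import mean


def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p = math.exp(-jb / 2)
    return jb, p


# Hypothetical, simulated FPKM-like values (heavily right-skewed).
random.seed(2)
fpkm = [random.lognormvariate(1.0, 1.5) for _ in range(500)]
logged = [math.log10(x + 1) for x in fpkm]

jb_raw, p_raw = jarque_bera(fpkm)
jb_log, p_log = jarque_bera(logged)
print(f"raw values:   JB = {jb_raw:.1f}, p = {p_raw:.3g}")
print(f"log10(x + 1): JB = {jb_log:.1f}, p = {p_log:.3g}")
```

A small p-value is evidence against normality of the values themselves. It does not settle whether the sample mean is close enough to normal for a t-test at a given sample size.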