Old 03-27-2013, 12:29 AM   #1
dlawrence
Junior Member
 
Location: South Australia

Join Date: Mar 2012
Posts: 8
Cuffdiff p-value distributions and replicates

Hi, has anyone looked at the distribution of Cuffdiff's p-values?

As I understand it, p-values should be uniformly distributed (i.e. a flat histogram) under the null hypothesis, and since only a very small fraction of genes actually change between control and treatment, the bulk of the histogram for most genomics data should be roughly flat, with an excess near 0 from the truly differential genes.
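
(As a quick sanity check of that expectation, here is a toy simulation, completely separate from cuffdiff: two-group t-tests on data with no real differences give a roughly flat p-value histogram.)

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# 10,000 "genes", two groups of 3 replicates drawn from the same distribution,
# so the null hypothesis is true for every test
p = [stats.ttest_ind(rng.normal(size=3), rng.normal(size=3)).pvalue
     for _ in range(10000)]

counts, _ = np.histogram(p, bins=20, range=(0, 1))
print(counts)  # roughly equal counts in every bin -> flat histogram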

Here are p-values from some of my cuffdiff gene_exp.diff files:



I noticed that there seems to be a pattern if you group by how many replicates you pass in:



I also re-ran a 3-replicate sample with just 1 replicate per condition and ended up with a graph similar to the 1-replicate ones above.

Looking at cuffdiff's differential.cpp:

Code:
            // Variance of log(FPKM) for each condition, using the delta-method
            // approximation Var(log X) ~= Var(X) / E[X]^2
            double curr_log_fpkm_var = (curr.FPKM_variance) / (curr.FPKM * curr.FPKM);
            double prev_log_fpkm_var = (prev.FPKM_variance) / (prev.FPKM * prev.FPKM);
            
            // Test statistic: log fold change divided by its standard error
            double numerator = log(prev.FPKM / curr.FPKM);
            
            double denominator = sqrt(prev_log_fpkm_var + curr_log_fpkm_var);
            stat = numerator / denominator;
It seems that more variance means a smaller test statistic and therefore a higher p-value, so if more replicates lead to a larger estimated FPKM variance, that would shift the whole p-value distribution away from 0...
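
As a rough illustration (a minimal sketch, not cuffdiff's actual code, and assuming the stat above is compared against a standard normal to get a p-value), here is how the two-sided p-value for a fixed fold change drifts upward as the FPKM variance grows:

Code:
import numpy as np
from scipy import stats

def approx_p_value(prev_fpkm, curr_fpkm, prev_var, curr_var):
    # Mirrors the snippet above; the normal approximation at the end is my
    # assumption here, not something taken from cuffdiff's source.
    log_var = prev_var / prev_fpkm**2 + curr_var / curr_fpkm**2
    stat = np.log(prev_fpkm / curr_fpkm) / np.sqrt(log_var)
    return 2 * stats.norm.sf(abs(stat))  # two-sided p-value

# Same 1.5-fold change throughout; only the variance estimate increases
for var in (10.0, 50.0, 200.0, 1000.0):
    print(var, approx_p_value(100.0, 150.0, var, var))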

Q: Is there something wrong here (with my samples, or with cuffdiff's treatment of p-values)?

I'm using Cuffdiff 2.0.1.

I'm only graphing genes where status == OK, and I remove those with p-value = 1, which come from a different code path in differential.cpp and don't use the test above. The graph code is basically:

Code:
import numpy as np
import pandas as pd
from matplotlib import pyplot

HISTOGRAM_BINS = 50  # bin count (arbitrary choice here)
gene_exp_df = pd.read_csv(gene_exp, sep='\t')  # gene_exp = path to a gene_exp.diff file
status_ok_df = gene_exp_df[gene_exp_df['status'] == 'OK']
p_values = status_ok_df[status_ok_df['p_value'] < 1]  # drop p = 1 (different code path)
(histogram, _) = np.histogram(p_values['p_value'], HISTOGRAM_BINS, range=(0, 1), density=True)
histogram = histogram[:-1]  # Last bin is massive and distorts the graph
bins = (np.arange(HISTOGRAM_BINS) / float(HISTOGRAM_BINS))[:-1]  # bin left edges, 0 -> 1
pyplot.plot(bins, histogram, '--', label=label)  # label = legend label for this run

Old 04-11-2013, 10:17 PM   #2
KAP
Member
 
Location: Adelaide, Australia

Join Date: Mar 2010
Posts: 12

Does anyone know the reason for this? I've been wondering about this too...
Old 04-11-2013, 11:08 PM   #3
rboettcher
Member
 
Location: Berlin

Join Date: Oct 2010
Posts: 71

I experienced the same behaviour (higher p-values with increasing group sizes) and therefore switched to edgeR for my differential expression analysis.
You should check out the DEXSeq paper, where Simon Anders et al. compare their method to cuffdiff (pre v2) and conclude that cuffdiff does not handle biological variability well.
This pretty much explains the whole problem, since more biological replicates also mean more variance within a group than a Poisson model can account for.
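
To make the Poisson point concrete, here is a toy numpy sketch (my own illustration, with made-up numbers): for Poisson counts the variance equals the mean, while counts across biological replicates behave more like a negative binomial, whose variance grows roughly as mean + dispersion * mean^2, which is the extra spread that edgeR/DESeq-style models account for.

Code:
import numpy as np

rng = np.random.default_rng(0)
mean, dispersion = 500.0, 0.1  # hypothetical mean count and NB dispersion
n = 100000                     # number of simulated observations

poisson = rng.poisson(mean, size=n)
# Negative binomial parameterised so that var = mean + dispersion * mean^2
p = 1.0 / (1.0 + dispersion * mean)
negbin = rng.negative_binomial(1.0 / dispersion, p, size=n)

print(poisson.mean(), poisson.var())  # ~500, ~500   (variance equals the mean)
print(negbin.mean(), negbin.var())    # ~500, ~25500 (far more spread)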

Cheers
Old 04-17-2013, 04:08 AM   #4
KAP
Member
 
Location: Adelaide, Australia

Join Date: Mar 2010
Posts: 12

Thanks a lot for your reply, rboettcher! I'll check out the paper...
Old 05-01-2013, 04:01 AM   #5
rboettcher
Member
 
Location: Berlin

Join Date: Oct 2010
Posts: 71

I just saw that Trapnell et al. released a new Cufflinks/Cuffdiff version (2.1, http://cufflinks.cbcb.umd.edu/index.html), and the release description addresses exactly the issues we discussed here. I have not tested the new release, so if anyone has results to share I'd be happy to see them.

Best
Old 07-10-2013, 06:16 AM   #6
rnaEni
Junior Member
 
Location: Germany

Join Date: Jul 2012
Posts: 1

Dear all,
I am using cufflinks version 2.1.1 and cummeRbund, and I still have the same problem with the p-values (and the q-values as well).
I tried to analyze promoters, splicing, tss, genes, cds and relcds for two different datasets, but the p-value stays the same for every ID.
Here is the output for the most significant splicing events of one dataset:
Code:
> sig_splicing
      TSS_group_id  gene_id      sample_1 sample_2 status value_1 value_2 JS_dist   test_stat p_value q_value  significant
1368  TSS11228      XLOC_005502  A        B        OK     0       0       0.832555  0         5e-05   0.029265 yes
1556  TSS11398      XLOC_005592  A        B        OK     0       0       0.689652  0         5e-05   0.029265 yes
5025  TSS1452       XLOC_000719  A        B        OK     0       0       0.460169  0         5e-05   0.029265 yes
7141  TSS16424      XLOC_008057  A        B        OK     0       0       0.224018  0         5e-05   0.029265 yes
10187 TSS19166      XLOC_009832  A        B        OK     0       0       0.684703  0         5e-05   0.029265 yes
11285 TSS20153      XLOC_010503  A        B        OK     0       0       0.661421  0         5e-05   0.029265 yes
15014 TSS2351       XLOC_001163  A        B        OK     0       0       0.627367  0         5e-05   0.029265 yes
21310 TSS29177      XLOC_015217  A        B        OK     0       0       0.458814  0         5e-05   0.029265 yes
23911 TSS31517      XLOC_016311  A        B        OK     0       0       0.404639  0         5e-05   0.029265 yes
29587 TSS36626      XLOC_019160  A        B        OK     0       0       0.535728  0         5e-05   0.029265 yes
34385 TSS40944      XLOC_020984  A        B        OK     0       0       0.660841  0         5e-05   0.029265 yes
38979 TSS45079      XLOC_023079  A        B        OK     0       0       0.531889  0         5e-05   0.029265 yes
39335 TSS454        XLOC_000234  A        B        OK     0       0       0.634366  0         5e-05   0.029265 yes
41484 TSS47334      XLOC_024205  A        B        OK     0       0       0.204522  0         5e-05   0.029265 yes
41494 TSS47343      XLOC_024207  A        B        OK     0       0       0.268069  0         5e-05   0.029265 yes
48942 TSS54046      XLOC_027685  A        B        OK     0       0       0.567654  0         5e-05   0.029265 yes
50520 TSS55467      XLOC_028292  A        B        OK     0       0       0.692903  0         5e-05   0.029265 yes
50918 TSS55825      XLOC_028432  A        B        OK     0       0       0.603009  0         5e-05   0.029265 yes
60068 TSS6406       XLOC_003242  A        B        OK     0       0       0.199754  0         5e-05   0.029265 yes
62395 TSS66154      XLOC_033418  A        B        OK     0       0       0.603494  0         5e-05   0.029265 yes

If this bug has already been fixed, please let me know.
Regards
Old 03-12-2014, 10:18 AM   #7
haggardd
Junior Member
 
Location: Oregon

Join Date: Apr 2013
Posts: 5

^^bump for rnaEni

I have a similar problem (?) about CuffDiff output.

I have 50 bp single-end RNA-seq data from a control and a treatment, with 3 replicates per condition. I ran Cufflinks > Cuffmerge > CuffDiff and get a similar result: the only significant genes, about 42 in total, all have the most significant p-value CuffDiff reports (5E-05), with a q-value around 0.043. Every other gene (even one with a p-value of 0.0001) does not have a significant q-value (in this specific case, the q-value is 0.07). Does this make sense? I would expect to find more significant genes, but maybe I'm just being too optimistic about this treatment.

On another note, if I add older RNA-seq samples (with replicates) from the same study design to the cuffdiff run (using the -L option to do all pairwise comparisons), the number of significant genes jumps to about 210. Does this also make sense? I would not expect the number to jump that much, because I would think the multiple-testing correction would be harsher when CuffDiff is running far more DE tests across 4 conditions instead of just 2.

Thoughts?

Thanks
Old 03-12-2014, 03:51 PM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Quote:
Originally Posted by haggardd
...the only significant genes, a total of about 42, have the most significant P-value CuffDiff allows (5E-05) and the q-value is 0.043 or something. Every other gene (even if it has a p-value of 0.0001) does not have a significant q-value (in this specific case, I get a q-value of 0.07). Does this make sense?
Sure, a p-value of 0.0001 isn't very likely to be significant when you consider that you're running tens of thousands of tests at the same time. That's kind of the point of accounting for multiple testing.

Quote:
I would expect to find more significant genes, but maybe I'm just trying to be too optimistic for this treatment.
There are a lot of reasons not to find as many changes as you expect, the most likely simply being that naive expectations aren't very reliable. There's also the fact that we don't know anything about your experimental design, which may have its own issues.

Quote:
On another note, if I add samples (with replicates) with old RNA-seq data with the same study design, to the cuffdiff run (with -L option to do all pairwise comparisons) the number of significant genes jumps to about 210. Does this also make sense?
Sure, you not only have group but also batch differences. The latter can be pretty large.
Old 03-12-2014, 08:19 PM   #9
haggardd
Junior Member
 
Location: Oregon

Join Date: Apr 2013
Posts: 5

Thanks for the speedy reply dpryan.

I understand how multiple testing corrections work but I just found it odd that genes with a p-value that small would not have a significant q-value. This was primarily based on my previous experience with other RNA-seq data sets where a p-value of even 0.01 had a significant q-value (<0.05). But every dataset is different...

Also, what I didn't mention is that I keep seeing many genes with exactly the same p- and q-values. Here are some examples (sorry it looks messy):

Code:
gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_042470 8:35520190-35520547 A B OK 835.767 0 #NAME? #NAME? 5.00E-05 0.0413262 yes
XLOC_045591 Zv9_NA634:7034-12375 A B OK 17.5896 0 #NAME? #NAME? 5.00E-05 0.0413262 yes
XLOC_003075 10:9797294-9798871 A B OK 2.29584 3.71833 0.695633 2.65798 0.0001 0.0723208 no
XLOC_004091 11:13161030-13168864 A B OK 1.80821 0.342578 -2.40005 -4.05064 0.0001 0.0723208 no
XLOC_008401 14:24980494-25008061 A B OK 1.51264 2.56794 0.763543 2.27828 0.0001 0.0723208 no
XLOC_039714 7:75191965-75197308 A B OK 17.7657 23.718 0.416886 2.45736 0.0001 0.0723208 no
XLOC_042443 8:32215711-32217186 A B OK 21.0697 28.2631 0.423748 2.59296 0.0001 0.0723208 no
XLOC_044102 9:23214760-23215741 A B OK 156.312 118.895 -0.394742 -2.33721 0.0001 0.0723208 no
XLOC_008940 14:31506583-31512448 A B OK 1.99747 1.38924 -0.523873 -2.09315 0.00015 0.0982472 no
XLOC_021634 20:44509173-44509769 A B OK 1.96707 4.6672 1.24651 3.83803 0.00015 0.0982472 no
XLOC_024274 22:7561581-7578214 A B OK 1.88902 3.07738 0.704064 2.46213 0.00015 0.0982472 no
XLOC_034952 5:66144043-66144594 A B OK 0 5.13361 inf #NAME? 0.00015 0.0982472 no
XLOC_041687 8:25134454-25148890 A B OK 3.58671 5.59972 0.642694 1.98236 0.00015 0.0982472 no
XLOC_019070 2:36641516-36653086 A B OK 1.98706 3.04699 0.616751 1.95934 0.0002 0.119703 no
XLOC_024666 22:37648352-37682613 A B OK 4.41966 6.45222 0.545861 2.50274 0.0002 0.119703 no
XLOC_025299 23:17954114-17955768 A B OK 9.64916 13.021 0.432364 1.79451 0.0002 0.119703 no
XLOC_026936 24:37059294-37061674 A B OK 14.8685 11.1599 -0.413935 -2.099 0.0002 0.119703 no
XLOC_037155 6:8161171-8185521 A B OK 3.49529 4.95005 0.502031 2.12399 0.0002 0.119703 no
XLOC_000264 1:24035505-24073635 A B OK 4.04379 6.0967 0.592321 1.9311 0.00025 0.139976 no
XLOC_010116 15:454559-541027 A B OK 662.216 163.827 -2.01512 -33.7345 0.00025 0.139976 no
XLOC_026603 23:46192824-46193823 A B OK 3.84477 6.04824 0.653617 2.71933 0.00025 0.139976 no
XLOC_042098 8:3896110-3969017 A B OK 4.34482 7.91808 0.865855 2.36129 0.00025 0.139976 no
XLOC_043936 9:4736133-4862966 A B OK 3.07409 4.28154 0.477969 1.98661 0.00035 0.192856 no

Now I know things have changed in Cufflinks 2.1.1, which estimates p-values differently from previous versions, but it just seems weird to see so many genes (and this isn't all of them) with identical p- and q-values. It might just be a result of this change in the new version of Cufflinks; if so, please let me know. Again, past experience drives this feeling, since I've never seen that many duplicate p- or q-values before...

I might not have given an adequate account of the third comment so I'll try and explain myself better.

In my original post I said that running CuffDiff with just the control and treatment samples (let's say A1, A2, A3 and B1, B2, B3) gave ~40 genes with a q-value < 0.05. If I add more data to the CuffDiff run by including older RNA-seq data (let's call these C1, C2, C3 and D1, D2, D3), use the -L option to do all possible pairwise comparisons (mainly out of curiosity), and again look only at the A-vs-B comparison, my significant gene list jumps to ~200. And now genes with a p-value of even 0.01 have significant q-values (albeit around 0.04). It just seems odd.

I guess what I am trying to say is that I would expect increasing the total number of significance tests (jumping from ~32,000 to over 100,000) to lead to a much more stringent FDR when testing for differential expression. Unless I am completely off kilter (which is definitely likely) about how CuffDiff treats the FDR calculation when it does a bunch of pairwise comparisons. Could it be that what drives the change is simply that I've added more data, so CuffDiff now estimates the pooled gene dispersion more accurately and the variance between samples decreases?

Maybe I'm just talking myself in circles. I dunno

Again, thanks for all the help and for keeping it constructive.
Old 03-13-2014, 05:33 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

I agree that it seems odd to have so many identical p-values. You'd have to look into the internals of cufflinks to see whether this is reasonable or not (or just use something else, like DESeq2 or edgeR, which are easier to debug).

Regarding the FDR adjustment, remember that the BH method depends on the distribution of p-values, so the more DE genes you have, the lower the bar ends up being for a raw p-value to still be significant after adjustment. Consequently, a raw p-value of 0.01 might be significant after adjustment in one experiment (perhaps you're looking at cancer) and not in another (almost anything I work on at the moment). Similarly, batch effects will often introduce a good number of differences, which will change the distribution and thereby the threshold for what still ends up being significant after correction.
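
To see this numerically, here is a generic Benjamini-Hochberg sketch in numpy (nothing cuffdiff-specific; the numbers are made up): the BH cutoff for the p-value at rank k out of m tests is (k/m)*FDR, so a raw p-value of 0.01 only survives the adjustment when enough smaller p-values come before it in the ranking.

Code:
import numpy as np

def bh_rejections(p, fdr=0.05):
    # Benjamini-Hochberg: reject the k smallest p-values, where k is the
    # largest rank with p_(k) <= (k / m) * fdr
    p = np.sort(np.asarray(p))
    m = len(p)
    thresholds = np.arange(1, m + 1) / m * fdr
    passing = np.nonzero(p <= thresholds)[0]
    return 0 if passing.size == 0 else passing[-1] + 1

background = np.linspace(0.05, 1.0, 9000)  # unremarkable tests
few_hits = np.concatenate([np.full(10, 1e-5), [0.01], background])
many_hits = np.concatenate([np.full(3000, 1e-5), [0.01], background])
print(bh_rejections(few_hits))   # 10   -> the 0.01 test is not significant
print(bh_rejections(many_hits))  # 3001 -> the same 0.01 test now is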

Wipe memories of Bonferroni (where what you expected would have been the case) from your mind

Regarding the dispersion calculations, how cuffdiff ends up doing this depends on how you set things up. By default it uses a pooled estimate, so the estimate will change as groups are added. If you used "per-condition" then this probably wouldn't be an issue (of course you then need more replicates per condition to get a good estimate, so...).
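
For reference, newer cuffdiff releases expose this choice as an option (I believe the flag is --dispersion-method with values like pooled or per-condition in cuffdiff >= 2.1; check cuffdiff --help for your version). A hypothetical invocation, reusing haggardd's A/B sample names:

Code:
cuffdiff --dispersion-method per-condition -o diff_out -p 8 merged.gtf \
    A1.bam,A2.bam,A3.bam B1.bam,B2.bam,B3.bam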
Old 08-24-2015, 04:04 AM   #11
wanfahmi
Member
 
Location: North Sea

Join Date: Apr 2008
Posts: 34

Hello

I faced a similar problem, but how can it be solved?

I just ran cuffdiff across 30 human tissues to detect genes that are differentially expressed across tissues. The splicing.diff output reports differential splice variants between pairs of tissues, across all 30 tissues, is that right? But why, when I filter for 'yes' (significant), is the p-value for most of the genes 5e-5 (0.00005)? Is that really the output from cuffdiff? How can I get the real p-values so that I can rank the genes and build a gene list from them?

my command line was: $ cuffdiff -o diffout -b ../genome.fa -p 8 -L <a list of 30 human tissues> -u merged_asm/merged.gtf <list of bam files>