Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cuffdiff p-value distributions and replicates

    Hi, has anyone looked at the distribution of Cuffdiff's p-values?

    As I understand it, p-values should be evenly distributed (ie flat histogram) for the null hypothesis, which should be the case for most genomics data (as only a very small number of genes change between control/treatment).

    Here are p-values from some of my cuffdiff gene_exp.diff files:



    I noticed that there seems to be a pattern if you group by how many replicates you pass in:



    I also re-ran a 3 rep sample with just 1 rep per condition and ended up with a similar graph to the 1-rep ones above.

    Looking at cuffdiff's differential.cpp:

    Code:
                double curr_log_fpkm_var = (curr.FPKM_variance) / (curr.FPKM * curr.FPKM);
                double prev_log_fpkm_var = (prev.FPKM_variance) / (prev.FPKM * prev.FPKM);
                
                double numerator = log(prev.FPKM / curr.FPKM);
                
                double denominator = sqrt(prev_log_fpkm_var + curr_log_fpkm_var);
                stat = numerator / denominator;
    It seems that more variance -> higher p-values, so if more reps = higher variance, this is shifting the values away from 0...

    Q: Is there something wrong here? (with my samples or cuffdiff's treatment of p-values?)

    I'm using Cuffdiff 2.0.1

    I'm only graphing genes where status == OK, and remove those with p-value=1, which comes from a different path in differential.cpp and doesn't use the t-tests. The graph code is basically:

    Code:
        gene_exp_df = pd.DataFrame.from_csv(gene_exp, sep='\t')
        status_ok_df = gene_exp_df[gene_exp_df['status'] == 'OK']
        (histogram, _) = np.histogram(p_values['p_value'], HISTOGRAM_BINS, density=True)
        histogram = histogram[:-1] # Last column is massive and distorts graph
        bins = (np.arange(HISTOGRAM_BINS) / float(HISTOGRAM_BINS))[:-1] # 0->1
        pyplot.plot(bins, histogram, '--', label=label)
    Last edited by dlawrence; 03-27-2013, 12:52 AM.

  • #2
    Does anyone know the reason for this? I've been wondering about this too...

    Comment


    • #3
      I experienced the same behaviour (higher p-values for increasing group sizes) and therefore switched to edgeR for my differential expression analysis.
      You should check out the DEXSeq paper, where Simon Anders et al. compare their method to cuffdiff (pre v2) and conclude that cuffdiff is not able to handle biological variability well.
      This pretty much explains the whole problem, since more biological replicates also mean a higher variance within one group that can not be explained by Poisson.

      Cheers

      Comment


      • #4
        Thanks a lot for your reply, rboettcher! I'll check out the paper...

        Comment


        • #5
          I just saw that Trapnell et al. released a new Cufflinks/Cuffdiff version (2.1 http://cufflinks.cbcb.umd.edu/index.html) and address exactly the issues we discussed here in their description. I have not tested the new release, so if anyone has some results to share I'd be happy to see them.

          Best

          Comment


          • #6
            Dear all,
            I am using cufflinks version 2.1.1 and cummeRbund, and I still have the same problem with p-value(and q-value as well).
            I tried to analyze promoter, splicing, tss, genes,cds, relcds of two different datasets but the p-value remains the same for every ids.
            I post here the output of most signicant splicing of one dataset:
            > sig_splicing
            TSS_group_id gene_id sample_1 sample_2 status value_1 value_2
            1368 TSS11228 XLOC_005502 A B OK 0 0
            1556 TSS11398 XLOC_005592 A B OK 0 0
            5025 TSS1452 XLOC_000719 A B OK 0 0
            7141 TSS16424 XLOC_008057 A B OK 0 0
            10187 TSS19166 XLOC_009832 A B OK 0 0
            11285 TSS20153 XLOC_010503 A B OK 0 0
            15014 TSS2351 XLOC_001163 A B OK 0 0
            21310 TSS29177 XLOC_015217 A B OK 0 0
            23911 TSS31517 XLOC_016311 A B OK 0 0
            29587 TSS36626 XLOC_019160 A B OK 0 0
            34385 TSS40944 XLOC_020984 A B OK 0 0
            38979 TSS45079 XLOC_023079 A B OK 0 0
            39335 TSS454 XLOC_000234 A B OK 0 0
            41484 TSS47334 XLOC_024205 A B OK 0 0
            41494 TSS47343 XLOC_024207 A B OK 0 0
            48942 TSS54046 XLOC_027685 A B OK 0 0
            50520 TSS55467 XLOC_028292 A B OK 0 0
            50918 TSS55825 XLOC_028432 A B OK 0 0
            60068 TSS6406 XLOC_003242 A B OK 0 0
            62395 TSS66154 XLOC_033418 A B OK 0 0
            JS_dist test_stat p_value q_value significant
            1368 0.832555 0 5e-05 0.029265 yes
            1556 0.689652 0 5e-05 0.029265 yes
            5025 0.460169 0 5e-05 0.029265 yes
            7141 0.224018 0 5e-05 0.029265 yes
            10187 0.684703 0 5e-05 0.029265 yes
            11285 0.661421 0 5e-05 0.029265 yes
            15014 0.627367 0 5e-05 0.029265 yes
            21310 0.458814 0 5e-05 0.029265 yes
            23911 0.404639 0 5e-05 0.029265 yes
            29587 0.535728 0 5e-05 0.029265 yes
            34385 0.660841 0 5e-05 0.029265 yes
            38979 0.531889 0 5e-05 0.029265 yes
            39335 0.634366 0 5e-05 0.029265 yes
            41484 0.204522 0 5e-05 0.029265 yes
            41494 0.268069 0 5e-05 0.029265 yes
            48942 0.567654 0 5e-05 0.029265 yes
            50520 0.692903 0 5e-05 0.029265 yes
            50918 0.603009 0 5e-05 0.029265 yes
            60068 0.199754 0 5e-05 0.029265 yes
            62395 0.603494 0 5e-05 0.029265 yes

            If you already fixed this bug please let me know.
            Regards

            Comment


            • #7
              ^^bump for rnaEni

              I have a similar problem (?) about CuffDiff output.

              I have 50bp single-end RNA-seq data from a control and treatment. I have 3 replicates per treatment. I ran Cufflinks>Cuffmerge>CuffDiff and I get a similar result where the only significant genes, a total of about 42, have the most significant P-value CuffDiff allows (5E-05) and the q-value is 0.043 or something. Every other gene (even if it has a p-value of 0.0001) does not have a significant q-value (in this specific case, I get a q-value of 0.07). Does this make sense? I would expect to find more significant genes, but maybe I'm just trying to be too optimistic for this treatment.

              On another note, if I add samples (with replicates) with old RNA-seq data with the same study design, to the cuffdiff run (with -L option to do all pairwise comparisons) the number of significant genes jumps to about 210. Does this also make sense? I would not expect the number to jump up that much because I would think the multiple test correction would be larger since CuffDiff is running way more DE tests with 4 samples instead of just 2.

              Thoughts?

              Thanks

              Comment


              • #8
                Originally posted by haggardd View Post
                ...the only significant genes, a total of about 42, have the most significant P-value CuffDiff allows (5E-05) and the q-value is 0.043 or something. Every other gene (even if it has a p-value of 0.0001) does not have a significant q-value (in this specific case, I get a q-value of 0.07). Does this make sense?
                Sure, a p-value of 0.0001 isn't very likely to be significant when you consider that you're running 10's of thousands of tests at the same time. That's kind of the point of trying to account for multiple testing.

                I would expect to find more significant genes, but maybe I'm just trying to be too optimistic for this treatment.
                There are a lot of reason to not find as many changes as you expect, the most likely simply being that naive expectations aren't very reliable. There's also the fact that we don't know anything about your experimental design, which may have its own issues.

                On another note, if I add samples (with replicates) with old RNA-seq data with the same study design, to the cuffdiff run (with -L option to do all pairwise comparisons) the number of significant genes jumps to about 210. Does this also make sense?
                Sure, you not only have group but also batch differences. The latter can be pretty large.

                Comment


                • #9
                  Thanks for the speedy reply dpryan.

                  I understand how multiple testing corrections work but I just found it odd that genes with a p-value that small would not have a significant q-value. This was primarily based on my previous experience with other RNA-seq data sets where a p-value of even 0.01 had a significant q-value (<0.05). But every dataset is different...

                  Also, what I didn't mention is that I keep seeing many genes with the same exact p- and q-value. Here's some examples (sorry it looks messy):

                  gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
                  XLOC_042470 8:35520190-35520547 A B OK 835.767 0 #NAME? #NAME? 5.00E-05 0.0413262 yes
                  XLOC_045591 Zv9_NA634:7034-12375 A B OK 17.5896 0 #NAME? #NAME? 5.00E-05 0.0413262 yes
                  XLOC_003075 10:9797294-9798871 A B OK 2.29584 3.71833 0.695633 2.65798 0.0001 0.0723208 no
                  XLOC_004091 11:13161030-13168864 A B OK 1.80821 0.342578 -2.40005 -4.05064 0.0001 0.0723208 no
                  XLOC_008401 14:24980494-25008061 A B OK 1.51264 2.56794 0.763543 2.27828 0.0001 0.0723208 no
                  XLOC_039714 7:75191965-75197308 A B OK 17.7657 23.718 0.416886 2.45736 0.0001 0.0723208 no
                  XLOC_042443 8:32215711-32217186 A B OK 21.0697 28.2631 0.423748 2.59296 0.0001 0.0723208 no
                  XLOC_044102 9:23214760-23215741 A B OK 156.312 118.895 -0.394742 -2.33721 0.0001 0.0723208 no
                  XLOC_008940 14:31506583-31512448 A B OK 1.99747 1.38924 -0.523873 -2.09315 0.00015 0.0982472 no
                  XLOC_021634 20:44509173-44509769 A B OK 1.96707 4.6672 1.24651 3.83803 0.00015 0.0982472 no
                  XLOC_024274 22:7561581-7578214 A B OK 1.88902 3.07738 0.704064 2.46213 0.00015 0.0982472 no
                  XLOC_034952 5:66144043-66144594 A B OK 0 5.13361 inf #NAME? 0.00015 0.0982472 no
                  XLOC_041687 8:25134454-25148890 A B OK 3.58671 5.59972 0.642694 1.98236 0.00015 0.0982472 no
                  XLOC_019070 2:36641516-36653086 A B OK 1.98706 3.04699 0.616751 1.95934 0.0002 0.119703 no
                  XLOC_024666 22:37648352-37682613 A B OK 4.41966 6.45222 0.545861 2.50274 0.0002 0.119703 no
                  XLOC_025299 23:17954114-17955768 A B OK 9.64916 13.021 0.432364 1.79451 0.0002 0.119703 no
                  XLOC_026936 24:37059294-37061674 A B OK 14.8685 11.1599 -0.413935 -2.099 0.0002 0.119703 no
                  XLOC_037155 6:8161171-8185521 A B OK 3.49529 4.95005 0.502031 2.12399 0.0002 0.119703 no
                  XLOC_000264 1:24035505-24073635 A B OK 4.04379 6.0967 0.592321 1.9311 0.00025 0.139976 no
                  XLOC_010116 15:454559-541027 A B OK 662.216 163.827 -2.01512 -33.7345 0.00025 0.139976 no
                  XLOC_026603 23:46192824-46193823 A B OK 3.84477 6.04824 0.653617 2.71933 0.00025 0.139976 no
                  XLOC_042098 8:3896110-3969017 A B OK 4.34482 7.91808 0.865855 2.36129 0.00025 0.139976 no
                  XLOC_043936 9:4736133-4862966 A B OK 3.07409 4.28154 0.477969 1.98661 0.00035 0.192856 no

                  Now I know things have changed for Cufflinks 2.1.1 where it estimates the p-values differently than previous versions, but it just seems weird to see so many genes (this isn't all of them) with the same p- and q-values. It might just be a result of this change in the new version of Cufflinks. If it is please let me know. Again, past experience dictates this feeling since I've never seen that many duplicate p- or q-values before...

                  I might not have given an adequate account of the third comment so I'll try and explain myself better.

                  In my original post I said that running CuffDiff with just the control and treatment samples (lets say A1, A2, A3 and B1, B2, B3) gave ~40 genes with a q-value < 0.05. If I just add more data into CuffDiff by using older RNA-seq data (let's call these C1, C2, C3 and D1, D2, D3) and using the -L option to do all the possible pairwise comparisons (mainly out of curiosity) and again only look at the pairwise comparison of the samples A and B my significance gene lists jumps to ~200. And now genes with a p-value of even 0.01 now have significant q-values (albeit, they're ~0.04). It just seems odd.

                  I guess what I am trying to say is I would expect increasing the total number of significance tests (jumping from ~32,000 to +100,000) would lead to a much more stringent FDR when testing for differential expression. Unless I am completely off kilter (which is definitely likely) with how CuffDiff would treat FDR calculations when it does a bunch of pairwise comparisons. Could it be that what is driving the change is only the fact that I've added more data to the point that now CuffDiff estimates the pooled gene dispersion with higher accuracy so the variance between samples is decreased?

                  Maybe I'm just talking myself in circles. I dunno

                  Again, thanks for all the help and for keeping it constructive.

                  Comment


                  • #10
                    I agree that it seems odd to have so many identical p-value. You'd have to look into the internals of cufflinks to see if this seems reasonable or not (or just use something else, like DESeq2 or edgeR, that's easier to debug).

                    Regarding the FDR adjustment, remember that the BH method is dependent upon the distribution of p-values. So the more DE genes you have the lower the bar ends up being for a raw p-value to still be significant after adjustment. Consequently, a raw p-value of 0.01 might be significant after adjustment in one experiment (perhaps you're looking at cancer) and not another (almost anything I work on at the moment). Similarly, batch effects will often introduce a good number of differences, which will change the distribution and thereby the threshold for what still ends up being significant after correction.

                    Wipe memories of Bonferroni (where what you expected would have been the case) from your mind

                    Regarding the dispersion calculations, how cuffdiff will end up doing this will depend on how you had things setup. By default, it uses a pooled estimate approach, so the estimate will change with added groups. If you used "per-condition" then this probably wouldn't be an issue (of course you then need more replicates to get a good estimate, so...).

                    Comment


                    • #11
                      Hello

                      I faced similar problem but how to solve the problem anyway?

                      I just ran cuffdiff across 30 human tissues to detect differentially expressed across tissues. The file output from splicing.diff is a file telling you that a differential splice variant between 2 tissues for all 30 tissues, is that right? But why in my case, I filtered the 'yes' for significantly expressed, the p-value most of the genes is 5e-5 (0.00005)? Is that the output from the cuffdiff? How can I get the real number of p-value so that I can rank and get the gene list from it?

                      my command line was: $ cuffdiff -o diffout -b ../genome.fa -p 8 -L <a list of 30 human tissues> -u merged_asm/merged.gtf <list of bam files>

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      8 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      8 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      49 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X