Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cuffdiff q values

    I am having trouble understanding Cuffdiff q-values. I have a gene, for example, that appears to be differentially expressed but is not called so even when using either two or three replicates. In two replicates, the gene levels appear like this:

    Condtion1 Rep 1 61.8278 58.4906 65.1649 OK
    Condition2 Rep 1 0.416481 0.176026 0.656937 OK

    Condition 1 Rep 2 52.4839 50.2009 54.7669 OK
    Condition 2 Rep 2 3.07902 2.52377 3.63427 OK

    but somehow, even though the p-value is significant, the q-value is almost one.

    Can anyone explain to me how this q-value adjustment takes place, and what other information in my data might be causing the adjustment to be so high? This gene seems to me, along with other genes in the data, like it should have been called differentially expressed.

    As a note: I ran cuffdiff on these two samples separately, not using replicates (just out of curiosity) and in the second pair, this gene is not called differentially expressed, though it is in the first. What's going on?

  • #2
    cuffdiff, as well as probably all other differential expression test softwares, use the false discovery rate correction described here: http://en.wikipedia.org/wiki/False_d...berg_procedure

    the reasons are probably described pretty well in that Wikipedia article. if you have 100 genes between two conditions you're essentially testing those conditions 100 times (even though each test is between different genes). so the p-values from the tests have to be corrected. my opinion is it's a little weird since that means the correction is influenced by how many genes you test (so you can cheat it a little by excluding genes which you think aren't testable). whatever - statistics don't always make sense.

    in your case you're not seeing an issue with the FDR correction. cuffdiff suffers from something else. just like several other tools, cuffdiff uses a parametric statistical test to test for significantly misexpressed genes. in order to do that kind of a test one usually needs a mean and a variance and possibly a number of degrees of freedom. one also needs to know something about the distribution of the metrics that are being compared. so what these tests to is model the expression distributions and then extract variance values from the models and plug them into some type of stat test (cuffdiff ends up using something similar to a t-test). last year people were thinking the poisson distribution was applicable to read-count data but now most of these tools use the negative-binomial distribution. i think in some cases a certain amount of pooled information goes into these models. obviously something is broken because emperically it's completely obvious that a change from 62 to 0.4 in expression is massive and if those were means from normal distributions and you used a t-test you'd have a p-value so small that the FDR correction would have an insignificant impact on its value.

    so the problem here is likely cuffdiff's modeling and estimation of the variance for this gene.

    the only thing i can recommend is for you to try a different tool to test for differential expression. unfortunately nothing is as tidy as the tophat, cuffdiff pipeline but it will better for you to try something else for this test. i guarantee that you'll be more satisfied with the results. DESeq, is a good tool. I've read that ebseq is a smart new tool as well but I haven't tried it out. both of those run in R and both of them need count data so that means you'll need read counts for your genes. HTSeq count can do a good job of counting hits at genes from your alignments assuming you us an annotation that fits. The author would recommend the ensemble annotation. all i can recommend is for you to use something that gives you a unique name per gene locus.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      Thanks for your helpful response. I had looked into DESeq, but it also didn't return what I expected (and also doesn't handle multi-mapping reads, which I have a lot of). I have been thinking our data may be more variable than we can work with, so I'm looking into mask files and this sort of thing, but I'm wondering - is there a standard metric(s) for reporting variability? I'm pretty new at this, and wondering, generally, how I might decide whether or not my samples are usable for this kind of analysis.

      Comment


      • #4
        Dear community, dear sdriscoll,
        I have a question which could be a related with cuffdiff.
        By comparing the output from isoform_exp.diff and gene_exp.diff it looks like their is less trasnscripts which are differentially expressed than genes. This sound like counter intuitive to me as I would expect at least the same number in both cases. Could you explain that?

        Comment


        • #5
          I'm pretty sure in one of their papers they point out the possibility that there may be differential isoform expression but not differential locus expression. It is possible that there could be isoform switching or activation in a locus with significantly disturbing the total read count level of a locus. I would expect, however, that a differentially expressed locus would also have differentially expressed isoforms in it. Since cuffdiff models the variances separately for the isoform test and the locus level test sometimes it might report a single isoform gene as significant in one test but not the other. That I think is an unfortunate side effect of the statistics used and the variance estimations. The p-values are just a guide. Every gene may actually have different tolerances to differential expression. For some genes a two fold change could be significant regardless of the statistical test. It's important to keep that in mind.
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          31 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          53 views
          0 likes
          Last Post seqadmin  
          Working...
          X