  • fdr and p-values for NGS comparison ttest

    Hello, I am new to NGS analysis, but that's the task I've been assigned. I have received NGS data that I am trying to decipher.

    I am attempting to learn what exactly is meant by "unadjusted p-value" and "FDR" when looking at comparison t-tests of genes (the comparisons are between NGS data from animals treated with drug or placebo). I understand the basic concepts, but not how to make practical use of them. Most of the values seem fairly large (well over 0.1 for p-values, and in the 0.1 to 0.9 range for FDR) in data sets of ~20,000 to 40,000 genes. My goal is to determine a cutoff for each that would let me gate on the genes with meaningful expression differences. Is there a specific value I should use as the boundary, or some way to calculate it from the sample size or something?

  • #2
    Bueller? Bueller?


    • #3
      Um, he's sick. My best friend's sister's boyfriend's brother's girlfriend heard from this guy who knows this kid who's going with the girl who saw Ferris pass out at 31 Flavors last night. I guess it's pretty serious.

      Anyway, ignore unadjusted p-values, they're largely useless when you do multiple testing. An adjusted p-value (or FDR) threshold of 0.1 is pretty common, though there's wiggle room there.
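      For reference, the FDR column in this kind of report is usually a Benjamini-Hochberg adjusted p-value. A minimal pure-Python sketch of the procedure (illustrative only, not any vendor's actual implementation; the example p-values are made up):

      ```python
      def bh_adjust(pvals):
          """Benjamini-Hochberg adjusted p-values (the usual 'FDR' column)."""
          m = len(pvals)
          # Indices of p-values sorted from smallest to largest.
          order = sorted(range(m), key=lambda i: pvals[i])
          adjusted = [0.0] * m
          running_min = 1.0
          # Walk from the largest p-value down, enforcing monotonicity:
          # adjusted value at rank r is min over ranks >= r of p * m / rank.
          for rank in range(m, 0, -1):
              i = order[rank - 1]
              running_min = min(running_min, pvals[i] * m / rank)
              adjusted[i] = running_min
          return adjusted

      # Made-up p-values for seven genes:
      print(bh_adjust([0.0001, 0.004, 0.019, 0.095, 0.201, 0.46, 0.74]))
      ```

      At an adjusted threshold of 0.1 only the three smallest survive here, even though five of the seven raw p-values are below 0.1.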


      • #4
        Thanks.

        Almost none of the FDRs are below 0.1 (~20 out of 20,000 genes compared). The company that did the sequencing says it's because we didn't have enough samples (three mice per group) and that we should use the unadjusted p-value instead.

        I was thinking instead I would just ignore the p-values and FDR and focus on the log2FC. There were multiple group comparisons performed (drug A treated vs. vehicle, drug B vs. vehicle, drug A vs. untreated, drug B vs. untreated). So I figure if genes go up (or down) across the board, it's likely to be real. Does anyone see any problems there?

        I know none of this is hard fact, but no microarray or NGS study like this would be anyway.


        • #5
          3 mice per group is indeed a bit low for many uses (I generally use 6 per group). I would be hesitant to trust most companies to provide statistical analyses; in my experience, they tend not to have the remotest clue what they're doing. In any case, a company that suggests going with unadjusted p-values is too incompetent to be trusted (this wasn't Zymo Research, was it?).

          You might just relax the adjusted p-value threshold a bit and then rank candidates for follow-up by log2FC. The biggest issue with going directly by fold change is that you'll pick up a lot of noisy low-expressors, which tend to have larger fold changes simply due to noise.


          • #6
            Wouldn't the large fold change due to noise issue be at least moderated by the fact that I'm looking at multiple groups (e.g. if gene Z has a large FC in the Drug A vs. vehicle, Drug B vs. vehicle, Drug A vs. untreated, and Drug B vs. untreated then it's probably not a noise issue)?

            Or does the fact that I'm looking at such a large number of genes basically eliminate that benefit, since it's going to happen randomly at a certain frequency?

            I haven't even gotten into the individual signals yet (up until this sentence I've still been focusing on the ttest comparisons between treatments), but maybe that's where I go next--and eliminate genes whose signals are below a certain threshold.


            • #7
              With that many comparisons, you'd expect a number of low expressing genes to randomly show consistent fold-changes between group comparisons.
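              That can be checked with a quick simulation (a pure-Python sketch; the mean count, normal noise model, and fold-change threshold are illustrative assumptions, not the poster's data). Every gene below is null, i.e. truly unchanged, yet many show a "consistent" fold change in both comparisons simply because the two ratios share the same noisy drug-A group mean:

              ```python
              import random

              random.seed(1)
              genes, reps, mu = 20_000, 3, 5.0  # 3 mice per group, low-count regime

              def noisy_group_mean():
                  # Crude normal approximation to count noise (variance ~ mean),
                  # floored so the ratios below stay well-behaved.
                  draws = [random.gauss(mu, mu ** 0.5) for _ in range(reps)]
                  return max(0.5, sum(draws) / reps)

              consistent = 0
              for _ in range(genes):
                  drug_a = noisy_group_mean()    # shared by both comparisons
                  vehicle = noisy_group_mean()
                  untreated = noisy_group_mean()
                  # A single noisy drug-A mean inflates *both* fold changes at once.
                  if drug_a / vehicle > 1.5 and drug_a / untreated > 1.5:
                      consistent += 1

              print(consistent, "of", genes, "truly unchanged genes look consistently up")
              ```

              Hundreds of null genes pass both comparisons together, far more than the product of the two individual pass rates would suggest if the comparisons were independent.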


              • #8
                Originally posted by ArthurDeodat View Post
                Wouldn't the large fold change due to noise issue be at least moderated by the fact that I'm looking at multiple groups (e.g. if gene Z has a large FC in the Drug A vs. vehicle, Drug B vs. vehicle, Drug A vs. untreated, and Drug B vs. untreated then it's probably not a noise issue)?

                Or does the fact that I'm looking at such a large number of genes basically eliminate that benefit, since it's going to happen randomly at a certain frequency?

                I haven't even gotten into the individual signals yet (up until this sentence I've still been focusing on the ttest comparisons between treatments), but maybe that's where I go next--and eliminate genes whose signals are below a certain threshold.
                With 20,000 genes being compared, even a 1% per-test false positive rate (i.e., an unadjusted p < 0.01 cutoff) means 200 genes would be expected to change purely by chance. And that applies to each of your treatment comparisons. Even if you see the same gene with a greater than 2-fold change in all of them, you cannot assume that gene is truly differentially expressed in any, as in each case there is a fairly high probability of it being a false positive. That's the problem with using fold change alone across so many simultaneous comparisons. In your case, it's also compounded by the fact that your fold-change ratios are based on a mean of only 3 replicates, so not exactly the most robust data from which to estimate fold change. The whole point of an FDR is to control the proportion of false positives within the subset of items deemed significant. So an FDR limit of < 0.05 on a list of 20 genes passing that stringency means you can expect that roughly 1 of those 20 is a false positive.
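                The arithmetic behind those numbers, spelled out (figures taken straight from the post):

                ```python
                genes_tested = 20_000
                per_test_fpr = 0.01        # unadjusted p < 0.01 cutoff
                expected_by_chance = genes_tested * per_test_fpr
                print(expected_by_chance)  # about 200 genes flagged by chance per comparison

                fdr_limit = 0.05
                hits = 20
                expected_false_hits = fdr_limit * hits
                print(expected_false_hits)  # about 1 expected false positive in the hit list
                ```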

                However, you have two things working against you statistically. One is the low number of replicates, and there's nothing you can do about that now. The other is the sheer number of comparisons. If they show values for 20,000 genes, then they must have included zero-count data in the analysis. One of the arguments for excluding all zero-count data (aside from the fact that a zero merely indicates a lack of detection, not a lack of presence) is that you reduce the number of simultaneous comparisons to just the subset of data for which you actually had reliable counts. Many people in fact exclude more than zero-count data, using a minimum threshold of, say, a raw count of 5, 10, 20, or some other low but admittedly arbitrary cutoff for transcript inclusion. That may reduce your list of gene candidates to fewer than 10,000 (depending on your read depth), which in turn makes for a less harsh FDR correction (not simply because of the reduction in simultaneous comparisons, but also because you eliminate a lot of the very high variance, low-count data, and the actual distribution of p-values obtained also affects the FDR correction). Your ability to confidently detect differential gene expression may then be vastly improved.

                You may be far better off getting your raw mapped read data and performing the differential gene expression analysis yourself. Think about exploring the effects of different normalization algorithms, as normalization can have a profound effect on differential expression results. Also think about what data to include or exclude (e.g., throw out any gene with fewer than 5 or 10 counts in any sample, or with zero counts in any sample; consider those unreliable abundance estimates and simply exclude them).
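                That exclusion rule is easy to apply yourself once you have the raw counts. A toy sketch (the gene names and count values are made up; the cutoff of 5 is one of the arbitrary values mentioned above):

                ```python
                # Toy raw-count matrix: gene -> counts across all six samples (made up).
                counts = {
                    "geneA": [0, 0, 1, 0, 2, 0],        # mostly zeros: not reliably detected
                    "geneB": [12, 9, 15, 30, 22, 27],
                    "geneC": [4, 6, 3, 5, 7, 4],        # dips below the cutoff in three samples
                    "geneD": [150, 130, 160, 90, 110, 100],
                }

                MIN_COUNT = 5  # low, admittedly arbitrary inclusion threshold

                # Keep a gene only if every sample reaches the minimum count.
                kept = {g: c for g, c in counts.items() if min(c) >= MIN_COUNT}
                print(sorted(kept))  # only geneB and geneD survive the filter
                ```

                Halving the tested gene list this way also halves the multiple-testing burden the FDR correction has to pay for.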

                I know people hate throwing out data (though again, zero-count data indicates no detection, not true absence of a transcript; failure to detect cannot be equated with absence), but trying to include everything can in fact hurt your analysis and your interpretation of what you did detect with high confidence. Low-count data is also notoriously noisy (it has a very high variance relative to higher-count data), and that in turn can detract from detecting meaningful differential expression.

                Without knowing the details, and without having had a say in what was done and how, I would never simply trust a canned analysis or base my biological interpretations on analyses done by someone else.
                Last edited by mbblack; 07-09-2014, 07:11 AM.
                Michael Black, Ph.D.
                ScitoVation LLC. RTP, N.C.
