Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • The multiple correction problem in pathways

    Background: What I want to do is to take my RNA-seq data (coming from Tophat2 --> featureCounts --> DESeq2, with associated p-values) and see differentially expressed genes in a specified pathways - for example, all the genes specified to be in the MAPK/ERK pathways according to KEGG.

    So, the multiple testing problem... I was told that, if I were only interested in the subset of all the genes reported by DESeq2, I would need to subset the data according to whatever gene list and perform multiple testing correction on that subset only, rather than doing it on the whole of the genome. Firstly, is this correct? Assuming this was correct, I performed standard Benjamini-Hochberg FDR correction with the p.adjust function in R for my subset of pathway genes.

    I have since read up a bit more on the various FDR correction methods, and as far as I can tell they all rely on the assumption that the data is independent (at least the BH method). My particular problem is then how I could use any of those methods when I'm quite certain that the genes in my subset are, in fact, dependent, seeing as they're all part of the same pathway?

    How do you guys perform multiple testing corrections with data that you expect to be dependent?
    Last edited by ErikFas; 11-05-2014, 03:46 AM.

  • #2
    You are selecting your differentially expressed genes based on differences in their relative expression from your whole transcriptome experiment. Those tests of significance are independent.

    What you do with your list of genes that pass your chosen significance threshold(s) is a separate issue, such as an ontology enrichment analysis using those genes which are determined to have been differentially expressed.

    You are talking about two distinct analyses - significance testing of whole transcriptome data to determine differentially expressed genes, and then an enrichment analysis (of some kind) to see how those differentially expressed genes reflect some signal of known functional ontology (which will have its own separate multiple test corrections, depending on how many gene sets are in the ontology and how many genes you provide in your list of differentially expressed genes).

    What you do not want to do is choose a set of genes initially based solely on a limited functional basis, and then only test for differential expression amongst those. By analyzing differential expression from the entire detected transcriptome, you maximize your variance model for that large data set. If you only wanted to look at a very few select genes all along, then you should have designed an experiment to measure expression of just those and not the whole transcriptome. Instead, with whole transcriptome data, you want to first determine which of your genes, amongst all that were detected, appear to actually be differentially expressed, and then see if, amongst those, you have any evidence of a particular functional pathway being over-represented (i.e. beyond what chance alone would predict from a whole transcriptome snapshot of expression).

    Also, just to be clear, the p-value or FDR for a gene's test for differentially expression is meaningless in terms of its representation amongst a given ontology categories elements, and particularly so if you have arbitrarily pulled out just the genes for a given pathway to being with. The real question is, amongst the genes you determine (independently) to be differentially expressed, which are represented in defined ontology categories and does the number in an ontology category indicate a functional group representation that is greater than expected by chance?
    Last edited by mbblack; 11-05-2014, 05:24 AM.
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.

    Comment


    • #3
      Okay, so if I understand you correctly what I was told to do was either wrong or me misinterpreting it. This is what I was told, simply:

      RNA-seq experiment --> DESeq2 (p-values) --> subset for genes of interest --> FDR calculations for subset

      I was told this was to make sure that no DEGs were missed in the subset because of the larger n (i.e. 20000ish) for the transcriptome, and that I needed to use the smaller n (e.g. 300) for the subset for the FDR values to be correct. (For the DE-analysis of the whole transcriptome (which I do as well) I course have done the FDR calculations with n=20000ish.) Is this wrong, or did I misinterpret the instructions? If I understand you correctly you're saying I should subset the data after the FDR calculations.

      Comment


      • #4
        If the intent of your analysis is to see if you have a significant over-representation of MAPK/ERK pathway genes in your organism or treatment, then yes, that is a biased approach. By limiting your testing of differential gene expression to only the subset of genes of specific interest, you are biasing your determination of whether your original experiment actually shows any evidence for those genes or that pathway being involved. You actually simultaneously sampled relative expression levels from the whole available transcriptome, not just those select few genes.

        What I would be doing is:

        RNA-seq whole transcriptome experiment -> test for differential gene expression (by whatever tool and whatever significance and/or magnitude thresholds you feel appropriate for calling a gene differentially expressed) -> test if, amongst all your differentially expressed genes, you have a significant over-representation of the functional group of interest.

        So lets say you use an FDR<0.05 and an absolute Fold Change of >1.5 as your signifance thresholds for differential expression, and those cutoffs give you a list of 900 putatively differentially expressed genes. The question then is, if you use those 900 genes in a functional ontology enrichment (say a classic hypergeometric distribution analysis), do you have evidence of a significant over-representation of MAPK/ERK pathway genes in that list. I would NOT arbitrarily subset that gene list for enrichment, as again that biases the results by per-suposing a gene list of interest. By doing a whole transcriptome experiment, the intent (usually) is to see what, if any evidence is present for a particular functional group being affected, not to check if specific prior selected genes are affected.

        That hypergeometric analysis will give you its own properly adjusted p-value for the overlap of some subset of your genes and the defined gene lists in the ontology. If you find that of the 900 DE genes, there are >5 overlapping (or some other arbitrary minimum overlap threshold) with the MAPK/ERK pathway list, and with a pathway overlap FDR<0.05, you can reasonably say that this particular pathway is indicated as having been affected by your experimental treatment.

        If you already knew that only the MAPK/ERK pathway was of interest or would be affected, then I do not understand why one would even perform a whole transcriptomic differential expression experiment in the first place.
        Last edited by mbblack; 11-05-2014, 06:48 AM.
        Michael Black, Ph.D.
        ScitoVation LLC. RTP, N.C.

        Comment


        • #5
          Thanks for your informative replies! I feel I'm probably not being very clear, though; let me put my question in another way. Given that I already have whole-transcriptome data, which was either given to me or created for a different purpose (i.e. I was not part of the experimental design), I want to do a "spot-check" for the MAPK/ERK pathway. I would like to see the differential expression of the genes within the pathway, which I could then visualize. So, in this context, my intent is not to find over-represented pathways, although that is certainly something that I aim to do with the data.

          For example, if I'm using the gene list of the KEGG MAPK/ERK pathway, I could then layer the differential expression data (i.e. fold change + significance) on top of a KEGG network image of the pathway (by using Cytoscape or another visualization program).

          Is that a clearer way of describing what I'm after? The reason I'm interested in this is that I have RNA-seq data that, when created, was meant for just FPKM analyses within various cell types for a database unrelated to myself. I'm currently looking into various other ways I could use this data for my own projects, and this was one of the ideas I had.
          Last edited by ErikFas; 11-05-2014, 06:54 AM.

          Comment


          • #6
            Interpreting sequencing data is like a huge statistics problem and I would say that several correction steps are often employed such as 'rewarding,' 'penalizing,' etc.

            Hopefully others have more specific answers.

            Comment


            • #7
              Originally posted by ErikFas View Post
              Thanks for your informative replies! I feel I'm probably not being very clear, though; let me put my question in another way. Given that I already have whole-transcriptome data, which was either given to me or created for a different purpose (i.e. I was not part of the experimental design), I want to do a "spot-check" for the MAPK/ERK pathway. I would like to see the differential expression of the genes within the pathway, which I could then visualize. So, in this context, my intent is not to find over-represented pathways, although that is certainly something that I aim to do with the data.

              For example, if I'm using the gene list of the KEGG MAPK/ERK pathway, I could then layer the differential expression data (i.e. fold change + significance) on top of a KEGG network image of the pathway (by using Cytoscape or another visualization program).

              Is that a clearer way of describing what I'm after? The reason I'm interested in this is that I have RNA-seq data that, when created, was meant for just FPKM analyses within various cell types for a database unrelated to myself. I'm currently looking into various other ways I could use this data for my own projects, and this was one of the ideas I had.
              Even in that situation, I would not adjust the significance level based on just that subset of genes. You could still pull out all the genes for that pathway, regardless of significance, and display them with their fold change values and their p-value or FDR. But that significance should be indicative of the significance as determined by the original experiment. I think it would misleading to now alter the FDR values based on a sub-sample that does not in fact reflect the sampling from which the gene's relative expression was actually based on.

              If that were the case, you would need to not merely adjust the FDR based on the reduced sample size, but you would need to go back to the raw data and repeat your statistical test of significance for just that subset (re-normalize with just that small subset, and repeat your statistical analysis with just that small subset), and then adjust those newly derived p-values. You cannot simply adjust the p-values from the original analysis as those p-values were based on a variance model from the whole transcriptome data set. Your small subset of genes may represent a very different set of data than the original complete data set.

              It may well be the case that some genes that were statistically significant in the original analysis are not at all significant when analyzed as part of a select sub-sample of the original data since the two sets of data may represent very different distributions of values. In that case, a simple re-calculation of the FDR using the original analysis p-values may be very misleading.

              If it were me, I would not alter the statistical significance results from the DE analysis at all. You can still include non-significant genes in your figure or graph if you wish to show that their fold change was still consistent with the others (or not as the case may be). But you want the significance to reflect the actual differential gene expression determination from the original DE analysis. Otherwise your presentation is misleading in terms of actual DE relative to the overall cellular background transcript abundance you actually measured.
              Last edited by mbblack; 11-05-2014, 12:28 PM.
              Michael Black, Ph.D.
              ScitoVation LLC. RTP, N.C.

              Comment


              • #8
                Thank you VERY much, that explains it very well and now I understand your reasoning. I will ask my guy for his thoughts on this.

                Comment


                • #9
                  Pathways and networks analysis of RNA-Seq data

                  Hello,

                  Maybe this is a bit unrelated with this thread, but I will try it ;-)
                  After obtaining a DEGs list from an RNA-Seq experiment (Tophat-HTSeq Count-DESeq2), I would like to check which biological functions and pathways are over-represented/enriched in my dataset. First I used the IPA web app and I got some Top Diseases and Bio Functions, and Top networks as well. However I am concerned about the gene length bias which is typical in RNA-Seq data. I wrote to IPA customer support to ask about this, but still I didn´t get replies. Does anyone know if IPA analysis takes into account the gene length bias?
                  At this moment I´m starting GO analysis with GOSeq, which I have read it’s a tool where the gene length is taken into account for GO overrepresentation tests. Overall I´m confused about if GOSeq could be a suitable substitute for IPA.

                  Comment


                  • #10
                    It's been a while since I used IPA, but does it have a "this data came from RNAseq" check button when you upload your DEGlist? I'm guessing not, but even if it did, the "length of a gene" is dependent on which annotation set you used for HTSeq Count and TopHat, which IPA also doesn't know about.

                    Finally, you'll get alot more replies if you start your own thread rather than tacking a completely different topic onto someone else's thread.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    7 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    7 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    49 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    66 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X