SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Multiple Testing correction for Differentially Expressed Genes cdes79 RNA Sequencing 0 09-19-2013 04:08 AM
multiple testing correction question gene_x Bioinformatics 4 11-08-2012 11:51 PM
multiple FPKM problem for single gene in gene_exp.diff after running cuffdiff ngs RNA Sequencing 4 03-30-2011 01:55 PM
cuffdiff0.9.2: output problem when input multiple SAM files mrfox Bioinformatics 0 11-22-2010 07:53 AM
cufflinks problem: one gene have multiple rpkm values calliopsis Bioinformatics 1 05-25-2010 08:18 AM

Reply
 
Thread Tools
Old 11-05-2014, 01:48 AM   #1
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default The multiple correction problem in pathways

Background: What I want to do is to take my RNA-seq data (coming from Tophat2 --> featureCounts --> DESeq2, with associated p-values) and see differentially expressed genes in a specified pathways - for example, all the genes specified to be in the MAPK/ERK pathways according to KEGG.

So, the multiple testing problem... I was told that, if I were only interested in the subset of all the genes reported by DESeq2, I would need to subset the data according to whatever gene list and perform multiple testing correction on that subset only, rather than doing it on the whole of the genome. Firstly, is this correct? Assuming this was correct, I performed standard Benjamini-Hochberg FDR correction with the p.adjust function in R for my subset of pathway genes.

I have since read up a bit more on the various FDR correction methods, and as far as I can tell they all rely on the assumption that the data is independent (at least the BH method). My particular problem is then how I could use any of those methods when I'm quite certain that the genes in my subset are, in fact, dependent, seeing as they're all part of the same pathway?

How do you guys perform multiple testing corrections with data that you expect to be dependent?

Last edited by ErikFas; 11-05-2014 at 02:46 AM.
ErikFas is offline   Reply With Quote
Old 11-05-2014, 04:09 AM   #2
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245
Default

You are selecting your differentially expressed genes based on differences in their relative expression from your whole transcriptome experiment. Those tests of significance are independent.

What you do with your list of genes that pass your chosen significance threshold(s) is a separate issue, such as an ontology enrichment analysis using those genes which are determined to have been differentially expressed.

You are talking about two distinct analyses - significance testing of whole transcriptome data to determine differentially expressed genes, and then an enrichment analysis (of some kind) to see how those differentially expressed genes reflect some signal of known functional ontology (which will have its own separate multiple test corrections, depending on how many gene sets are in the ontology and how many genes you provide in your list of differentially expressed genes).

What you do not want to do is choose a set of genes initially based solely on a limited functional basis, and then only test for differential expression amongst those. By analyzing differential expression from the entire detected transcriptome, you maximize your variance model for that large data set. If you only wanted to look at a very few select genes all along, then you should have designed an experiment to measure expression of just those and not the whole transcriptome. Instead, with whole transcriptome data, you want to first determine which of your genes, amongst all that were detected, appear to actually be differentially expressed, and then see if, amongst those, you have any evidence of a particular functional pathway being over-represented (i.e. beyond what chance alone would predict from a whole transcriptome snapshot of expression).

Also, just to be clear, the p-value or FDR for a gene's test for differentially expression is meaningless in terms of its representation amongst a given ontology categories elements, and particularly so if you have arbitrarily pulled out just the genes for a given pathway to being with. The real question is, amongst the genes you determine (independently) to be differentially expressed, which are represented in defined ontology categories and does the number in an ontology category indicate a functional group representation that is greater than expected by chance?
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.

Last edited by mbblack; 11-05-2014 at 04:24 AM.
mbblack is offline   Reply With Quote
Old 11-05-2014, 04:39 AM   #3
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Okay, so if I understand you correctly what I was told to do was either wrong or me misinterpreting it. This is what I was told, simply:

RNA-seq experiment --> DESeq2 (p-values) --> subset for genes of interest --> FDR calculations for subset

I was told this was to make sure that no DEGs were missed in the subset because of the larger n (i.e. 20000ish) for the transcriptome, and that I needed to use the smaller n (e.g. 300) for the subset for the FDR values to be correct. (For the DE-analysis of the whole transcriptome (which I do as well) I course have done the FDR calculations with n=20000ish.) Is this wrong, or did I misinterpret the instructions? If I understand you correctly you're saying I should subset the data after the FDR calculations.
ErikFas is offline   Reply With Quote
Old 11-05-2014, 05:05 AM   #4
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245
Default

If the intent of your analysis is to see if you have a significant over-representation of MAPK/ERK pathway genes in your organism or treatment, then yes, that is a biased approach. By limiting your testing of differential gene expression to only the subset of genes of specific interest, you are biasing your determination of whether your original experiment actually shows any evidence for those genes or that pathway being involved. You actually simultaneously sampled relative expression levels from the whole available transcriptome, not just those select few genes.

What I would be doing is:

RNA-seq whole transcriptome experiment -> test for differential gene expression (by whatever tool and whatever significance and/or magnitude thresholds you feel appropriate for calling a gene differentially expressed) -> test if, amongst all your differentially expressed genes, you have a significant over-representation of the functional group of interest.

So lets say you use an FDR<0.05 and an absolute Fold Change of >1.5 as your signifance thresholds for differential expression, and those cutoffs give you a list of 900 putatively differentially expressed genes. The question then is, if you use those 900 genes in a functional ontology enrichment (say a classic hypergeometric distribution analysis), do you have evidence of a significant over-representation of MAPK/ERK pathway genes in that list. I would NOT arbitrarily subset that gene list for enrichment, as again that biases the results by per-suposing a gene list of interest. By doing a whole transcriptome experiment, the intent (usually) is to see what, if any evidence is present for a particular functional group being affected, not to check if specific prior selected genes are affected.

That hypergeometric analysis will give you its own properly adjusted p-value for the overlap of some subset of your genes and the defined gene lists in the ontology. If you find that of the 900 DE genes, there are >5 overlapping (or some other arbitrary minimum overlap threshold) with the MAPK/ERK pathway list, and with a pathway overlap FDR<0.05, you can reasonably say that this particular pathway is indicated as having been affected by your experimental treatment.

If you already knew that only the MAPK/ERK pathway was of interest or would be affected, then I do not understand why one would even perform a whole transcriptomic differential expression experiment in the first place.
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.

Last edited by mbblack; 11-05-2014 at 05:48 AM.
mbblack is offline   Reply With Quote
Old 11-05-2014, 05:52 AM   #5
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Thanks for your informative replies! I feel I'm probably not being very clear, though; let me put my question in another way. Given that I already have whole-transcriptome data, which was either given to me or created for a different purpose (i.e. I was not part of the experimental design), I want to do a "spot-check" for the MAPK/ERK pathway. I would like to see the differential expression of the genes within the pathway, which I could then visualize. So, in this context, my intent is not to find over-represented pathways, although that is certainly something that I aim to do with the data.

For example, if I'm using the gene list of the KEGG MAPK/ERK pathway, I could then layer the differential expression data (i.e. fold change + significance) on top of a KEGG network image of the pathway (by using Cytoscape or another visualization program).

Is that a clearer way of describing what I'm after? The reason I'm interested in this is that I have RNA-seq data that, when created, was meant for just FPKM analyses within various cell types for a database unrelated to myself. I'm currently looking into various other ways I could use this data for my own projects, and this was one of the ideas I had.

Last edited by ErikFas; 11-05-2014 at 05:54 AM.
ErikFas is offline   Reply With Quote
Old 11-05-2014, 10:52 AM   #6
ronton
Member
 
Location: US

Join Date: Jun 2014
Posts: 34
Default

Interpreting sequencing data is like a huge statistics problem and I would say that several correction steps are often employed such as 'rewarding,' 'penalizing,' etc.

Hopefully others have more specific answers.
ronton is offline   Reply With Quote
Old 11-05-2014, 11:12 AM   #7
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245
Default

Quote:
Originally Posted by ErikFas View Post
Thanks for your informative replies! I feel I'm probably not being very clear, though; let me put my question in another way. Given that I already have whole-transcriptome data, which was either given to me or created for a different purpose (i.e. I was not part of the experimental design), I want to do a "spot-check" for the MAPK/ERK pathway. I would like to see the differential expression of the genes within the pathway, which I could then visualize. So, in this context, my intent is not to find over-represented pathways, although that is certainly something that I aim to do with the data.

For example, if I'm using the gene list of the KEGG MAPK/ERK pathway, I could then layer the differential expression data (i.e. fold change + significance) on top of a KEGG network image of the pathway (by using Cytoscape or another visualization program).

Is that a clearer way of describing what I'm after? The reason I'm interested in this is that I have RNA-seq data that, when created, was meant for just FPKM analyses within various cell types for a database unrelated to myself. I'm currently looking into various other ways I could use this data for my own projects, and this was one of the ideas I had.
Even in that situation, I would not adjust the significance level based on just that subset of genes. You could still pull out all the genes for that pathway, regardless of significance, and display them with their fold change values and their p-value or FDR. But that significance should be indicative of the significance as determined by the original experiment. I think it would misleading to now alter the FDR values based on a sub-sample that does not in fact reflect the sampling from which the gene's relative expression was actually based on.

If that were the case, you would need to not merely adjust the FDR based on the reduced sample size, but you would need to go back to the raw data and repeat your statistical test of significance for just that subset (re-normalize with just that small subset, and repeat your statistical analysis with just that small subset), and then adjust those newly derived p-values. You cannot simply adjust the p-values from the original analysis as those p-values were based on a variance model from the whole transcriptome data set. Your small subset of genes may represent a very different set of data than the original complete data set.

It may well be the case that some genes that were statistically significant in the original analysis are not at all significant when analyzed as part of a select sub-sample of the original data since the two sets of data may represent very different distributions of values. In that case, a simple re-calculation of the FDR using the original analysis p-values may be very misleading.

If it were me, I would not alter the statistical significance results from the DE analysis at all. You can still include non-significant genes in your figure or graph if you wish to show that their fold change was still consistent with the others (or not as the case may be). But you want the significance to reflect the actual differential gene expression determination from the original DE analysis. Otherwise your presentation is misleading in terms of actual DE relative to the overall cellular background transcript abundance you actually measured.
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.

Last edited by mbblack; 11-05-2014 at 11:28 AM.
mbblack is offline   Reply With Quote
Old 11-06-2014, 04:45 AM   #8
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Thank you VERY much, that explains it very well and now I understand your reasoning. I will ask my guy for his thoughts on this.
ErikFas is offline   Reply With Quote
Old 01-12-2015, 06:30 AM   #9
LauGP
Junior Member
 
Location: Havana

Join Date: Nov 2014
Posts: 7
Default Pathways and networks analysis of RNA-Seq data

Hello,

Maybe this is a bit unrelated with this thread, but I will try it ;-)
After obtaining a DEGs list from an RNA-Seq experiment (Tophat-HTSeq Count-DESeq2), I would like to check which biological functions and pathways are over-represented/enriched in my dataset. First I used the IPA web app and I got some Top Diseases and Bio Functions, and Top networks as well. However I am concerned about the gene length bias which is typical in RNA-Seq data. I wrote to IPA customer support to ask about this, but still I didnt get replies. Does anyone know if IPA analysis takes into account the gene length bias?
At this moment Im starting GO analysis with GOSeq, which I have read its a tool where the gene length is taken into account for GO overrepresentation tests. Overall Im confused about if GOSeq could be a suitable substitute for IPA.
LauGP is offline   Reply With Quote
Old 01-12-2015, 07:59 PM   #10
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45
Default

It's been a while since I used IPA, but does it have a "this data came from RNAseq" check button when you upload your DEGlist? I'm guessing not, but even if it did, the "length of a gene" is dependent on which annotation set you used for HTSeq Count and TopHat, which IPA also doesn't know about.

Finally, you'll get alot more replies if you start your own thread rather than tacking a completely different topic onto someone else's thread.
mikep is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:21 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO