Hi,
I'm new to this forum, but I've been browsing it for a while looking for solutions to my problem. For a training exercise, I have to analyze data coming from a publication on Candida parapsilosis differential expression in two conditions: normoxia and hypoxia.
We were provided with two files: a microarray (chip) file with log ratios for each spot, and an RNA-seq count file.
Here's my problem: the original publication makes it very clear that a particular gene, CPAR2_403510, should be found as differentially expressed by RNA-seq.
They used FPKM and a "hte statistical test to compute significance of FPKM observed changes" (I have no clue what that could be).
EDIT: I guess "hte" is just a typo, but they don't give any further information. The paper says they used quantile-based normalization to identify differentially expressed genes and that they corrected with FDR.
So, following the DESeq vignette, as instructed by my teacher (she even provided a bare-bones script to start from, in case we made mistakes), I started sorting interesting genes by adjusted p-value.
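For reference, here is essentially what I'm running, a stripped-down sketch of the standard DESeq (not DESeq2) workflow from the vignette; the file name and sample labels are my own, not from the paper:

```r
library(DESeq)  # the original DESeq package, as imposed by my teacher

# raw counts: rows = genes, columns = samples (file name is just an example)
countTable <- read.table("counts.txt", header = TRUE, row.names = 1)
condition <- factor(c(rep("N", 6), rep("H", 4)))  # 6 normoxia, 4 hypoxia

cds <- newCountDataSet(countTable, condition)
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)   # I also tried method = "pooled" here

res <- nbinomTest(cds, "N", "H")
res <- res[order(res$padj), ]     # sort by adjusted p-value
head(res, 20)                     # my candidate "top 20" list
```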
Here's the data for my gene of interest:
> res[1948,]
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
CPAR2_403510 12900.24 2287.847 28818.83 12.59649 3.654949 0.05991648 0.1682318
And here's the corresponding row extracted from the raw count file:
id N1 N2 N3 N4 N5 N6 H1 H2 H3 H4
CPAR2_403510 1004 1424 1196 1388 6315 6779 5047 89097 3246 11171
There are, as you can see, 4 hypoxia replicates and 6 normoxia replicates. I already tried removing some of the replicates that looked odd on my heatmaps (N5 and N6 clustered together and made me suspect a batch effect, but whether I strip them or not, my results don't get any better).
Being totally new to this kind of analysis, and almost new to statistics in fact, I don't really know what I should do:
1) Should I rely on padj only?
2) Should I rely on log2FoldChange only?
3) Should I create a hybrid criterion taking both into account? (Is that even possible? I sense I would introduce some bias.)
4) Is there some other explanation?
5) Should I disregard the paper's information and just go with these results?
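To illustrate what I mean in (3), the only thing I could come up with is a simple double cutoff rather than a real combined statistic (column names as in my `res` table above; the thresholds 0.1 and 1 are arbitrary choices of mine):

```r
# keep genes passing both an FDR threshold and a fold-change threshold
hits <- subset(res, padj < 0.1 & abs(log2FoldChange) >= 1)
hits <- hits[order(hits$padj), ]  # rank the survivors by adjusted p-value
```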
I thought that cutting below a padj of 0.1 was already rather lax. But then again, I don't really know.
I don't know if I've provided enough information to be helped, but the goal of my exercise is to come up with a list of the ~20 most differentially expressed genes by cross-referencing the chip analysis (that one comes out fine, and I find all the genes I should in it) with the RNA-seq results. I have to admit that I often end up with something like 700 genes of interest, and I'm not sure about the relevance of pushing the p-value cutoff ever lower.
Would you have any clue? I can't use DESeq2, as DESeq was imposed (well, I could, but eventually I would have to redo it with DESeq). I've tried method = "pooled" in estimateDispersions, as I read in another thread, but the results seem much the same.
Thank you for your replies!