Our questions might be a bit different than most. As background for our lab, we are not experts in RNA-Seq data and are learning it as we go.
There have been a number of studies that have sequenced "normal" individuals, and we are interested in using this public data to answer a few simple questions. Specifically, we want to find the range of gene expression across "normal" individuals in the population for a very small number of genes (which we are interested in from our wet-lab work). We are not comparing normal against disease, or anything like that.
We were wondering if there is a recommended way to normalize this data so that we can compare the gene expression of Gene X in individual A to individual B (letting us ultimately determine the range of expression in Gene X for all individuals).
We know that just using RPKM values is not the way to go. Housekeeping genes have many issues as a reference (for instance, there is no guarantee that they are actually stable across individuals). We have looked at quantile-normalized values, but will this let us compare between individuals the way we want?
Naively, we are thinking that percent ranking of gene X against all genes in all individuals might work well. We are thinking this would also let us compare between studies and potentially even between species. But then this would remove the resolution (for instance, the number 2 gene vs. the number 3 gene might actually have a very large difference in expression, even if their "ranks" are nearly identical).
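To make the percent-rank idea and our resolution worry concrete, here is a rough sketch of what we have in mind. Everything in it is an assumption for illustration: the gene names and expression values are made-up toy numbers, `percent_rank` is our own hypothetical helper, and we rank within each individual (ranking across all individuals pooled together would be a different choice).

```python
def percent_rank(expression, gene):
    """Fraction of the other genes in one sample whose expression is
    strictly below that of `gene` (0.0 = lowest, 1.0 = highest)."""
    target = expression[gene]
    others = [value for name, value in expression.items() if name != gene]
    below = sum(1 for value in others if value < target)
    return below / len(others)


# Hypothetical toy values in arbitrary units, NOT real data.
individual_a = {"GeneX": 50.0, "Gene1": 10.0, "Gene2": 500.0, "Gene3": 5.0, "Gene4": 49.0}
individual_b = {"GeneX": 400.0, "Gene1": 10.0, "Gene2": 500.0, "Gene3": 5.0, "Gene4": 49.0}

rank_a = percent_rank(individual_a, "GeneX")
rank_b = percent_rank(individual_b, "GeneX")

print(rank_a, rank_b)
```

In this toy case Gene X is 8-fold higher in individual B than in individual A, yet both get the same percent rank because Gene X sits between the same neighbours in both samples. That is the loss of resolution we are worried about.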
We've also thought about using "total expression" as the denominator. That is, we would divide the expression of Gene X by the total expression in each individual. We know people have shied away from this for differential expression analysis, but since we already know which genes we want to look at, we are thinking this method "could work". But as with percent rank, we're not sure if there are limitations that we are missing.
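As a minimal sketch of the "total expression" idea, assuming we start from per-gene read counts for each individual: the helper `per_million` and all counts below are hypothetical, and the sketch ignores gene length and everything else a real pipeline would handle.

```python
def per_million(counts, gene, scale=1_000_000):
    """Expression of `gene` as a share of the sample's total counts,
    scaled to 'per million' so the numbers are easier to read."""
    total = sum(counts.values())
    return counts[gene] * scale / total


# Hypothetical toy read counts, NOT real data.
individual_a = {"GeneX": 100, "Gene1": 400, "Gene2": 500}
individual_b = {"GeneX": 100, "Gene1": 400, "Gene2": 4500}

share_a = per_million(individual_a, "GeneX")
share_b = per_million(individual_b, "GeneX")

print(share_a, share_b)
```

One thing the toy numbers show: Gene X has the same raw count in both individuals, but its share drops 5-fold in individual B only because Gene2 is much higher there. So a few highly expressed genes can shift the value for Gene X, which may be one of the limitations we should be asking about.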
Does anyone know of a good way to approach this question? We naively thought this would be a simple analysis (i.e. just grab the expression values for each individual and compare), but as we learn more it seems more complicated than we initially expected.
We appreciate any insight.