Could you please help me understand where I went wrong on the issue described below?
Suppose I am interested in detecting differential expression (DE) of a fixed transcript X between Tumor and Normal conditions with only one replicate (library) per condition.
Tumor library has 1 million reads, of which 50,000 map to transcript X.
Normal library has 2 million reads, of which 200,000 map to transcript X.
There are two major modeling frameworks.
1) The most conventional one (implemented in edgeR, EBSeq, Cuffdiff, and many others) is to
a) adjust the “raw” count for the library size; b) assume that the adjusted count comes from a distribution with unbounded support (Poisson, Negative Binomial, etc.); and c) fit a regression model in which the conditions are the covariates and the adjusted counts are the response.
However, in this example there is only one adjusted count per condition, so every such model has zero degrees of freedom for the error and no p-values can be produced. In particular, edgeR is pitched as a method for low-replication scenarios, but it still requires at least one condition with two or more replicates.
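To make the arithmetic in Method 1) concrete, here is a minimal sketch using the numbers above. Counts per million is assumed as the library-size adjustment purely for illustration (the tools named above use their own normalization schemes), and the function name is hypothetical.

```python
import math

def counts_per_million(count, library_size):
    """Scale a raw count to a library of one million reads (assumed adjustment)."""
    return count / library_size * 1_000_000

# Numbers from the example above.
tumor_cpm = counts_per_million(50_000, 1_000_000)
normal_cpm = counts_per_million(200_000, 2_000_000)

# Point estimate of the effect: log2 ratio of the adjusted counts.
log2_fold_change = math.log2(tumor_cpm / normal_cpm)

# Two observations (one per library) and two fitted condition means
# leave nothing over for estimating the error variance.
n_observations = 2
n_parameters = 2
residual_df = n_observations - n_parameters

print(tumor_cpm, normal_cpm, log2_fold_change, residual_df)
```

The fold change can be estimated, but with zero residual degrees of freedom there is no variance estimate to test it against.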
2) Assume a Binomial trials scheme for each library. E.g., for the Tumor library there are 1 million trials and 50,000 successes. The null hypothesis says that the probability of success is the same in both libraries. This framework is equivalent to fitting a logistic regression with a single two-level factor.
Most importantly, this model is well replicated: it has as many observations as the total number of reads in the dataset, i.e. 3 million. Each observation equals 1 if the corresponding read maps to transcript X, and 0 otherwise. In Method 1), X has only one observation under the Tumor condition. In Method 2), it has 1 million observations under Tumor.
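For concreteness, Method 2) applied to the numbers above can be sketched as a pooled two-proportion z-test, which is one common way to test equal success probabilities in two Binomial samples. The choice of this particular test and the function name are assumptions made for illustration; a logistic regression with a two-level factor would test the same null hypothesis.

```python
import math

def two_proportion_z_test(successes_1, trials_1, successes_2, trials_2):
    """Pooled two-sample z-test of H0: p1 == p2, treating every read as an
    independent Bernoulli trial (the assumption made in Method 2)."""
    p1 = successes_1 / trials_1
    p2 = successes_2 / trials_2
    # Common success probability estimated under the null hypothesis.
    pooled = (successes_1 + successes_2) / (trials_1 + trials_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_1 + 1 / trials_2))
    z = (p1 - p2) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Tumor: 50,000 of 1,000,000 reads; Normal: 200,000 of 2,000,000 reads.
z, p_value = two_proportion_z_test(50_000, 1_000_000, 200_000, 2_000_000)
print(z, p_value)
```

With 3 million trials the standard error is tiny, so z comes out at roughly -148 and the p-value is far below any conventional threshold.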
When there are a few factors in the model, it should work the same way. Because of low replication, Method 1) will often boil down to n-way ANOVA with one replicate per cell. If we switch to Method 2), each point that was considered a single observation in 1) will expand to as many replicates as there are reads in the corresponding library.
Therefore, I fail to understand why framework 2) is not used everywhere to avoid the replication problem that is so common in RNA-Seq studies. Presumably there is a good reason. If you have an idea, please let me know.
Regards,
Nik