Dear all,
I have been using methylKit for a while on one of my projects, but I suspect it has a serious issue: it generates a lot of false positives.
If I understand correctly, it uses logistic regression, which assumes each C count is independent, and pools the CpG reads from all samples together to test for a group effect at a given CpG site/region. According to the paper, the effective sample size here is not, say, 6 vs. 6 (the number of samples), but the total C counts at that site/region across all samples.
In this case, samples with higher coverage at a given site/region will dominate the hypothesis test, and many of the detected DMRs have extreme beta-values in only one or two samples.
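To make the concern above concrete, here is a small illustration (not methylKit itself, and the counts are made up): when a test treats pooled read counts as independent observations, the effective n is the total coverage, so the same modest difference in methylation proportion becomes "significant" simply by sequencing deeper.

```python
# Illustration only: a 2x2 test on counts pooled across all samples in a
# group, at two coverage depths. The counts are hypothetical.
from scipy.stats import fisher_exact

def pooled_test(meth1, unmeth1, meth2, unmeth2):
    """Two-sided Fisher test on methylated/unmethylated counts pooled per group."""
    _, p = fisher_exact([[meth1, unmeth1], [meth2, unmeth2]])
    return p

# 55% vs. 45% methylation in the two groups, pooled over samples
p_shallow = pooled_test(55, 45, 45, 55)      # ~100 reads per group
p_deep = pooled_test(550, 450, 450, 550)     # same proportions, 10x the depth

print(f"shallow coverage: p = {p_shallow:.3f}")
print(f"deep coverage:    p = {p_deep:.2e}")
# The effect size is identical; only the read depth changed the verdict.
# This is why one or two deeply covered samples can push a site to
# significance even when the other samples disagree.
```

This is the mechanism behind the extreme-beta-value DMRs: the per-read model has no notion of between-sample variability, so coverage stands in for replication.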
Considering that unequal depth across samples is a potential cause, I normalized the samples using "normalizeCoverage", but it didn't solve the problem. In fact, I think this function is of little use, since it has to be applied before "unite". That means that after normalization the total read counts should be roughly equal across samples, but each sample still covers a different set of CpG sites/regions, and "unite" then removes any sites/regions not covered in all samples by default.
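A quick sketch of why median-based scaling (which is roughly what I understand "normalizeCoverage" to do; the sample data and target here are hypothetical) cannot fix this: it equalizes each sample's overall coverage level, but the per-site imbalance between samples survives untouched.

```python
# Hypothetical per-site coverages for two samples; scaling each sample so
# its median coverage matches a common target equalizes the medians but
# leaves individual sites as unbalanced as before.
import statistics

def scale_to_target(coverage, target):
    """Multiply a sample's per-site coverage so its median equals `target`."""
    factor = target / statistics.median(coverage)
    return [c * factor for c in coverage]

sample_a = [10, 100, 20]   # uneven per-site coverage, low median
sample_b = [50, 50, 50]    # flat coverage, higher median

target = statistics.median([statistics.median(sample_a),
                            statistics.median(sample_b)])
a_norm = scale_to_target(sample_a, target)
b_norm = scale_to_target(sample_b, target)

print("medians after scaling:", statistics.median(a_norm), statistics.median(b_norm))
print("site 2 after scaling:", a_norm[1], "vs", b_norm[1])
# The medians now agree, yet at site 2 sample A still has 5x the reads of
# sample B, so a pooled test at that site is still dominated by sample A.
```

So even with perfectly matched overall depth, any single site can still be dominated by whichever sample happens to be deeply covered there.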
I am wondering if anyone has any ideas on how to deal with this problem?