
  • Problem with methylKit: too many false positives?

    Dear all,

    I have been using methylKit for a while for one of my projects, but I feel it has a severe problem: it generates a lot of false positives.

    If I understand correctly, it uses logistic regression, which assumes each C count is independent, pools the CpG reads from all samples together, and tests for a group effect at a given CpG site/region. According to the paper, the sample size here is not, for example, 6 vs. 6 (the number of samples), but the total C counts at this site/region across all samples.

    In this case, samples with higher coverage at a given site/region will dominate the hypothesis test, and many of the detected DMRs have really extreme beta-values in only one or two samples.

    Considering unequal depth across samples as a potential cause, I normalized the samples using "normalizeCoverage", but it didn't solve the problem. Actually, I think this function is of little use since it has to be applied before the "unite" function. That means that although the total reads should be equal across samples after normalization, each sample covers a different number of CpG sites/regions. After "unite", some CpG sites/regions will be removed since, by default, those not covered by all samples are dropped.
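To make the concern concrete, here is a minimal sketch in plain Python. It is not methylKit's code: a two-proportion z-test on pooled counts stands in for the logistic regression, and the counts, function names, and group sizes are all made up for illustration. It shows how a single deep sample can drive both the pooled methylation difference and the p-value.

```python
from math import erfc, sqrt


def pooled_two_proportion_test(group_a, group_b):
    """Pool (methylated, total) counts per group and run a two-sided z-test.

    Every read is treated as an independent observation, which is the
    assumption being questioned in this post.
    """
    meth_a = sum(m for m, _ in group_a)
    total_a = sum(n for _, n in group_a)
    meth_b = sum(m for m, _ in group_b)
    total_b = sum(n for _, n in group_b)

    p_a = meth_a / total_a
    p_b = meth_b / total_b
    p_all = (meth_a + meth_b) / (total_a + total_b)
    se = sqrt(p_all * (1 - p_all) * (1 / total_a + 1 / total_b))
    if se == 0:
        # All reads methylated or all unmethylated: no evidence of a difference.
        return p_a - p_b, 1.0
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return p_a - p_b, p_value


def read_share(group):
    """Fraction of the group's pooled reads contributed by each sample."""
    total = sum(n for _, n in group)
    return [n / total for _, n in group]


# Hypothetical site: five shallow samples plus one deep, extreme sample in
# group A; six shallow samples in group B. Tuples are (methylated, total).
group_a = [(2, 10)] * 5 + [(180, 200)]
group_b = [(2, 10)] * 6

diff, p_value = pooled_two_proportion_test(group_a, group_b)
shares = read_share(group_a)
print(f"pooled difference: {diff:.2f}, p-value: {p_value:.1e}")
print(f"read share of the deepest sample in group A: {max(shares):.0%}")
```

In this toy case five of the six samples in group A look identical to group B, yet the pooled test reports a large, highly significant difference because one sample supplies most of the reads.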

    I am wondering if anyone has any idea how to deal with this problem?

  • #2
    The basic problem you have here is that whatever testing regime you use you will intrinsically have different power to detect differences if you are looking in fixed size windows simply because of the nature of BS-Seq data. The same thing is true in other techniques (eg RNA-Seq performing better on long transcripts), and there's fundamentally nothing you can do about it.

    Things which we've done to try to work around this to some extent:

    1) Work in fixed CpG windows rather than fixed bp. In this scenario you gain equal statistical power by forcing your analysis to put the same number of CpGs in each window. This means that over CpG islands you have relatively high resolution and in intergenic regions it can be very low. This makes some of the stats more comparable, but suffers from the fact that you may well get mixed signals in the longer CpG poor stretches which will dilute any real signal which exists.

    2) Don't solely rely on significance - use absolute filters as well. We take the view that a significant result should be there to lend weight to an absolute effect which has a reasonable magnitude. Taking huge sample sizes and finding that a change in methylation of 0.1% is significant may be a mathematically true result, but is not likely to be biologically relevant. We therefore use a couple of different metrics for measuring methylation over a region and insist on a certain absolute level of change in addition to statistical significance in order to put a region on our hit list. This will mean that we end up with higher false negative rates in our CpG poor regions, but the set of hits which we have should all be interesting.
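The two workarounds above could be sketched roughly as follows. This is only an illustration in plain Python, not the code we actually run: the function names, the dictionary keys (`q`, `diff`), and the thresholds are hypothetical placeholders, and methylation differences are assumed to be expressed as fractions between -1 and 1.

```python
def fixed_cpg_windows(positions, cpgs_per_window):
    """Split CpG positions on one chromosome into windows of equal CpG count.

    Returns (start, end, n_cpgs) tuples. A trailing group with fewer than
    cpgs_per_window CpGs is dropped so every window has the same count.
    """
    if cpgs_per_window < 1:
        raise ValueError("cpgs_per_window must be at least 1")
    ordered = sorted(positions)
    windows = []
    for i in range(0, len(ordered) - cpgs_per_window + 1, cpgs_per_window):
        chunk = ordered[i:i + cpgs_per_window]
        windows.append((chunk[0], chunk[-1], len(chunk)))
    return windows


def filter_hits(regions, max_q=0.01, min_abs_diff=0.25):
    """Keep regions that are both significant and large in absolute effect."""
    return [
        r for r in regions
        if r["q"] < max_q and abs(r["diff"]) >= min_abs_diff
    ]


# Hypothetical CpG positions: a dense island followed by a sparse stretch.
positions = [100, 105, 110, 118, 125, 130,
             5000, 9000, 15000, 22000, 30000, 41000]
windows = fixed_cpg_windows(positions, cpgs_per_window=3)
for start, end, n in windows:
    print(f"{start}-{end}: {n} CpGs, {end - start + 1} bp")

# Hypothetical per-region results: q-value and methylation difference.
regions = [
    {"name": "r1", "q": 0.0001, "diff": 0.001},  # significant but tiny
    {"name": "r2", "q": 0.0020, "diff": -0.40},  # significant and large
    {"name": "r3", "q": 0.2000, "diff": 0.50},   # large but not significant
]
hits = filter_hits(regions)
print([r["name"] for r in hits])
```

The windows over the dense stretch span a few tens of bp while those over the sparse stretch span tens of kb, which is the resolution trade-off described in point 1. Only the region that passes both criteria survives the filter in point 2.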

    • #3
      Hi Simon,

      Thanks for the reply.

      First some comments to your suggestions:
      1. Fixed CpG windows sound OK, but still a little arbitrary. I am wondering if there are ways to determine the boundaries of a DMR as well, since CpGs within one window do not necessarily change in the same direction (hyper or hypo).

      2. Yes, I use the default criteria in methylKit when declaring differentially methylated sites/regions (q-value<0.01 and methy.diff>0.25). Here the q-value will actually be over-significant because each C count is assumed to be an independent sample. The methy.diff is easily dominated by a few extreme samples that have a large number of either methylated or unmethylated counts.

      Do you mean I can calculate the beta-value for each biological sample and then filter on the difference in mean beta-value between the two groups?
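Something like the following is what I have in mind. It is a rough sketch in plain Python rather than anything from methylKit; the function names, the minimum-coverage cutoff, and the counts are made up for illustration. Each biological sample contributes one beta-value, so a deep sample gets the same weight as a shallow one.

```python
from statistics import mean


def sample_betas(group, min_coverage=10):
    """Per-sample beta-values (methylated / total) for one site or region.

    Samples below min_coverage are skipped because their beta-values are
    too noisy to be informative; the cutoff is an illustrative assumption.
    """
    return [m / n for m, n in group if n >= min_coverage]


def mean_beta_difference(group_a, group_b, min_coverage=10):
    """Difference of group mean beta-values, one value per sample."""
    betas_a = sample_betas(group_a, min_coverage)
    betas_b = sample_betas(group_b, min_coverage)
    if not betas_a or not betas_b:
        raise ValueError("no sample passes the coverage cutoff in one group")
    return mean(betas_a) - mean(betas_b)


# Hypothetical counts as (methylated, total): one deep, extreme sample in
# group A, all other samples shallow and similar.
group_a = [(2, 10)] * 5 + [(180, 200)]
group_b = [(2, 10)] * 6

unweighted_diff = mean_beta_difference(group_a, group_b)
print(f"difference of mean beta-values: {unweighted_diff:.3f}")
```

With these counts the difference of mean beta-values stays well below a 0.25 cutoff, so the site would be filtered out even though one sample is extreme.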

