  • Problem with methylKit: too many false positives?

    Dear all,

    I have been using methylKit for a while on one of my projects, but I think it has a serious problem: it generates a lot of false positives.

    If I understand correctly, it uses logistic regression, which assumes each C count is independent: CpG reads from all samples are pooled and tested for a group effect at a given CpG site/region. According to the paper, the sample size here is then not 6 vs. 6 (the number of samples), for example, but the total C count at that site/region across all samples.

    In this case, samples with higher coverage at a given site/region will dominate the hypothesis test, and many of the detected DMRs have really extreme beta-values in only one or two samples.

    Considering unequal depth across samples as a potential cause, I normalized the samples using "normalizeCoverage", but it didn't solve the problem. In fact, I think this function is of little use, since it has to be applied before the "unite" function. That means that after normalization, although the total reads should be equal across samples, each sample still covers a different number of CpG sites/regions. After "unite", some CpG sites/regions are removed because, by default, they have to be covered in all samples.

    Does anyone have any idea how to deal with this problem?
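To make the concern concrete, here is a minimal sketch in Python (not methylKit code, and not its actual logistic regression). It uses a plain two-proportion z-test on pooled counts as a simplified stand-in for any test that treats every read as an independent observation. The function name and the toy counts are made up for illustration.

```python
import math

def pooled_two_proportion_test(group_a, group_b):
    """Two-proportion z-test on read counts pooled across samples.

    Each group is a list of (methylated_reads, total_reads), one tuple
    per sample. Pooling makes the effective sample size the total read
    count, not the number of biological samples.
    """
    meth_a = sum(m for m, _ in group_a)
    cov_a = sum(n for _, n in group_a)
    meth_b = sum(m for m, _ in group_b)
    cov_b = sum(n for _, n in group_b)

    p_a = meth_a / cov_a
    p_b = meth_b / cov_b
    p_all = (meth_a + meth_b) / (cov_a + cov_b)

    # Standard error under the null of one shared methylation level.
    se = math.sqrt(p_all * (1 - p_all) * (1 / cov_a + 1 / cov_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_a - p_b, p_value

# Toy data (hypothetical): 6 vs. 6 samples at one CpG site. Eleven samples
# sit at 20% methylation with 10x coverage; one sample in group A has
# 500x coverage and 90% methylation.
group_a = [(2, 10)] * 5 + [(450, 500)]
group_b = [(2, 10)] * 6

pooled_diff, pooled_p = pooled_two_proportion_test(group_a, group_b)
per_sample_a = [m / n for m, n in group_a]

print("pooled methylation difference:", round(pooled_diff, 3))
print("pooled p-value:", pooled_p)
print("per-sample beta-values in group A:", per_sample_a)
```

In this toy case the pooled difference is above 0.6 and the p-value is vanishingly small, even though five of the six samples in group A look exactly like group B.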

  • #2
    The basic problem here is that whatever testing regime you use, you will intrinsically have different power to detect differences if you are looking in fixed-size windows, simply because of the nature of BS-Seq data. The same is true of other techniques (e.g. RNA-Seq performing better on long transcripts), and there's fundamentally nothing you can do about it.

    Things which we've done to try to work around this to some extent:

    1) Work in fixed CpG windows rather than fixed bp. Forcing the analysis to put the same number of CpGs in each window gives you roughly equal statistical power per window. It means that over CpG islands you have relatively high resolution, while in intergenic regions resolution can be very low. This makes some of the stats more comparable, but in the longer CpG-poor stretches you may well get mixed signals, which will dilute any real signal that exists.

    2) Don't rely solely on significance; use absolute filters as well. We take the view that a significant result should be there to lend weight to an absolute effect which has a reasonable magnitude. Taking huge sample sizes and finding that a change in methylation of 0.1% is significant may be a mathematically true result, but it is not likely to be biologically relevant. We therefore use a couple of different metrics for measuring methylation over a region and insist on a certain absolute level of change, in addition to statistical significance, in order to put a region on our hit list. This means that we end up with higher false negative rates in our CpG-poor regions, but the set of hits we do have should all be interesting.
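The two workarounds above could be sketched roughly as follows. This is an illustrative Python outline, not the pipeline described in the post: the function names, the dictionary keys, and the cutoffs (`max_q=0.01`, `min_abs_diff=0.25`) are all placeholders, and the positions are assumed to come from a single chromosome.

```python
def fixed_cpg_windows(positions, cpgs_per_window):
    """Split CpG positions into windows holding the same number of CpGs.

    Returns (start, end, length_bp) per window. A trailing group with
    fewer than cpgs_per_window CpGs is dropped.
    """
    positions = sorted(positions)
    windows = []
    last_start = len(positions) - cpgs_per_window
    for i in range(0, last_start + 1, cpgs_per_window):
        chunk = positions[i:i + cpgs_per_window]
        windows.append((chunk[0], chunk[-1], chunk[-1] - chunk[0] + 1))
    return windows

def select_hits(regions, max_q=0.01, min_abs_diff=0.25):
    """Keep regions that are significant AND show a large absolute change.

    Each region is a dict with a q-value under "q" and a methylation
    difference (as a fraction between -1 and 1) under "diff".
    """
    return [r for r in regions
            if r["q"] <= max_q and abs(r["diff"]) >= min_abs_diff]

# Four CpGs packed into a CpG island, then four spread over intergenic DNA.
cpg_positions = [100, 105, 110, 118, 5000, 9000, 15000, 22000]
windows = fixed_cpg_windows(cpg_positions, cpgs_per_window=4)
print(windows)  # same CpG count per window, very different lengths in bp

regions = [
    {"id": "w1", "q": 1e-8, "diff": 0.001},  # significant but tiny effect
    {"id": "w2", "q": 0.001, "diff": -0.40},  # significant and large
    {"id": "w3", "q": 0.20, "diff": 0.50},    # large but not significant
]
hits = select_hits(regions)
print([r["id"] for r in hits])
```

The first window spans 19 bp and the second about 17 kb, which is the resolution trade-off described in point 1; only the region passing both filters survives in point 2.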

    • #3
      Hi Simon,

      Thanks for the reply.

      First, some comments on your suggestions:
      1. Fixed CpG windows sound OK, but they still seem a little arbitrary. I am wondering whether there are ways to determine the boundaries of a DMR as well, since the CpGs within one window do not necessarily change methylation in the same direction (hyper or hypo).

      2. Yes, I use the default criteria in methylKit when calling differentially methylated sites/regions (q-value < 0.01 and methylation difference > 25%). Here the q-value will actually be over-significant, because each C count is assumed to be an independent sample. The methylation difference will easily be dominated by a few extreme samples that have a large number of either methylated or unmethylated counts.

      Do you mean I can calculate the beta-value for each biological sample and then filter on the difference in mean beta-value between the two groups?
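One possible reading of that idea, sketched in Python under stated assumptions: compute one beta-value per biological sample, take the difference of group means, and, optionally, judge it with an exact permutation test in which each sample counts as a single observation. The permutation test is an addition for illustration and is not something proposed in the thread; all names and counts are hypothetical.

```python
from itertools import combinations
from statistics import mean

def sample_betas(counts):
    """Per-sample beta-values from (methylated_reads, total_reads) tuples."""
    return [m / n for m, n in counts]

def permutation_test(betas_a, betas_b):
    """Exact two-sided permutation test on the difference of group means.

    Every biological sample is one observation, so coverage cannot
    inflate the effective sample size. Returns (observed_diff, p_value).
    """
    observed = mean(betas_a) - mean(betas_b)
    pooled = list(betas_a) + list(betas_b)
    n_a, n_all = len(betas_a), len(pooled)
    total = sum(pooled)
    hits = 0
    n_splits = 0
    for idx in combinations(range(n_all), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / (n_all - n_a)
        # Small tolerance so ties are not lost to floating-point noise.
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
        n_splits += 1
    return observed, hits / n_splits

# One extreme, deeply covered sample among otherwise identical samples.
outlier_a = sample_betas([(2, 10)] * 5 + [(450, 500)])
outlier_b = sample_betas([(2, 10)] * 6)
diff_outlier, p_outlier = permutation_test(outlier_a, outlier_b)

# A difference that is consistent across all six samples per group.
consistent_a = [0.80, 0.85, 0.90, 0.75, 0.80, 0.90]
consistent_b = [0.20, 0.25, 0.10, 0.30, 0.20, 0.15]
diff_consistent, p_consistent = permutation_test(consistent_a, consistent_b)

print(round(diff_outlier, 3), p_outlier)
print(round(diff_consistent, 3), round(p_consistent, 4))
```

With 6 vs. 6 samples there are only 924 possible splits, so the smallest achievable two-sided p-value is 2/924. A site driven by a single extreme sample cannot reach significance this way, while a shift shared by every sample can.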

      (In reply to simonandrews, #2.)
