
  • Questions: Simulation statistics (genomic overlaps)

    Dear forum,
    I have some questions that are likely very basic for many of you but I am hoping for answers anyways.

    I'm doing some overlap (intersection) analysis between a dataset of genomic regions (it could be SNPs or whatever) (Dataset A) and ChIP-seq enriched regions (Dataset B), which constitute <3% of the genome.

    So with my 'true' Dataset A I get, let's say, 800 overlaps out of 2000.

    Then I generate size-matched 'random regions', without replacement, of the same size and number as the ChIP dataset; e.g. I can do this 500 times with BEDtools.

    Then the average overlap approaches what would be expected on average given the genome size.

    So Dataset A appears to overlap Dataset B far more than expected by chance.

    But how do I show this statistically without using simple Fisher or Chi Square tests?

    I guess from a logical perspective I should somehow take each random simulation into consideration and combine the P-values into one? E.g. a fraction of the simulations could show that the overlap of Dataset A with random regions is 1900 of 2000, another fraction could show 5 of 2000, and yet the majority could show ~10 of 2000 or whatever (the numbers are all arbitrary).

    I can see that in some papers people use Monte Carlo simulations. I guess what I'm doing is Monte Carlo without the statistical part of it?

    Can anyone explain how to compare my true Dataset A overlap with Dataset B to my randomized overlaps of Dataset A to 500 simulations of Dataset B and get a P-value?

  • #2
    The p-value is the fraction of simulations with equal or more overlap. So if your true dataset had 800 overlaps and only 2 out of 500 simulations had that many or more overlaps, then the p-value would be 0.004. Similarly:

    Code:
    table(rnorm(100000) < qnorm(0.05))
    will give a "TRUE"/("TRUE" + "FALSE") ratio of ~0.05, which is good since that's the p-value I used.

    Increasing the number of simulations will increase the precision (but not the accuracy) of the resulting p-value. In an ideal simulation, you'd try to randomly select regions with similar mappability and probably similar GC characteristics, though perhaps bedtools is doing that (I've never tried to use it for that, so I don't know).
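
As a minimal sketch of the calculation described above, assuming the overlap count from each randomization has already been collected into a vector: the function name `empirical_p` and the counts below are made up for illustration and are not real BEDtools output. The add-one variant is a commonly used alternative that counts the observed data as one of the randomizations, so the estimate can never be exactly zero.

```r
# Hypothetical helper: fraction of simulations with equal or more overlap
# than the observed dataset.
empirical_p <- function(observed, simulated) {
  mean(simulated >= observed)
}

# Add-one variant: (r + 1) / (n + 1), never exactly zero.
empirical_p_adj <- function(observed, simulated) {
  (sum(simulated >= observed) + 1) / (length(simulated) + 1)
}

# Made-up placeholder counts: 498 simulations with 60 overlaps and
# 2 simulations that reach or exceed the observed 800.
simulated <- c(rep(60, 498), 800, 950)
observed  <- 800

p_raw <- empirical_p(observed, simulated)      # 2 of 500 = 0.004
p_adj <- empirical_p_adj(observed, simulated)  # 3 of 501

# Smallest non-zero value p_raw can take with this many simulations.
p_floor <- 1 / length(simulated)

print(c(p_raw = p_raw, p_adj = p_adj, p_floor = p_floor))
```

With 500 simulations the raw estimate can only move in steps of 1/500, so if no simulation reaches the observed overlap, all that can be reported is p < 0.002.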

    • #3
      Hej dpryan,

      And thanks for your answer. Yes, the randomized regions are "close to" the original ChIP dataset in terms of "overall" location and exclusion of unmappable (random) genome sequence, and the numbers mentioned were just arbitrary. However, I guess that literally "true" random simulation conditions are difficult to obtain, also due to unknown biases.

      The true numbers are like this (just an example from a single analysis)

      I have a Dataset A of 3500 points, of which 1100 overlap a ChIP-seq peak dataset from ENCODE with a mean peak size of ~1000 bp and 40,000 features. This would roughly correspond to ~5% of the part of the genome I'm analyzing. In this exact scenario I would randomly select 40,000 regions of 1000 bp and do the same overlap analysis, e.g. 100 or 500 times or whatever. The average overlap then approaches what would be expected given the number of points in A and the fraction of the genome that the random regions constitute.

      I tried Fisher's testing for contingency tables and it gave P-values in the order of E-350.

      The problem is I'm no statistician, and I guess this kind of statistics is too simple?!
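
The expected-overlap reasoning in this post can be written out as a short R sketch. It assumes, as a simplification that is not stated in the thread, that each of the 3500 points falls inside a peak independently with probability equal to the ~5% coverage, which makes the overlap count binomial. The variable names are illustrative only.

```r
# Numbers quoted in the post.
n_points   <- 3500    # points in Dataset A
n_overlap  <- 1100    # points overlapping a ChIP-seq peak
n_peaks    <- 40000   # number of peaks
peak_size  <- 1000    # mean peak size in bp
coverage   <- 0.05    # approximate fraction of the analyzed genome in peaks

# Total sequence covered by peaks: 40,000 x 1000 bp = 40 Mb.
peak_bp <- n_peaks * peak_size

# Expected overlaps if points were placed uniformly at random.
expected <- n_points * coverage      # 175
fold     <- n_overlap / expected     # observed / expected

# Binomial upper tail P(X >= 1100), computed on the log scale because the
# probability is far too small to represent as an ordinary double.
log_p   <- pbinom(n_overlap - 1, size = n_points, prob = coverage,
                  lower.tail = FALSE, log.p = TRUE)
log10_p <- log_p / log(10)

print(c(expected = expected, fold = fold, log10_p = log10_p))
```

An analytic tail probability like this one, or the Fisher result above, only reflects the uniform-placement assumption. The randomization approach is useful precisely because the null regions can be made to respect mappability and other biases.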

      • #4
        you might be interested to see this tool

        • #5
          I have one more question I hope some of you can answer.

          E.g. if I compare Group A and Group B and the difference is X, then I compare Group A and C and the difference is Y, and then I compare Group B and C and the difference is Z.

          E.g.
          Group A: ABCDEF LMN
          Group B: ABCDEFGHIJK
          Group C: ABCDEF OPQRSTUVWXYZ

          So there are more differences between Group C vs. Group A and B, compared to differences between Group A and B.

          I don't know if this is too trivial again, but I hope someone can point me in the right direction.
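
One possible reading of the example groups above is as sets of elements, with the "difference" between two groups being the number of elements found in only one of them (the symmetric difference). The sketch below makes that assumption; the helper `sym_diff` is hypothetical and the thread does not say which measure is intended.

```r
# The three example groups, split into single-letter elements.
groups <- list(
  A = strsplit("ABCDEFLMN", "")[[1]],
  B = strsplit("ABCDEFGHIJK", "")[[1]],
  C = strsplit("ABCDEFOPQRSTUVWXYZ", "")[[1]]
)

# Hypothetical helper: elements present in exactly one of the two sets.
sym_diff <- function(x, y) union(setdiff(x, y), setdiff(y, x))

# All pairwise comparisons: A-B, A-C, B-C.
pairs  <- combn(names(groups), 2)
n_diff <- apply(pairs, 2, function(p) {
  length(sym_diff(groups[[p[1]]], groups[[p[2]]]))
})
names(n_diff) <- apply(pairs, 2, paste, collapse = "-")

# Elements shared by all three groups.
shared <- Reduce(intersect, groups)

print(n_diff)
print(shared)
```

Under this reading the pairwise counts are 8 for A-B, 15 for A-C and 17 for B-C, which matches the observation that Group C differs more from A and B than A and B differ from each other.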
