Dear forum,
I have some questions that are likely very basic for many of you, but I am hoping for answers anyway.
I'm doing some overlap (intersection) analysis between a dataset of genomic regions (it could be SNPs or whatever) (Dataset A) and ChIP-seq enriched regions (Dataset B), which constitute <3% of the genome.
So with my 'true' Dataset A I get, let's say, 800 overlaps out of 2000.
Then I generate size-matched 'random regions' without replacement, of the same size and number as the ChIP dataset; e.g. I can do this 500 times with BEDtools.
The average overlap across these randomizations then approaches what would be expected by chance given the genome size.
So Dataset A appears to overlap Dataset B significantly.
But how do I show this statistically without using simple Fisher or chi-square tests?
I guess, from a logical perspective, I should somehow take each random simulation into consideration and combine the P-values into one? E.g. a fraction of simulations could show that the overlap of Dataset A with random regions is 1900 of 2000, while another fraction could show 5 of 2000, and yet the majority could show ~10 of 2000 or whatever (the numbers are all arbitrary).
I can see that in some papers people use Monte Carlo simulations. I guess what I'm doing is Monte Carlo without the statistical part of it?
Can anyone explain how to compare the overlap of my true Dataset A with Dataset B against the overlaps of Dataset A with the 500 simulations of Dataset B, and get a P-value?
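
To make the question concrete, here is a minimal sketch of the kind of comparison I have in mind, written as an empirical (Monte Carlo) P-value: count how many of the randomized overlaps are at least as large as the observed one, and add 1 to both numerator and denominator so the P-value can never be exactly zero. This is only an illustration under assumptions. The function names are made up, the 500 simulated counts are generated from a 3% chance per region instead of coming from real BEDtools runs, and 800 of 2000 is the arbitrary number from above.

```python
import random


def empirical_p_value(observed, simulated):
    """One-sided Monte Carlo P-value: (r + 1) / (n + 1).

    r is the number of simulated overlap counts >= the observed count,
    n is the number of simulations.
    """
    n = len(simulated)
    r = sum(1 for s in simulated if s >= observed)
    return (r + 1) / (n + 1)


def z_score(observed, simulated):
    """Distance of the observed count from the simulated mean, in sample SDs."""
    n = len(simulated)
    mean = sum(simulated) / n
    var = sum((s - mean) ** 2 for s in simulated) / (n - 1)
    return (observed - mean) / var ** 0.5


# Hypothetical stand-in for 500 randomizations: each of the 2000 regions
# in Dataset A overlaps a random region set with probability 0.03.
rng = random.Random(0)
simulated_overlaps = [
    sum(rng.random() < 0.03 for _ in range(2000)) for _ in range(500)
]

observed_overlap = 800  # arbitrary example value from the post

p = empirical_p_value(observed_overlap, simulated_overlaps)
z = z_score(observed_overlap, simulated_overlaps)
print(f"empirical P = {p:.4f}, z = {z:.1f}")
```

With 500 simulations the smallest P-value this can report is 1/501, so if none of the randomizations reaches the observed overlap, the honest statement is P < 0.002 rather than P = 0; more simulations would be needed to claim anything smaller.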