
**dpryan** · 02-27-2014, 11:59 AM

The p-value is the fraction of simulations with equal or more overlap. So if your true dataset had 800 overlaps and only 2 out of 500 simulations had that many or more overlaps, then the p-value would be 2/500 = 0.004. Similarly:

Code:

table(rnorm(100000) < qnorm(0.05))

will give a TRUE/(TRUE+FALSE) ratio of ~0.05, which is what you'd expect, since 0.05 is the p-value threshold I passed to qnorm.
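As a rough sketch of the overlap example (800 observed overlaps, 500 simulations), the same fraction can be computed directly from a vector of simulated overlap counts. The counts here are invented with rpois() purely for illustration, and the names observed and sim_overlaps are hypothetical. The second calculation is a commonly used variant that counts the observed value as one extra draw, so the estimate can never be exactly zero.

```r
# Hypothetical simulated overlap counts: 500 shuffles, invented with rpois()
set.seed(42)
observed     <- 800
sim_overlaps <- rpois(500, lambda = 750)

# Plain fraction of simulations with equal or more overlap
p_plain <- mean(sim_overlaps >= observed)

# Variant that adds the observed value to both numerator and denominator
p_plus1 <- (sum(sim_overlaps >= observed) + 1) / (length(sim_overlaps) + 1)

p_plain
p_plus1
```

With real data, sim_overlaps would hold the overlap count from each shuffled region set instead of random draws.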

Increasing the number of simulations will increase the precision (but not accuracy) of the resulting p-value. In an ideal simulation, you'd try to randomly select regions with similar mappability and probably GC characteristics, though perhaps bedtools is doing that (I've never tried to use it for that so I don't know).
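To illustrate the precision point, here is a small sketch that reuses the rnorm/qnorm toy example, where the true value is 0.05 by construction. It repeats the whole estimate many times at two simulation sizes (500 and 5000, both arbitrary choices) and compares the spread. It says nothing about accuracy: a background model that is biased, for instance by ignoring mappability, stays biased no matter how many simulations are run.

```r
set.seed(1)

# One estimate of the p-value from n simulated draws (true value is 0.05)
estimate_p <- function(n) mean(rnorm(n) < qnorm(0.05))

# Repeat the whole experiment 200 times at each simulation size
est_500  <- replicate(200, estimate_p(500))
est_5000 <- replicate(200, estimate_p(5000))

# The spread of the estimates shrinks roughly as 1/sqrt(n)
sd(est_500)
sd(est_5000)

# Binomial standard error for comparison
sqrt(0.05 * 0.95 / c(500, 5000))
```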

**puggie** · 02-27-2014, 12:25 PM

Hi dpryan,

Thanks for your answer. Yes, the randomized regions are "close to" the original ChIP dataset in terms of overall location and exclusion of unmappable (random) genome sequence, and the numbers I mentioned were just arbitrary. However, I guess that truly random simulation conditions are difficult to obtain, partly because of unknown biases.

The true numbers look like this (just an example from a single analysis):

I have a Dataset A of 3500 points, of which 1100 overlap a ChIP-seq peak dataset from ENCODE with 40,000 features and a mean peak size of ~1000 bp. The peaks would then cover roughly 5% of the part of the genome I'm analyzing. In this exact scenario I would randomly select 40,000 regions of 1000 bp and do the same overlap analysis, e.g. 100 or 500 times. The simulated overlap counts then approach the average that would be expected given the number of points in A and the fraction of the genome that the random regions constitute.
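A minimal sketch of that procedure in base R, under loud assumptions: the 800 Mb analysable genome size is back-calculated from the ~5% figure, the point positions are invented, the "genome" is a single coordinate range, and regions are placed uniformly with no mappability or GC matching. The helper count_overlaps is hypothetical, and this is not how bedtools does it.

```r
set.seed(1)
genome_size  <- 8e8    # assumed: 40,000 * 1000 bp is ~5% of this
n_points     <- 3500
n_regions    <- 40000
region_width <- 1000
observed     <- 1100
n_sim        <- 100

# Invented point positions standing in for Dataset A
points <- sort(sample.int(genome_size, n_points))

# Count points that fall inside at least one region [start, start + width).
# Because all regions have the same width, it is enough to check the
# nearest region start at or to the left of each point.
count_overlaps <- function(points, starts, width) {
  starts <- sort(starts)
  idx <- findInterval(points, starts)
  hit <- idx > 0
  hit[hit] <- (points[hit] - starts[idx[hit]]) < width
  sum(hit)
}

# Overlap count for each set of randomly placed regions
sim <- replicate(n_sim, {
  starts <- sample.int(genome_size - region_width, n_regions, replace = TRUE)
  count_overlaps(points, starts, region_width)
})

mean(sim)                                        # average overlap under the null
p_emp <- (sum(sim >= observed) + 1) / (n_sim + 1)
p_emp
```

When no simulation reaches the observed count, the estimate bottoms out at 1/(n_sim + 1), so the number of simulations sets the smallest p-value that can be reported.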

I tried Fisher's exact test on contingency tables, and it gave p-values on the order of 1e-350.
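For a sense of where such extreme values come from, here is a hedged sketch that uses a binomial tail probability rather than the Fisher table (whose layout isn't given in the post, so the result will not match the figure quoted). It assumes each of the 3500 points lands in a peak independently with probability 0.05. That independence assumption is exactly what a simulation with realistic background regions is meant to relax. The tail probability is too small for double precision, so it is summed on the log scale.

```r
n_points <- 3500
observed <- 1100
p_cover  <- 0.05     # assumed fraction of the analysed genome covered by peaks

expected <- n_points * p_cover   # expected overlaps under this simple model

# log P(X = k) for k = observed, ..., n_points, then a log-sum-exp
k         <- observed:n_points
log_terms <- dbinom(k, n_points, p_cover, log = TRUE)
m         <- max(log_terms)
log_p     <- m + log(sum(exp(log_terms - m)))

log10_p <- log_p / log(10)   # log10 of P(X >= observed)
log10_p
exp(log_p)                   # underflows to 0 in double precision
```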

The problem is I'm no statistician, and I guess this kind of statistics is too simple?!

**crazyhottommy** · 02-27-2014, 07:36 PM

You might be interested in this tool:

http://www.cgat.org/~andreas/documentation/gat/tutorialIntervalOverlap.html

**puggie** · 03-05-2014, 09:04 AM

I have one more question I hope some of you can answer.

Say I compare Group A and Group B and the difference is X, then I compare Group A and C and the difference is Y, and then I compare Group B and C and the difference is Z.

E.g.
Group A: ABCDEF LMN
Group B: ABCDEFGHIJK
Group C: ABCDEF OPQRSTUVWXYZ

So Group C differs more from Groups A and B than Groups A and B differ from each other.
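One way to put numbers on X, Y and Z for the toy groups is to count the members that appear in only one of the two groups being compared, and optionally the Jaccard similarity (shared members divided by all members). This is only a sketch of one possible difference measure. The helper names sym_diff and jaccard are made up here, and whether either measure suits the real data is an open question.

```r
# The three example groups, one letter per member
groups <- list(
  A = strsplit("ABCDEFLMN", "")[[1]],
  B = strsplit("ABCDEFGHIJK", "")[[1]],
  C = strsplit("ABCDEFOPQRSTUVWXYZ", "")[[1]]
)

# Number of members found in exactly one of the two groups
sym_diff <- function(x, y) length(union(setdiff(x, y), setdiff(y, x)))

# Shared members as a fraction of all members in either group
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# All pairwise comparisons: A vs B, A vs C, B vs C
pairs <- combn(names(groups), 2)
result <- data.frame(
  pair      = apply(pairs, 2, paste, collapse = " vs "),
  differing = apply(pairs, 2, function(p) sym_diff(groups[[p[1]]], groups[[p[2]]])),
  jaccard   = apply(pairs, 2, function(p) jaccard(groups[[p[1]]], groups[[p[2]]]))
)
print(result)
```

For these letters the differing counts are 8 (A vs B), 15 (A vs C) and 17 (B vs C), which matches the observation that C stands apart from both A and B.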

I don't know if this is too trivial again, but I hope someone can point me in the right direction.

Questions: Simulation statistics (genomic overlaps)
