I would like to train a classification model to predict the presence or absence of certain genomic features, using narrowPeak peak calls (BED files) for several epigenetic modifications (mostly histone modifications from the Roadmap Epigenomics Project) as predictors.
The genomic features vary in length (usually around 200 bp). The peaks also vary in length and abundance.
I would appreciate suggestions on the best way to use the peak-call data as predictors for the model.
For instance, for a given epigenetic modification I could:
1-Use whether the genomic feature intersects any peak of the epigenetic mark as a Yes/No categorical value.
2-Use the signalValue, pValue or qValue of the intersected peak as a numeric variable.
3-Use some measure of relative overlap between the feature and the peaks.
Should I worry about calculating some kind of enrichment score for cases where the peaks are smaller than the features, so that a single feature may overlap several peaks? If so, what is the best way (in your opinion) to proceed?
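To make the options above concrete, here is a minimal Python sketch of how one feature could be encoded against the peaks of a single mark. Everything in it is an assumption made for illustration: the function name `encode_feature`, the tuple layouts, the choice of the maximum signalValue as the summary when several peaks are hit, and the covered fraction as the "relative overlap". Coordinates are assumed to be BED-style (0-based, half-open).

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]       # (chrom, start, end), half-open
Peak = Tuple[str, int, int, float]    # (chrom, start, end, signalValue)


def encode_feature(feature: Interval, peaks: List[Peak]) -> Dict[str, float]:
    """Summarise the peaks of one mark that overlap one genomic feature."""
    chrom, f_start, f_end = feature

    # Clip every overlapping peak to the feature boundaries.
    hits = []
    for p_chrom, p_start, p_end, signal in peaks:
        if p_chrom != chrom:
            continue
        lo, hi = max(f_start, p_start), min(f_end, p_end)
        if lo < hi:  # at least 1 bp of overlap
            hits.append((lo, hi, signal))

    # Merge the clipped segments so overlapping peaks are not counted twice.
    covered = 0
    cur_lo = cur_hi = None
    for lo, hi, _ in sorted(hits):
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                covered += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        covered += cur_hi - cur_lo

    length = f_end - f_start
    return {
        "overlaps": int(bool(hits)),                                  # option 1
        "max_signal": max((s for _, _, s in hits), default=0.0),      # option 2
        "covered_fraction": covered / length if length > 0 else 0.0,  # option 3
        "n_peaks": len(hits),                                         # several small peaks
    }


feature = ("chr1", 100, 300)
peaks = [
    ("chr1", 50, 150, 5.0),
    ("chr1", 140, 200, 8.0),
    ("chr1", 250, 260, 2.0),
    ("chr2", 100, 300, 9.0),   # different chromosome, ignored
]
print(encode_feature(feature, peaks))
```

Whether the maximum, the mean, or an overlap-weighted sum of signalValue is the right summary when a feature overlaps several peaks is exactly the part I am unsure about; the sketch just picks the maximum.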