My ChIP-Seq data has a relatively high proportion (40-85%) of duplicate reads. This presents problems for peak calling.
Approach 1. Using MACS, I can analyze the data and choose to count duplicate reads as a single read. This produces approx. 5000 peaks and the data look reasonable; however, some important known occupancy regions are absent.
Approach 2. Alternatively, I can use MACS and specify 'KEEP ALL' for the duplicates. Naturally this means the peaks from the first approach become dwarfed by the large duplicate peaks, so a completely different set of peaks is identified.
This approach produces approx. 100,000 peaks (depending on other settings), including the important occupancy region we are expecting, which also has a very high MACS score (approx. 75th percentile within this dataset).
What would be the best way to proceed?
Would it be reasonable to filter the 100,000+ peak data with a MACS score threshold? Would there be a statistically valid (i.e. non-arbitrary) way to do this, e.g. converting the MACS scores to p-values and using an FDR cut-off?
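To make the FDR idea concrete, here is a minimal sketch of what I have in mind. It assumes the MACS score is -10*log10(p-value), as in the MACS 1.4 peak output, and applies a standard Benjamini-Hochberg adjustment. The function names and the (chrom, start, end, score) tuple layout are my own for illustration, not anything MACS provides. Whether Benjamini-Hochberg is appropriate when only the called peaks are tested is part of what I am asking.

```python
def scores_to_pvalues(scores):
    """Convert scores of the form -10*log10(p) back to p-values (assumed score definition)."""
    return [10 ** (-s / 10.0) for s in scores]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that
    # adjusted values never increase as the p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def filter_peaks(peaks, fdr=0.05):
    """peaks: list of (chrom, start, end, score) tuples. Keep peaks with q <= fdr."""
    pvals = scores_to_pvalues([p[3] for p in peaks])
    qvals = benjamini_hochberg(pvals)
    return [p for p, q in zip(peaks, qvals) if q <= fdr]
```

With this, the cut-off would be set by the chosen FDR level rather than by an arbitrary score threshold.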
Would it be reasonable to merge the two datasets, i.e. the peaks identified with the duplicate reads and those identified without?
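By "merge" I mean something like the following rough sketch, which takes the union of the two peak lists and collapses overlapping or touching intervals on the same chromosome. The (chrom, start, end) tuple layout and the function name are assumptions for illustration only, and scores are ignored here.

```python
def merge_peak_sets(peaks_a, peaks_b):
    """Union of two peak lists of (chrom, start, end) tuples.

    Intervals on the same chromosome that overlap or touch are merged into one.
    """
    merged = []
    for chrom, start, end in sorted(list(peaks_a) + list(peaks_b)):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval if this one reaches further.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(p) for p in merged]
```

For example, merging [("chr1", 100, 200)] with [("chr1", 150, 250)] would give a single peak spanning 100 to 250 on chr1.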