When I do a normal alignment of paired-end ChIP-seq reads to the full mouse genome, I identify PCR duplicates by looking for fragments that have identical coordinates and identical sequences for both the forward and reverse reads. Now I'm looking for repetitive sequences by aligning to an index made up of all sequences from the UCSC RepeatMasker track, and I'm not sure how, or whether, I should be removing duplicates. Reads from repetitive sequences seem more likely to align to the same positions and still be legitimate, so I don't think a blanket removal of all these duplicates is the best option.
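For concreteness, here is a minimal Python sketch of the duplicate rule I described (same coordinates and same sequences for both mates). The `Fragment` type and the function names are made up for illustration and are not from any particular tool; a real pipeline would read these fields from the alignment file.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """One aligned read pair (hypothetical record, for illustration only)."""
    ref: str      # reference name (chromosome, or an entry in the repeat index)
    start: int    # leftmost aligned position of the fragment
    end: int      # rightmost aligned position of the fragment
    fwd_seq: str  # forward read sequence
    rev_seq: str  # reverse read sequence


def split_duplicates(fragments: List[Fragment]) -> Tuple[List[Fragment], List[Fragment]]:
    """Keep the first fragment seen for each (coordinates, sequences) key.

    Any later fragment with the same reference, start, end and both read
    sequences is treated as a duplicate.
    """
    seen = set()
    unique, duplicates = [], []
    for frag in fragments:
        if frag in seen:
            duplicates.append(frag)
        else:
            seen.add(frag)
            unique.append(frag)
    return unique, duplicates


def duplication_rate(fragments: List[Fragment]) -> float:
    """Fraction of fragments flagged as duplicates (0.0 for empty input)."""
    if not fragments:
        return 0.0
    _, duplicates = split_duplicates(fragments)
    return len(duplicates) / len(fragments)
```

Under this rule, two fragments at the same position in a repeat entry are only collapsed if both read sequences also match exactly, which is the part I'm unsure still makes sense for a repeat index.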
I want to compare enrichment at repetitive sequences between multiple samples. Leaving in all duplicates may unfairly increase the magnitude of some changes between samples, while removing them all may unfairly decrease some changes. For example, one particular repeat type is decreased in the treatment sample to 0.4 if I leave in duplicates and to 0.7 if I remove them. Maybe the true difference is somewhere in between? In addition, in some cases the overall level of duplication differs quite a bit between control and treatment samples (e.g. 20% vs 30%), and it's hard to know whether that's a real difference or not.
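To show the kind of comparison I mean, here is a small Python sketch. The counts are invented purely to reproduce the 0.4 and 0.7 figures and the 20% vs 30% duplication levels, and normalizing by the total number of aligned fragments is just one assumed choice; `relative_enrichment` is a hypothetical helper, not part of any package.

```python
def relative_enrichment(treat_hits: int, treat_total: int,
                        ctrl_hits: int, ctrl_total: int) -> float:
    """Treatment/control ratio of the fraction of fragments on one repeat type."""
    if treat_total <= 0 or ctrl_total <= 0 or ctrl_hits <= 0:
        raise ValueError("totals and control hits must be positive")
    return (treat_hits / treat_total) / (ctrl_hits / ctrl_total)


# Invented counts for illustration only (not real data).
# With duplicates kept: 1,000,000 fragments per sample.
with_dups = relative_enrichment(400, 1_000_000, 1_000, 1_000_000)

# With duplicates removed: control loses 20% of its fragments, treatment 30%.
without_dups = relative_enrichment(245, 700_000, 400, 800_000)

print(f"with duplicates: {with_dups:.2f}, without duplicates: {without_dups:.2f}")
```

Both the per-repeat counts and the totals change when duplicates are removed, so the ratio can move in either direction depending on how the duplicates are distributed.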
Any ideas about what to do here?