I recently noticed that UCSC genome data, such as the C. elegans file below (30 MB),
contains lowercase bases in the sequence for repeats or low-complexity regions. I would like to mask these regions out for mapping or variant calling by creating annotations from them.
The only two formats I can think of are BED and GFF, but I wonder if anyone has a better idea of how to do this, or whether UCSC or another project already provides a tool for it. TIA.
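For reference, here is a rough sketch of the kind of conversion I have in mind, written in Python. It assumes a plain-text FASTA file in which every lowercase letter marks a soft-masked base, and it emits BED-style intervals (0-based start, exclusive end). The function name `soft_masked_intervals` is made up for illustration and is not part of any UCSC tool.

```python
import re
from typing import Iterator, Optional, TextIO, Tuple

# One or more consecutive lowercase letters = one soft-masked run (assumption).
LOWER_RUN = re.compile(r"[a-z]+")


def soft_masked_intervals(fasta: TextIO) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) for each lowercase run, in BED coordinates."""
    chrom: Optional[str] = None
    offset = 0  # bases of the current sequence read so far
    pending: Optional[Tuple[int, int]] = None  # last run, not yet emitted

    for raw in fasta:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New sequence: flush the open run and reset the coordinates.
            if pending is not None:
                yield chrom, pending[0], pending[1]
            chrom = line[1:].split()[0]
            offset = 0
            pending = None
            continue
        for match in LOWER_RUN.finditer(line):
            start = offset + match.start()
            end = offset + match.end()
            if pending is not None and pending[1] == start:
                # The run continues across a line break: extend it.
                pending = (pending[0], end)
            else:
                if pending is not None:
                    yield chrom, pending[0], pending[1]
                pending = (start, end)
        offset += len(line)

    if pending is not None:
        yield chrom, pending[0], pending[1]


def write_bed(fasta: TextIO, bed: TextIO) -> None:
    """Write one tab-separated BED line per soft-masked interval."""
    for chrom, start, end in soft_masked_intervals(fasta):
        bed.write(f"{chrom}\t{start}\t{end}\n")
```

Something like `write_bed(open("genome.fa"), open("masked.bed", "w"))` would then produce the annotation file, where both file names are placeholders. The file is read line by line, so memory use should stay small even for larger genomes.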