Hi,
Some background:
I scanned the Arabidopsis thaliana genome for unique k-mer sites, in this instance 20-mer sites. I started with a highly degenerate 20-mer, NHNHNHNHNHNHNHNHNHGG, which matches 12^9 (~5x10^9) possible sequences. Scanning the genome, I found ~90,000 unique sites.
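For reference, here is a minimal sketch of where the ~5x10^9 figure comes from, assuming the standard IUPAC nucleotide codes (N = any of 4 bases, H = A, C or T). The function name `degeneracy` is just an illustrative choice.

```python
# Number of concrete bases each IUPAC nucleotide code stands for.
IUPAC_SIZE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}

def degeneracy(motif):
    """Return how many concrete sequences an IUPAC motif expands to."""
    total = 1
    for code in motif:
        total *= IUPAC_SIZE[code]
    return total

# Nine N positions (4 choices) and nine H positions (3 choices): 12**9.
print(degeneracy("NHNHNHNHNHNHNHNHNHGG"))  # 5159780352
```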
Problem:
I want to collapse these 90,000 unique 20-mers into a smaller set of ambiguous (IUPAC) sequences, ideally 100 or fewer in total, that together contain all 90,000 sequences while excluding as many as possible of the remaining ~5x10^9 sequences matched by my original NHNHNHNHNHNHNHNHNHGG motif.
I have no idea how to go about solving this problem with a script, and I don't know of any available tools that do this.
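To make the goal concrete, here is a small sketch of what collapsing a group of sequences into one IUPAC sequence means, and how many unwanted sequences the result drags in. It is only an illustration of the trade-off, not a solution: the helper names `collapse` and `excess` and the 6-mer toy sequences are made up for the example.

```python
# Map each set of bases to the IUPAC code that represents it.
BASES_TO_IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def collapse(seqs):
    """Return the tightest single IUPAC sequence covering all equal-length seqs."""
    # zip(*seqs) walks the sequences column by column.
    return "".join(BASES_TO_IUPAC[frozenset(col)] for col in zip(*seqs))

def excess(seqs):
    """Count sequences matched by the collapsed motif that are not in seqs."""
    unique = set(seqs)
    covered = 1
    for col in zip(*unique):
        covered *= len(set(col))  # choices allowed at this position
    return covered - len(unique)

# Hypothetical toy group (6-mers instead of 20-mers, for readability).
group = ["ACATGG", "ACCTGG", "TCATGG"]
print(collapse(group))  # WCMTGG
print(excess(group))    # 1 (TCCTGG is matched but was not in the group)
```

The more dissimilar the sequences in a group, the faster `excess` grows, so the question is really how to partition the 90,000 sites into about 100 groups that keep this number small.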
If anyone can give me any advice, thanks a bunch!
Best,
Derrick