I have run CD-HIT on a set of 171 million DNA sequences. They are now nicely clustered in a file, but how can I efficiently retrieve all DNA sequences that belong to clusters in particular frequency classes?
There is a nice utility in CD-HIT (plot_len1.pl) which gives me a table with sequence frequencies for various length classes. So all the frequency information is in the .clstr file, but how do I extract only the information I want, and how do I then link it back to the original sequences? Let's say I want to retrieve all sequences that occur from 10 to 19 times in my input dataset.
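To make the question concrete, here is a rough Python sketch of the kind of thing I have in mind. It is not a CD-HIT utility, and it rests on some assumptions: that each cluster in the .clstr file starts with a `>Cluster N` line, that member lines look like `0	100nt, >seq1... *`, and that the IDs in the .clstr file match the first word of the FASTA headers (by default CD-HIT truncates names in the .clstr file, so this probably needs a run with `-d 0`). The function names are my own.

```python
import re

# Assumed member line layout: "0<TAB>100nt, >seq1... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def iter_clusters(clstr_lines):
    """Yield one list of member IDs per cluster."""
    members = []
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            if members:
                yield members
            members = []
        else:
            match = MEMBER_RE.search(line)
            if match:
                members.append(match.group(1))
    if members:
        yield members


def ids_in_size_range(clstr_lines, min_size, max_size):
    """Collect IDs from clusters with min_size..max_size members (inclusive)."""
    wanted = set()
    for members in iter_clusters(clstr_lines):
        if min_size <= len(members) <= max_size:
            wanted.update(members)
    return wanted


def filter_fasta(fasta_lines, wanted):
    """Yield FASTA lines (header and sequence) for records whose ID is wanted."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
        if keep:
            yield line
```

With hypothetical file names, usage would be something like opening `clusters.clstr`, calling `ids_in_size_range(handle, 10, 19)`, and then writing the lines from `filter_fasta(open("input.fasta"), wanted)` to a new file. Both files are streamed line by line, but the set of wanted IDs is held in memory, which may or may not be acceptable for 171 million sequences.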