What we're trying to do:
Data: Illumina single-end reads.
~30 million reads (text strings) made up of approximately 400,000 unique, unknown 20 bp sequences (barcodes), each flanked by known sequence (a "tag", if you will).
We need to count the frequency of each barcode sequence across the entire data set, allowing up to 2 bp of mismatch.
The problem is that we don't have a reference to match the data against, since the barcode sequences are unknown. Has anyone done anything like this, or does anyone know of software that can do it?
We are currently using MATLAB with a brute-force technique: we compare each new sequence to all of the sequences before it, increment the matching sequence's count by one if it matches, or add the new sequence to the list if it is unique. This is going to be exceedingly slow, so we're hoping there is a better way!
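To make the current approach concrete, here is a rough sketch of the brute-force logic described above. It is written in Python rather than our actual MATLAB code, and it assumes the 20 bp barcodes have already been cut out from between the known flanking sequence. The names `hamming` and `brute_force_count` and the example barcodes are made up for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_count(barcodes, max_mismatches=2):
    """Tally barcodes, merging each one into the first earlier entry
    that is within max_mismatches of it (illustrative sketch only)."""
    seen = []    # unique barcodes, in order of first appearance
    counts = []  # counts[i] is the tally for seen[i]
    for bc in barcodes:
        for i, ref in enumerate(seen):
            if len(ref) == len(bc) and hamming(ref, bc) <= max_mismatches:
                counts[i] += 1   # matches an earlier barcode: increase by one
                break
        else:
            seen.append(bc)      # no match found: treat as a new unique barcode
            counts.append(1)
    return dict(zip(seen, counts))


# Hypothetical barcodes, not real data.
example = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",  # 1 mismatch from the first barcode
    "TTTTTTTTTTTTTTTTTTTT",
    "ACGTACGTACGTACGTACGT",
]
print(brute_force_count(example))
```

Every read is compared against every unique barcode seen so far, so with ~30 million reads and ~400,000 unique barcodes the worst case is on the order of 10^13 barcode comparisons, which is why we expect this to be so slow.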
Thanks in advance!