Hello all.
A biologist colleague has generated sequencing libraries based on a SELEX type of enrichment of artificial aptamers that have bound to bacterial cells. He will have Ion Torrent sequence (short read, single end) output (millions of reads per sample) and would like help with identifying sequences that uniquely or preferentially bind each bacterial strain. The aptamers are 80 nucleotides long and are generated randomly. There are several rounds of enrichment, so there will be sequences represented multiple times. Biological replicates will help find true positives.
Ideally he would find sequences that are present exclusively in each bacterial strain's bound aptamer population. Initially we'll look at the full-length aptamers, but of course specific motifs shared by different aptamers may also be enriched.
Does anyone know of a Bioconductor (or other) package that already does this kind of short-read counting and comparison between samples?
This can be done in Perl, for example, using hashes to count each sequence (and potentially each k-mer in the reads), but I suspect there is a better way to do it. We don't want to reinvent the wheel and would like to reuse anyone's good ideas and code.
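To make the hash-based idea concrete, here is a rough sketch of it, written in Python rather than Perl. The function names, the `min_count` threshold and the toy reads are all made up for illustration. It assumes the reads have already been extracted from FASTQ and trimmed to the aptamer region, and it ignores sequencing errors and library-size normalization, both of which real data would need.

```python
from collections import Counter


def count_sequences(reads):
    """Count how many times each full-length read sequence occurs."""
    cleaned = (r.strip().upper() for r in reads)
    return Counter(seq for seq in cleaned if seq)


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        seq = read.strip().upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def exclusive_sequences(counts_by_sample, min_count=2):
    """For each sample, keep sequences seen at least min_count times
    in that sample and never seen in any other sample."""
    result = {}
    for name, counts in counts_by_sample.items():
        seen_elsewhere = set()
        for other_name, other_counts in counts_by_sample.items():
            if other_name != name:
                seen_elsewhere.update(other_counts)
        result[name] = {
            seq: n
            for seq, n in counts.items()
            if n >= min_count and seq not in seen_elsewhere
        }
    return result


# Toy reads standing in for two strains (hypothetical data).
strain_a = ["ACGTAC", "ACGTAC", "TTTGGG", "GGGCCC"]
strain_b = ["TTTGGG", "TTTGGG", "CCCAAA", "CCCAAA"]

counts_by_sample = {
    "strain_a": count_sequences(strain_a),
    "strain_b": count_sequences(strain_b),
}
exclusive = exclusive_sequences(counts_by_sample)
kmers_a = count_kmers(strain_a, 3)
```

With millions of 80-nt reads per sample this keeps every distinct sequence in memory, which is part of why an existing package would be preferable.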
Any help or thoughts would be greatly appreciated.
Donald