ddunbar

Identifying unique aptamer sequences
Hello all.
A biologist colleague has generated sequencing libraries from a SELEX-type enrichment of artificial aptamers that have bound to bacterial cells. He will have Ion Torrent sequencing output (short-read, single-end, millions of reads per sample) and would like help identifying sequences that uniquely or preferentially bind each bacterial strain. The aptamers are 80 nucleotides long and are generated randomly. There are several rounds of enrichment, so some sequences will be represented multiple times. Biological replicates will help find true positives.

Ideally he would find sequences that are present exclusively in each bacterial strain's bound aptamer population. Initially we'll look at the full-length aptamers, but of course specific motifs shared by different aptamers may also be enriched.

Does anyone know of a Bioconductor (or other) package that already does this kind of short-read counting and comparison between samples?

This can be done in Perl, for example, using hashes and counting each sequence (and potentially each k-mer in the reads), but I suspect there is a better way to do it. We don't want to reinvent the wheel and would like to reuse anyone's good ideas and code.

Any help or thoughts would be greatly appreciated.


dawe

You can handle aptamer sequences using only standard command-line utilities: grep, awk, sort and uniq.
First, put all the sequences (one per line) in a file, then run:

$ sort file | uniq -c | sort -k1,1n > counted_sequences

You will end up with hundreds (or thousands) of repeated sequences out of millions of reads, with enrichment counts following a power law.
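
If the reads are still in FASTQ format, the preparation and counting steps could be wrapped up roughly like this. This is only a sketch: it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names fastq_to_seqs and count_seqs are made up for illustration. Unlike the command above, it sorts in descending order so the most enriched sequences come first.

```shell
# Hypothetical helper: print one read sequence per line.
# Assumes each FASTQ record is exactly four lines, with the
# sequence on the second line of every record.
fastq_to_seqs() {
    awk 'NR % 4 == 2' "$1"
}

# Hypothetical helper: read sequences on stdin and print
# "count sequence" lines, most abundant sequence first.
count_seqs() {
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr
}

# Example use (file names are placeholders):
#   fastq_to_seqs strainA.fastq | count_seqs > strainA.counted
#   head -n 20 strainA.counted
```

Setting LC_ALL=C is optional, but it usually makes sort noticeably faster on large files of plain nucleotide strings.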
Once all the files are counted, it's easy to use grep to check counts across SELEX cycles or samples.
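
As a rough illustration of that cross-sample check, here is one way it might look with awk instead of grep. It assumes each counted file is in the "count sequence" layout produced by uniq -c; the function names exclusive_to_first and count_in are hypothetical, not part of any existing tool.

```shell
# Hypothetical helper: print the lines of counted file $1 whose
# sequence does not appear at all in counted file $2.
exclusive_to_first() {
    awk -v other="$2" '
        BEGIN {
            # Load every sequence (second field) from the other sample.
            while ((getline line < other) > 0) {
                split(line, f, " ")
                seen[f[2]] = 1
            }
        }
        # Keep only lines whose sequence was not seen in the other sample.
        !($2 in seen)
    ' "$1"
}

# Hypothetical helper: print the count of sequence $1 in counted
# file $2, or 0 if the sequence is absent.
count_in() {
    awk -v s="$1" '$2 == s { print $1; found = 1 }
                   END { if (!found) print 0 }' "$2"
}

# Example use (file names are placeholders):
#   exclusive_to_first strainA.counted strainB.counted
#   count_in ACGTACGT strainB.counted
```

This only tests exact, full-length matches. Reads that differ by a single sequencing error are treated as different sequences, so low-count "exclusive" hits should be checked against the replicates.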


ddunbar

Many thanks for that, dawe. It works nicely.
Best wishes,

