We have single-cell transcriptomic reads, where each read carries a barcode that designates which cell it came from. In the default pipeline (from 10X Genomics), the barcodes of about 1/3 of the reads did not pass quality control, and we want to see whether we can relax this and save some of the data. Each read has a 14bp barcode attached to it, and there is a whitelist of allowed barcodes (all also 14bp). We are trying to use Bowtie2 to align the barcodes of the reads that failed quality control (generally they differ from every whitelisted barcode by one or more bases) against the whitelist, to see which whitelisted barcodes are most similar to each read. We may decide to keep more of the data if a read's barcode is similar enough to a whitelisted one.
Essentially, we are making mock .fastq and .fasta files from the reads' barcodes and the whitelisted barcodes respectively, and running the alignment with the whitelisted barcodes as the "reference genome". We are wondering if there is a better way or tool to do this, since I believe Bowtie2 was designed for longer reads (50+ bp), while all of our barcodes are 14bp long. The first version of Bowtie looks more geared toward short reads, but the manual still says it was intended for reads of 25-50bp. Is there a tool designed for even shorter reads, or does anyone have insight into whether Bowtie2/Bowtie gives accurate results at this length? I first tried writing a Perl script to do the check, but it took a minute for only 30 reads (one sample has 30 million...). Thanks.
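For concreteness, here is a minimal Python sketch of how the mock files described above could be built. The function names `format_fasta` and `format_fastq`, the `wl_` record prefix, and the placeholder quality character are all assumptions made for illustration, not part of our actual pipeline.

```python
def format_fasta(whitelist):
    """Render whitelisted barcodes as FASTA text, one record per barcode."""
    lines = []
    for i, barcode in enumerate(whitelist):
        lines.append(f">wl_{i}")  # hypothetical record name
        lines.append(barcode)
    return "".join(f"{line}\n" for line in lines)


def format_fastq(records, placeholder_quality="I"):
    """Render (read_id, barcode, quality) tuples as FASTQ text.

    If quality is None, a constant placeholder string of the same length
    as the barcode is used (an assumption; real qualities are preferable).
    """
    lines = []
    for read_id, barcode, quality in records:
        if quality is None:
            quality = placeholder_quality * len(barcode)
        lines.extend([f"@{read_id}", barcode, "+", quality])
    return "".join(f"{line}\n" for line in lines)
```

The returned strings can be written to `.fasta` and `.fastq` files and passed to an aligner as the reference and the reads.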
*Edit
So basically, we are doing normal alignment but with several differences:
-the reads and reference sequences are all very short (14bp)
-all sequences are exactly the same length, so there is no need for gapped alignment
-all reads are single-end
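Because everything is the same length and ungapped, the comparison reduces to Hamming distance. One alternative to an aligner, sketched here in Python under my own assumptions (this is not what Bowtie2 or the 10X pipeline does), is to generate every single-substitution variant of a failed barcode and look each one up in a set built from the whitelist. That is 42 lookups per 14bp barcode, rather than a comparison against every whitelisted barcode. The function names are hypothetical, and the sketch only covers one mismatch.

```python
BASES = "ACGT"


def neighbors_within_one(barcode):
    """Yield every sequence at Hamming distance exactly 1 from barcode."""
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def closest_whitelisted(barcode, whitelist):
    """Return the whitelisted barcodes within Hamming distance 1.

    whitelist should be a set so that each membership test is O(1).
    An exact match is returned alone; otherwise all distance-1 matches
    are returned, sorted, so ambiguous barcodes can be detected.
    """
    if barcode in whitelist:
        return [barcode]
    return sorted({n for n in neighbors_within_one(barcode) if n in whitelist})
```

A barcode that returns exactly one candidate could be rescued, while one that returns several is ambiguous and probably should not be.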