Hello all:
I have an issue that I don't think has been covered yet in the SEQanswers community.
We recently performed CAGE (http://en.wikipedia.org/wiki/Cap_ana...ene_expression), a method for large-scale profiling of 5' mRNA ends, and obtained a large number of high-quality 50 bp single-end Illumina HiSeq reads.
This experiment contains data from 8 separate experiments, and so the 5' ends of the reads were barcoded with 8 trinucleotides, as follows:
sample_1 ACC
sample_2 CAC
sample_3 AGT
sample_4 GCG
sample_5 ATG
sample_6 TAC
sample_7 ACG
sample_8 GCT
Of course, these samples need to be demultiplexed so they can be analyzed separately. I did so using the FASTX-Toolkit's FASTX Barcode Splitter (http://seqanswers.com/forums/newthre...ostthread&f=18) as follows:
cat myCompleteCAGEfile.fastq | fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --exact --suffix ".txt" --prefix /my_directory/demulti-
I chose the --exact flag because the barcodes are only three bases in length, so I reasoned it was best to demand a precise match and then rescue the unmatched reads after the fact.
The demultiplexing job above worked well, and I was left with a small (<5%) but not insignificant number of unmatched reads. The largest class of these unmatched reads has an N at the first base.
For example, one of the first reads begins with:
NCTGAGAGCGG...
Here the barcode (N)CT would correspond to sample_8 (GCT).
I ran the fastx_barcode_splitter.pl command again with a tolerance for a single mismatch, but this causes conflicts between possible barcodes. As far as I know, the command does not allow a mismatch tolerance to be specified for one particular base, which would be ideal in this case. The program also does not accept a degenerate barcode file that includes the N.
I've considered using a set of piped Linux commands, including cut and sed, but this would be trickier than it needs to be, and I expect there is another way to rescue these 'single leading N' unmatched reads. Can anyone point me in another direction? It may be possible to do this using CASAVA, but I have very limited experience with that software package.
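To make the idea concrete, here is a rough Python sketch of the position-specific match I have in mind. It is only an illustration, not a tested pipeline: the function names and the output file naming are made up for this example, and it assumes plain four-line FASTQ records. A read is rescued only when it starts with N and bases 2-3 match the end of exactly one barcode.

```python
from itertools import islice

# Barcodes as listed above (sample name -> 5' trinucleotide).
BARCODES = {
    "sample_1": "ACC", "sample_2": "CAC", "sample_3": "AGT", "sample_4": "GCG",
    "sample_5": "ATG", "sample_6": "TAC", "sample_7": "ACG", "sample_8": "GCT",
}


def build_suffix_table(barcodes):
    """Map the last two bases of each barcode to the samples sharing them."""
    table = {}
    for sample, barcode in barcodes.items():
        table.setdefault(barcode[1:], []).append(sample)
    return table


def assign_leading_n(seq, table):
    """Return the sample for a read starting with N, or None if not resolvable."""
    if not seq.startswith("N"):
        return None
    candidates = table.get(seq[1:3], [])
    # Rescue only when exactly one barcode ends with these two bases.
    return candidates[0] if len(candidates) == 1 else None


def read_fastq(handle):
    """Yield FASTQ records as lists of four lines (assumes unwrapped FASTQ)."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield record


def rescue_leading_n(in_path, out_prefix, barcodes=BARCODES):
    """Write rescued reads to <out_prefix><sample>.rescued.txt; return counts."""
    table = build_suffix_table(barcodes)
    outputs, counts = {}, {}
    try:
        with open(in_path) as handle:
            for record in read_fastq(handle):
                sample = assign_leading_n(record[1], table)
                if sample is None:
                    continue  # leave ambiguous or non-N reads unmatched
                if sample not in outputs:
                    # Hypothetical naming scheme, chosen for this sketch only.
                    outputs[sample] = open(f"{out_prefix}{sample}.rescued.txt", "w")
                outputs[sample].write("\n".join(record) + "\n")
                counts[sample] = counts.get(sample, 0) + 1
    finally:
        for out in outputs.values():
            out.close()
    return counts
```

With the barcodes above, NCC, NGT, NTG and NCT each point to a single sample, so the example read NCTGAGAGCGG... would go to sample_8. NAC (CAC or TAC) and NCG (GCG or ACG) cannot be resolved from the last two bases alone, so those reads would stay unmatched. The function would be pointed at the splitter's unmatched output file, whatever that is called in your run.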
Thanks in advance,
Taylor