I have Illumina PE data from a MiSeq run and am finding large discrepancies in the number of reads recovered while demultiplexing.
Originally, the FASTQ files provided with the run had very few reads per individual and a large (>4 GB) Undetermined file. I investigated the DemultiplexSummary provided with the run and found that the top 30 indexes were in fact my indexes, but for some reason they were not demultiplexed correctly.
Using the FASTX barcode splitter (Hannon Lab), allowing a one-nucleotide mismatch, I was able to recover a small number of additional reads from the Undetermined file, but not nearly the number stated in the summary.
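To make sure I am not misunderstanding what the splitter does, here is a toy sketch of how I picture the matching: compare the first 6 bases of R1 against each barcode and allow up to one mismatch (this is only my mental model, not the actual FASTX implementation; the read strings are made up):

```python
# Toy sketch of 5'-end barcode matching with <=1 mismatch,
# mirroring (my understanding of) the barcode splitter's behaviour.

def mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(read: str, barcodes: list, max_mm: int = 1):
    """Return the first barcode whose length matches the read's prefix
    with at most max_mm mismatches, else None (-> Undetermined)."""
    for bc in barcodes:
        if mismatches(read[:len(bc)], bc) <= max_mm:
            return bc
    return None

barcodes = ["CAAAAG"]  # my ordered barcode
print(assign_barcode("CAAAAGTTACG", barcodes))  # exact match
print(assign_barcode("CTAAAGTTACG", barcodes))  # one mismatch, still assigned
print(assign_barcode("CTTTTGTTACG", barcodes))  # 4 mismatches -> None
```

Notably, a read starting with CTTTTG would never be assigned to CAAAAG under this scheme, since the two differ at four positions.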
I then loaded the raw data into Geneious and, after trimming the adapters with the BBDuk plugin, demultiplexed and found significantly more reads, but still only approximately half the number that should be present.
Some Numbers:
Barcode CAAAAG/CTTTTG
DemultiplexSummary: 1,119,171 reads
Fastq file unaltered from run: 1,428 reads
Fastx Barcode Splitter (on undetermined file): 118,031 reads
Geneious: 574,498 reads
My questions are:
1 - Why are the different programs giving such disparate results?
2 - Am I misunderstanding the orientation of the barcodes in the reads, and thus searching for them incorrectly? My understanding is that with an "adapter - barcode - read" layout, the barcode should be the first 6 bases of the R1 read after adapter trimming (in my example, CAAAAG). R2 should not contain the barcode, or at least I should not have to search for one in R2 since the data are paired. I recovered the reads in Geneious using CTTTTG, as that was the index listed in the DemultiplexSummary, but the barcode listed in my primer order was CAAAAG, so I am concerned that I am misunderstanding a fundamental piece of the puzzle here.
3 - Most importantly, how do I recover the 1 million+ reads?
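One thing I noticed while writing this up: CTTTTG is exactly the reverse complement of CAAAAG, which makes me suspect the index was read in the opposite orientation to the one I ordered. A quick check:

```python
# Verify that the index in the DemultiplexSummary (CTTTTG)
# is the reverse complement of my ordered barcode (CAAAAG).

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

print(revcomp("CAAAAG"))  # prints CTTTTG
```

If that is the explanation, it would account for why searching the reads for CAAAAG finds so few matches.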