Hi all,
I'm fairly new to the realm of bioinformatics with large data sets, so apologies if I've missed something crucial here...
I've recently received some Illumina HiSeq2500 data in FASTQ format which haven't been demultiplexed. We've used custom i5 and i7 sequences in unique combinations for 96 samples. I was given the data as 8 paired-end FASTQ files, 2 per lane (4 lanes). I've concatenated all of the forward reads and all of the reverse reads into 2 files for simplicity. I've been using the demuxbyname.sh script from BBMap - but I keep running into a couple of problems:
1. When I run demuxbyname.sh with a single index string I only receive ~2500 reads in the output files. I've noticed that a lot of the index sequences in the FASTQ files contain N's - especially as the first base call (for both i5 and i7).
2. A single-index run like this generally takes ~3 hrs, but when I then attempt to run the script with an index.txt file containing multiple index combinations, the compute time increases exponentially.
Any help on either of these points is greatly appreciated!
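To illustrate what I think is happening in point 1, here is a minimal Python sketch (not how demuxbyname.sh works internally, just my own reasoning). It assumes the index pair sits at the end of each read header as `i7+i5` after the last colon, which is the usual Illumina layout but may not match my files exactly. The barcodes, sample names and the `max_mismatches` threshold are made up for the example. It shows that an exact string comparison rejects a read whose index starts with N, while a comparison that treats N as a wildcard still assigns it.

```python
def barcode_from_header(header):
    """Return the text after the last colon of a FASTQ header, e.g. 'ACGT+TGCA'.

    Assumption: headers end with ':<i7>+<i5>' (typical Illumina layout).
    """
    return header.rstrip().rsplit(":", 1)[-1]


def mismatches_ignoring_n(observed, expected):
    """Count mismatching positions, treating N in the observed index as a wildcard."""
    if len(observed) != len(expected):
        # Different lengths cannot be compared position by position.
        return max(len(observed), len(expected))
    return sum(1 for o, e in zip(observed, expected) if o != "N" and o != e)


def assign_sample(observed, samples, max_mismatches=1):
    """Return the sample name for an observed index pair.

    Returns None when no barcode, or more than one barcode, is within
    max_mismatches of the observed index pair.
    """
    hits = [
        name
        for barcode, name in samples.items()
        if mismatches_ignoring_n(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes and sample names, for illustration only.
samples = {
    "ACGTACGT+TGCATGCA": "sample_01",
    "GGTTCCAA+AACCGGTT": "sample_02",
}

# Hypothetical header with an N as the first base call of both indexes.
header = "@HISEQ:1:FLOWCELL:1:1101:1000:2000 1:N:0:NCGTACGT+NGCATGCA"
observed = barcode_from_header(header)

exact_hit = observed in samples               # exact string matching
tolerant_hit = assign_sample(observed, samples)  # N treated as a wildcard
```

With these made-up inputs the exact comparison finds nothing, while the N-tolerant comparison assigns the read to `sample_01`. Because every read is checked against all barcodes in one go, the same idea would also only need a single pass over the file rather than one pass per index combination.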