SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   demultiplexing dual index Illumina reads with mismatch tolerance (http://seqanswers.com/forums/showthread.php?t=67090)

spark 03-23-2016 09:51 AM

demultiplexing dual index Illumina reads with mismatch tolerance
 
I'm analysing genotyping-by-sequencing data for 2,100 dual-indexed samples and am looking for a fast, mismatch-tolerant demultiplexer.

Libraries for each sample were prepared using a unique combination of Illumina I5 and I7 index sequences. The libraries were pooled and sequenced across two HiSeq lanes.

I've used the Java demuxbyname demultiplexer from the BBMap package, which demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.

Pairwise Hamming distances for the I5 sequences (96 in all, one for each well position) vary between 3 and 6, so I'm hoping there shouldn't be a problem tolerating one mismatch there.

The I7 index sequences have pairwise Hamming distances between 2 and 6, so I would need to correct mismatching index reads only where they were one mismatch away from a single genuine index sequence and to reject them otherwise.
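For what it's worth, the pairwise-distance reasoning above can be sketched in a few lines of Python. The index sequences here are made up, not the real plate sets; floor((d - 1) / 2) is the standard correction radius for a set of codes with minimum Hamming distance d:

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 8-bp indices standing in for the real I5/I7 sets.
indices = ["ACGTACGT", "ACGTTGCA", "TGCAACGT", "TGCATGCA"]

min_dist = min(hamming(a, b) for a, b in combinations(indices, 2))

# A set with minimum pairwise distance d can unambiguously correct
# up to floor((d - 1) / 2) mismatches per index read; d = 3 allows
# one mismatch, d = 2 allows none (matching the I7 caution above).
correctable = (min_dist - 1) // 2
print(min_dist, correctable)
```

With a real index set, any pair at distance 2 means a one-mismatch read can sit exactly between two genuine indices, which is why those reads must be rejected rather than corrected.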

Would anyone know of a fast demultiplexer that could offer some or all of this functionality?

Thanks,

Stephen

HESmith 03-23-2016 12:41 PM

Illumina's bcl2fastq2 is probably the fastest mismatch-tolerant demultiplexer (the --barcode-mismatches flag accepts 0-2; the default is 1). N.B. this assumes you have access to the full run directory.

spark 03-23-2016 02:13 PM

Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.

GenoMax 03-23-2016 02:26 PM

Quote:

Originally Posted by spark (Post 191306)
Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.

That would be the easiest way.

Alternatively, you could enumerate all index tags present in your files and then go back and use the demultiplexer from BBMap to recover every read in the original file.
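Assuming the index reads appear in the FASTQ headers in the usual CASAVA 1.8+ style, enumerating the tags could look something like this sketch (the headers are made up; real code would read every fourth line of the gzipped FASTQ instead of a hard-coded list):

```python
from collections import Counter

# CASAVA 1.8+ headers carry the index read(s) in the last colon-separated
# field, e.g. "1:N:0:ACGTACGT+TGCATGCA" for dual indices.
headers = [
    "@M00001:1:000000000-AAAAA:1:1101:1000:2000 1:N:0:ACGTACGT+TGCATGCA",
    "@M00001:1:000000000-AAAAA:1:1101:1001:2001 1:N:0:ACGTACGT+TGCATGCA",
    "@M00001:1:000000000-AAAAA:1:1101:1002:2002 1:N:0:ACGTTGCA+TGCATGCA",
]

# Take everything after the final colon and tally the combinations.
tags = Counter(h.rsplit(":", 1)[1] for h in headers)
for tag, n in tags.most_common():
    print(tag, n)
```

Sorting the tally by count makes it easy to spot which observed tags are one mismatch away from a genuine index combination and which are just noise.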

Brian Bushnell 03-25-2016 11:34 AM

Quote:

Originally Posted by spark (Post 191298)
I've used the Java demuxbyname demultiplexer from the BBMap package, which demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.

Demuxbyname is fairly efficient, but mostly single-threaded aside from compression, so 32 cores won't help (it tops out at ~3 cores, unless the output files are gzipped and pigz is available). It is single-threaded because there is so little work to do that adding threads wouldn't help, which is why 2 hours sounds rather long; at that speed I wonder if the filesystem is the limit. You can check (roughly) by running top and watching the CPU utilization of the process: if a single-threaded process sits under 100%, it's constrained by something external, such as I/O.

I designed it to reprocess Illumina's demultiplexing output, because Illumina's software is very slow and I needed to run multiple tests with a quick turnaround. I have found that you can write your own program that runs faster than Illumina's program can even generate its output.

The Illumina software was configured to allow mismatches in barcodes. However, our experiments indicated a substantial cross-contamination rate in multiplexed libraries (which matters greatly in some experiments, such as single-cell). Allowing only zero-mismatch (exact) barcodes reduced the cross-contamination rate somewhat.

Essentially -
Do whatever you think is best for your experiment. But, those 30% of reads that were getting discarded... I highly recommend you discard them. They are being discarded for a reason. You can discard them with Illumina's software (which is the most efficient approach), or later with BBMap or whatever. Or you can keep them. But I cannot imagine an experiment that would benefit from importing an additional 30% of unreliable data into a dataset of reliable data.

spark 03-25-2016 04:21 PM

Thanks for the insights & advice - all very helpful.

We're running a pilot study comparing sequence reads from a multiplexed PCR of 200 SNP assays against genotypes for the same SNPs generated on an alternative platform, so for now we're interested in exploring all aspects of the data. I shall bear in mind, though, that reads with mismatching index sequences may give false results.

We've 218M reads from two lanes of a HiSeq flowcell and 2,112 combinations of I5 and I7 adapters (22 I7 x 96 I5) to demultiplex. I'm not quite sure how long the demultiplexing took using your demuxbyname utility, but it ran quickly enough to be practical, and it is very convenient to use from a scripting point of view. One thing I did note is that the demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.

Code used:
demuxbyname.sh in=L001_R1_001.fastq.gz out=L1.%.fq.gz outu=L1.bad_index.fq.gz suffix names=bbm_indices ow=true

... where bbm_indices contains the list of 2112 I5 x I7 combinations
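In case it's useful to anyone, here is a minimal sketch of generating such a names file. The index sequences are placeholders, and the "+"-separated format is an assumption based on how dual indices appear in Illumina read headers (the real run used 22 I7 x 96 I5 = 2,112 combinations):

```python
# Placeholder index sequences; substitute the real I7 and I5 lists.
i7 = ["ACGTACGT", "TGCATGCA"]
i5 = ["AAAACCCC", "GGGGTTTT", "CCAATTGG"]

# Write every I7 x I5 combination, one per line, in the "I7+I5" form
# that dual indices take in the FASTQ header's final field.
with open("bbm_indices", "w") as out:
    for a in i7:
        for b in i5:
            out.write(f"{a}+{b}\n")
```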

Brian Bushnell 03-25-2016 07:37 PM

Quote:

Originally Posted by spark (Post 191408)
One thing I did note is that demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.

That's interesting. Thanks for sharing! I'm not sure I'll be able to fix it, but I will look into it. Were all of the output files complete and correctly formatted? It sounds to me like the program was waiting for an output stream to finish, which never happened, so I suggest verifying that the read counts across all of the output files actually add up to the number reported. Since your output was gzipped, any file that didn't complete will be an invalid gzip stream. So -

cat L1.*.fq.gz | reformat.sh in=stdin.fq.gz pigz=f gzip=f unpigz=f gunzip=f

That will read the files, count the reads, and crash with some kind of error message if any of the gzipped files are corrupt (the pigz/gzip flags tell it not to spawn additional subprocesses for compression/decompression). I've never used demuxbyname with more than ~100 output files. Is pigz installed? You can check by running the command "pigz"; if it isn't installed, you'll get some variety of "command not found" message. For some reason I set demuxbyname to use pigz by default, which was a bad idea: in a case like this it will try to spawn 2,112 pigz processes (if pigz is installed), which could cause problems.
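If reformat.sh isn't handy, a rough pure-Python equivalent of the same check might look like this sketch. The L1.*.fq.gz pattern matches the command earlier in the thread; reading each gzip stream to the end forces its CRC check, so a truncated file raises an error:

```python
import gzip
import glob

def count_fastq_records(path):
    """Count FASTQ records, failing on corrupt gzip data or odd line counts."""
    with gzip.open(path, "rt") as fh:
        lines = sum(1 for _ in fh)  # reading to EOF verifies the gzip CRC
    if lines % 4 != 0:
        raise ValueError(f"{path}: line count {lines} is not a multiple of 4")
    return lines // 4

# Sum the per-file counts and compare against the total demuxbyname reported.
total = 0
for path in sorted(glob.glob("L1.*.fq.gz")):
    n = count_fastq_records(path)
    print(path, n)
    total += n
print("total reads:", total)
```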

