Seqanswers Leaderboard Ad

**HESmith** · 03-23-2016, 12:41 PM

Illumina's bcl2fastq2 would probably be the fastest demultiplexer that's mismatch-tolerant (--barcode-mismatches flag, accepts 0-2, default=1). N.B.-this assumes you have access to the full run directory.

**spark** · 03-23-2016, 02:13 PM

Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.

**GenoMax** · 03-23-2016, 02:26 PM

Originally posted by spark View Post

Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.

That would be the easiest way.

You could enumerate all tags present in your files and then go back and use the demultiplexer from BBMap to get every read in the original file.

**Brian Bushnell** · 03-25-2016, 11:34 AM

Originally posted by spark View Post

I’ve used the Java demuxbyname demultiplexer from the BBMap package, and this demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.

Demuxbyname is fairly efficient, but mostly singlethreaded aside from compression, so 32 cores won't help (it reaches its cap at ~3 cores, unless the output files are gzipped and pigz is available). 2 hours sounds kind of long, though - the reason it is singlethreaded is because there is so little work to do, that adding threads won't help. At 2 hours, I wonder if the filesystem is limiting. You can determine (roughly) that by running top to see the CPU utilization of a process; if a singlethreaded process is under 100%, it's constrained by external things.

I designed it to reprocess Illumina's demultiplexing output, because Illumina software is very slow, and I needed to do multiple tests with a quick turnaround. I have found that one can write their own program faster than an Illumina program can generate output.

The Illumina software was configured to allow mismatches in barcodes. However, our experiments indicated a substantial cross-contamination rate in multiplexed libraries (which is extremely important in some experiments, such as single-cell). Allowing zero-mismatch barcodes improved the cross-contamination rate somewhat.

Essentially -
Do whatever you think is best for your experiment. But, those 30% of reads that were getting discarded... I highly recommend you discard them. They are being discarded for a reason. You can discard them with Illumina's software (which is the most efficient approach), or later with BBMap or whatever. Or you can keep them. But I cannot imagine an experiment that would benefit from importing an additional 30% of unreliable data into a dataset of reliable data.

**spark** · 03-25-2016, 04:21 PM

Thanks for the insights & advice - all very helpful.

We're doing a pilot study comparing sequence reads from a multiplexed PCR of 200 SNP assays with genotypes for the same SNPs generated using an alternative platform, so we're interested in exploring all aspects of the data for now, but I shall bear in mind that the reads with mismatching index sequences may give false results.

We've 218M reads from two lanes of a Hi Seq flowcell and 2,112 combinations of I5 and I7 adapters (22 I7 x 96 I5) to demultiplex. I'm not quite sure how long the demultiplexing took using your demuxbyname utility, though it ran in a short enough time to be practical, and is very convenient to use from a scripting point of view. One thing I did note is that demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.

Code used:
demuxbyname.sh in=L001_R1_001.fastq.gz out=L1.%.fq.gz outu=L1.bad_index.fq.gz suffix names=bbm_indices ow=true

... where bbm_indices contains the list of 2112 I5 x I7 combinations

**Brian Bushnell** · 03-25-2016, 07:37 PM

Originally posted by spark View Post

One thing I did note is that demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.

That's interesting. Thanks for sharing! I'm not sure I'll be able to fix it, but I will look into it... were all of the output files complete and formatted correctly? It sounds to me like the program was waiting for some output stream to finish, which never happened, so I suggest verifying that the number of reads in all of the output files actually add up to the number reported. Since your output was gzipped, if anything didn't complete, the output file would be invalid. So -

cat L1.*.fq.gz | reformat.sh in=stdin.fq.gz pigz=f gzip=f unpigz=f gunzip=f

That will read the files, count the number of reads, and crash with some kind of error message if any of the gzipped files are corrupt (the pigz/gzip flags tell it to not spawn additional subprocesses for compression/decompression). I've never used demuxbyname with more than ~100 output files. Is pigz installed? You can check with the command "pigz"; if it is not installed, you'll get some variety of "command not found" message. For some reason I set demuxbyname to use pigz by default, which was a bad idea as in cases like this it will try to spawn 2,112 pigz processes (if it is installed) which could cause problems.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

demultiplexing dual index Illumina reads with mismatch tolerance

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News