03-25-2016, 11:34 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by spark
I've used the Java demuxbyname demultiplexer from the BBMap package, and this demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.
Demuxbyname is fairly efficient, but it is mostly single-threaded aside from compression, so 32 cores won't help (it tops out at ~3 cores unless the output files are gzipped and pigz is available). 2 hours sounds kind of long, though. It is single-threaded because there is so little work to do per read that adding threads wouldn't help. At 2 hours, I wonder whether the filesystem is the limiting factor. You can roughly check that by running top and looking at the CPU utilization of the process; if a single-threaded process is under 100%, it's constrained by something external.
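
As a rough alternative to watching top, the same check can be scripted. The sketch below is not part of BBMap; the function name cpu_utilization is made up for illustration. It runs a command to completion and divides the CPU time the child consumed by the wall-clock time elapsed. It assumes a Unix-like system, where os.times() reports child CPU time.

```python
import os
import subprocess
import sys
import time


def cpu_utilization(cmd):
    """Run cmd to completion; return (child CPU seconds) / (wall-clock seconds).

    Hypothetical helper for illustration. For a single-threaded program,
    a value well below 1.0 suggests it spent most of its time waiting on
    something external (disk, network filesystem, pipes) rather than computing.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # CPU time (user + system) consumed by child processes that have finished.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return cpu / wall


if __name__ == "__main__":
    # Stand-in command: a child that only sleeps, so utilization is near zero.
    idle = [sys.executable, "-c", "import time; time.sleep(0.5)"]
    print(f"CPU utilization: {cpu_utilization(idle):.2f}")
```

To test a real run, replace the stand-in command with the actual demultiplexing command line. A multithreaded program can exceed 1.0, so the number is only easy to interpret for mostly single-threaded work.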

I designed it to reprocess Illumina's demultiplexing output, because Illumina's software is very slow and I needed to run multiple tests with a quick turnaround. I have found that you can write your own program faster than an Illumina program can generate output.

The Illumina software was configured to allow mismatches in barcodes. However, our experiments indicated a substantial cross-contamination rate in multiplexed libraries (which is extremely important in some experiments, such as single-cell). Requiring zero mismatches in barcodes reduced the cross-contamination rate somewhat.
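
To make the mismatch setting concrete, here is a minimal sketch of barcode assignment by Hamming distance. It is an illustration only, not how Illumina's software or demuxbyname is implemented; the function names and the example barcodes are made up.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, expected, max_mismatches=0):
    """Return the single expected barcode within max_mismatches of observed.

    Returns None if no barcode matches, or if more than one matches
    (an ambiguous read). max_mismatches=0 means exact matches only.
    """
    hits = [
        bc
        for bc in expected
        if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    barcodes = ["ACGTAC", "TTGCAA"]  # hypothetical sample barcodes
    read_index = "ACGTAA"            # one base off from the first barcode
    print(assign_barcode(read_index, barcodes, max_mismatches=0))
    print(assign_barcode(read_index, barcodes, max_mismatches=1))
```

With max_mismatches=0 the read is rejected; with max_mismatches=1 it is assigned to the nearest barcode. That is the trade-off in question: a looser threshold keeps more reads, but every extra read it keeps has an index that does not exactly match any sample.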

Essentially:
Do whatever you think is best for your experiment. But those 30% of reads that were getting discarded... I highly recommend you discard them. They are being discarded for a reason. You can discard them with Illumina's software (which is the most efficient approach), or later with BBMap or whatever. Or you can keep them. But I cannot imagine an experiment that would benefit from importing an additional 30% of unreliable data into a dataset of reliable data.