SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Help with demultiplexing inline dual barcodes on paired end Illumina sequences Kurt Lamour Bioinformatics 0 10-05-2015 11:41 AM
Demultiplexing Illumina RNASeq paired reads Bacms Bioinformatics 8 01-14-2015 08:33 AM
Demultiplexing PE Illumina reads Girma Illumina/Solexa 3 07-23-2014 10:16 AM
Demultiplexing illumina reads Bioinform Bioinformatics 4 02-25-2014 09:37 AM
Demultiplexing Illumina reads Bardj Bioinformatics 1 02-16-2011 10:17 AM

Reply
 
Thread Tools
Old 03-23-2016, 09:51 AM   #1
spark
Junior Member
 
Location: Dublin

Join Date: Mar 2009
Posts: 6
Default demultiplexing dual index Illumina reads with mismatch tolerance

I'm analysing genotyping-by-sequencing data for 2,100 dual-indexed samples and am looking for a fast, mismatch-tolerant demultiplexer.

Libraries for each sample were prepared using a unique combination of Illumina I5 and I7 index sequences. Libraries were pooled and sequenced using two Hiseq lanes.

Iíve used the Java demuxbyname demultiplexer from the BBMap package, and this demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.

Pairwise Hamming distances for the I5 sequences (96 in all, one for each well position) vary between 3 and 6, so I'm hoping there shouldn't be a problem tolerating one mismatch there.

The I7 index sequences have pairwise Hamming distances between 2 and 6, so I would need to correct mismatching index reads only where they were one mismatch away from a single genuine index sequence and to reject them otherwise.

Would anyone know of a fast demultiplexer that could offer some or all of this functionality?

Thanks,

Stephen
spark is offline   Reply With Quote
Old 03-23-2016, 12:41 PM   #2
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 498
Default

Illumina's bcl2fastq2 would probably be the fastest demultiplexer that's mismatch-tolerant (--barcode-mismatches flag, accepts 0-2, default=1). N.B.-this assumes you have access to the full run directory.
HESmith is offline   Reply With Quote
Old 03-23-2016, 02:13 PM   #3
spark
Junior Member
 
Location: Dublin

Join Date: Mar 2009
Posts: 6
Default

Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.
spark is offline   Reply With Quote
Old 03-23-2016, 02:26 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,784
Default

Quote:
Originally Posted by spark View Post
Thanks - I'll look into using that. I've only had access to the fastq files to date, but I should be able to access the bcl files too.
That would be the easiest way.

You could enumerate all tags present in your files and then go back and use the demultiplexer from BBMap to get every read in the original file.
GenoMax is offline   Reply With Quote
Old 03-25-2016, 11:34 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by spark View Post
Iíve used the Java demuxbyname demultiplexer from the BBMap package, and this demultiplexed the data in around 2 hours on a 32-core machine. However, 30% of reads were rejected because they failed to match the index sequences used.
Demuxbyname is fairly efficient, but mostly singlethreaded aside from compression, so 32 cores won't help (it reaches its cap at ~3 cores, unless the output files are gzipped and pigz is available). 2 hours sounds kind of long, though - the reason it is singlethreaded is because there is so little work to do, that adding threads won't help. At 2 hours, I wonder if the filesystem is limiting. You can determine (roughly) that by running top to see the CPU utilization of a process; if a singlethreaded process is under 100%, it's constrained by external things.

I designed it to reprocess Illumina's demultiplexing output, because Illumina software is very slow, and I needed to do multiple tests with a quick turnaround. I have found that one can write their own program faster than an Illumina program can generate output.

The Illumina software was configured to allow mismatches in barcodes. However, our experiments indicated a substantial cross-contamination rate in multiplexed libraries (which is extremely important in some experiments, such as single-cell). Allowing zero-mismatch barcodes improved the cross-contamination rate somewhat.

Essentially -
Do whatever you think is best for your experiment. But, those 30% of reads that were getting discarded... I highly recommend you discard them. They are being discarded for a reason. You can discard them with Illumina's software (which is the most efficient approach), or later with BBMap or whatever. Or you can keep them. But I cannot imagine an experiment that would benefit from importing an additional 30% of unreliable data into a dataset of reliable data.
Brian Bushnell is offline   Reply With Quote
Old 03-25-2016, 04:21 PM   #6
spark
Junior Member
 
Location: Dublin

Join Date: Mar 2009
Posts: 6
Default

Thanks for the insights & advice - all very helpful.

We're doing a pilot study comparing sequence reads from a multiplexed PCR of 200 SNP assays with genotypes for the same SNPs generated using an alternative platform, so we're interested in exploring all aspects of the data for now, but I shall bear in mind that the reads with mismatching index sequences may give false results.

We've 218M reads from two lanes of a Hi Seq flowcell and 2,112 combinations of I5 and I7 adapters (22 I7 x 96 I5) to demultiplex. I'm not quite sure how long the demultiplexing took using your demuxbyname utility, though it ran in a short enough time to be practical, and is very convenient to use from a scripting point of view. One thing I did note is that demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.

Code used:
demuxbyname.sh in=L001_R1_001.fastq.gz out=L1.%.fq.gz outu=L1.bad_index.fq.gz suffix names=bbm_indices ow=true

... where bbm_indices contains the list of 2112 I5 x I7 combinations
spark is offline   Reply With Quote
Old 03-25-2016, 07:37 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by spark View Post
One thing I did note is that demuxbyname processes continued running after all of the reads had been demultiplexed, so I had to interrupt my pipeline and kill the processes manually before I could proceed to genotype calling.
That's interesting. Thanks for sharing! I'm not sure I'll be able to fix it, but I will look into it... were all of the output files complete and formatted correctly? It sounds to me like the program was waiting for some output stream to finish, which never happened, so I suggest verifying that the number of reads in all of the output files actually add up to the number reported. Since your output was gzipped, if anything didn't complete, the output file would be invalid. So -

cat L1.*.fq.gz | reformat.sh in=stdin.fq.gz pigz=f gzip=f unpigz=f gunzip=f

That will read the files, count the number of reads, and crash with some kind of error message if any of the gzipped files are corrupt (the pigz/gzip flags tell it to not spawn additional subprocesses for compression/decompression). I've never used demuxbyname with more than ~100 output files. Is pigz installed? You can check with the command "pigz"; if it is not installed, you'll get some variety of "command not found" message. For some reason I set demuxbyname to use pigz by default, which was a bad idea as in cases like this it will try to spawn 2,112 pigz processes (if it is installed) which could cause problems.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
demultiplexing, dual indexing, illumina

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:50 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO