
**GenoMax** · 04-07-2015, 11:42 AM

Was the facility unwilling to demultiplex these for you? That seems kind of odd (unless you chose not to provide them with index information beforehand).

**luc** · 04-07-2015, 12:49 PM

The most flexible demultiplexing tool I am aware of is this:

http://comailab.genomecenter.ucdavis.edu/index.php/Barcoded_data_preparation_tools

I assume that it should work. Perhaps you have to remove the "+" between the barcodes or modify the script so that it ignores the "+".

**Brian Bushnell** · 04-07-2015, 02:34 PM

BBTools has a program, "demuxbyname", which will do this. Usage:

demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.

Oh, and it's extremely fast.
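
Since the names list has to contain every expected barcode in exactly the format found in the read header, it can help to see which barcodes actually occur before writing it. The following is a minimal sketch, not part of BBTools: `list_barcodes` is a hypothetical helper, and it assumes four-line FASTQ records with the barcode as the last colon-separated field of the header.

```shell
# Hypothetical helper (not a BBTools command): list the distinct barcodes
# found in the last colon-separated field of each FASTQ header line,
# with their read counts, most frequent first.
# Assumes four lines per record and Illumina-style headers.
list_barcodes() {
    awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' "$1" \
        | sort | uniq -c | sort -k1,1nr
}
```

For example, `list_barcodes r1.fq | head -n 20` would show the most abundant barcode strings. Rare entries are likely to be index read errors rather than real samples, so the list should be checked against the sample sheet before it is used as a names file.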

**TompaB** · 04-08-2015, 10:38 AM

bbmap is the solution!

Thank you all for your help! The "demuxbyname.sh" approach works well and is fast.

Could I ask you two more things, Brian? First, I cannot find an option in this script to save the unmatched reads to a separate file. Is there one? I would like this because the reads are from NextSeq v1 chemistry, so a significant number of them have mismatches in the indices. Second, I cannot find any option for allowing 1 or 2 mismatches (the Illumina demultiplexer normally allows 1 mismatch per index, i.e., a total of two mismatches).

As for the other question, why I need to do this: the core facility can and will demultiplex the file for me. They skipped it out of misdirected kindness following a miscommunication. It is just that I need the data really soon and they need some time to do it.

Thank you again!

**GenoMax** · 04-08-2015, 10:42 AM

Brian's programs share common options so you may want to try adding "outu=file_name" to your command to see if the unmatched reads are captured there.

**Brian Bushnell** · 04-08-2015, 12:39 PM

Hmmm, actually it doesn't have an "outu" flag right now; I'll add that for the next release.

We strictly throw away all reads with imperfect barcodes to minimize the risk of cross-contamination. But, I could add an option to the program to allow mismatches, I suppose; I might as well.

This will require a lot of memory, but if you want to capture all of the reads that did not have matching barcodes right now, you can do so like this:

1) Concatenate all of the output files that did have correct barcodes into a single file:
cat out_*_1.fq > combined.fq

2) Run filterbyname.sh:
filterbyname.sh in=r#.fq out=nonmatching#.fq names=combined.fq include=f
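
As a sanity check on the two steps above, the number of reads in the matched and non-matching files should add up to the number in the input. A minimal sketch, assuming four-line FASTQ records; `count_reads` and `check_read_totals` are hypothetical helper names, not BBTools commands.

```shell
# Hypothetical helper: count reads in a FASTQ file, assuming the usual
# four lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical check: succeeds only if the matched and unmatched read
# counts add up to the input read count.
# Arguments: input FASTQ, matched FASTQ, unmatched FASTQ.
check_read_totals() {
    total=$(count_reads "$1")
    matched=$(count_reads "$2")
    unmatched=$(count_reads "$3")
    [ "$((matched + unmatched))" -eq "$total" ]
}
```

With the file names used above, this could be run as `check_read_totals r1.fq combined.fq nonmatching1.fq && echo "counts add up"`.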

**Brian Bushnell** · 04-09-2015, 04:01 PM

Demuxbyname now supports an "outu" flag. Does not support substitutions yet, though.

**luc** · 10-03-2022, 04:58 PM

Does demuxbyname support wildcards by chance?
Thanks in advance!

**Brian Bushnell** · 10-04-2022, 10:03 AM

It does not explicitly support wildcards, but you also don't necessarily need to supply a list of exact names. For example, with standard Illumina headers that have a barcode in them (at least, in the format we generate them), you can demux into multiple files, one per barcode, without supplying a list of barcodes. Or you can match just a prefix, suffix, or substring (to a list of names) so the rest is implicitly a wildcard... in other words, you can match patterns like "foo*" or "*foo" or "*foo*", but not "foo*bar".
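
To make the three supported pattern shapes concrete, here is an illustration of the matching semantics using plain shell patterns. It is only a sketch of the idea, not demuxbyname's actual code, and `name_matches` is a hypothetical function.

```shell
# Illustration only (not demuxbyname's code): test whether NAME matches KEY
# as a prefix ("foo*"), suffix ("*foo"), or substring ("*foo*").
# KEY is always treated literally, so a key like "foo*bar" has no
# wildcard meaning.
name_matches() {
    name=$1
    key=$2
    mode=$3
    case "$mode" in
        prefix)    case "$name" in "$key"*)  return 0 ;; esac ;;
        suffix)    case "$name" in *"$key")  return 0 ;; esac ;;
        substring) case "$name" in *"$key"*) return 0 ;; esac ;;
    esac
    return 1
}
```

For instance, `name_matches "read1 1:N:0:GGACTCCT+GCGATCTA" "GGACTCCT+GCGATCTA" suffix` succeeds, because everything before the barcode is implicitly the wildcard part.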

**superribosome** · 11-09-2022, 02:58 AM

Great tool. I was wondering, is there a way to get it to match on the first N nts of the name? My use case is the following: I have a big fastq file with unsplit indices. The indices were read as 9+9 but were in fact only indexed with 6mers. So they look like this:

@7001253F:517:CBKMUANXX:5:1107:5342:1998 1:N:0:GTGTGATCT+TCTTTCCCT

But only the first 6 nts of the suffix (i.e. GTGTGA) are actually part of the barcode.

All my barcodes are separated by a Hamming distance of 3. Ideally, I would like to separate the barcodes, allowing up to 2 mismatches in the barcode region only and ignoring mismatches in the non-barcode region, for example:
GTGTGA => check for mismatches and sort read into bin
TCT+TCTTTCCCT => ignore mismatches in this region

So when I run this:
demuxbyname.sh in=S1_R1.fastq in2=S1_R2.fastq out=out_%_#.fq prefixmode=f names=barcode_names.txt hdist=2 outu=unmatched

I get reads in the unmatched file where some of the mismatches that led to read exclusion are outside the barcode region.

I tried the argument length=6, but it did not seem to solve the problem for me. I did not quite understand the documentation for the length argument, reproduced here:

length=0
"If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names.
For example, you could create files based on the first 8 characters of read names."

How do you specify if it is a suffix or prefix?
What does "instead of or in addition to the list of names" mean?

I did see the argument substring, so perhaps I could use that and it would produce similar results to what I want for the most part, but it's not technically what I want, since I only want to consider matches in the first 6 nts as valid.

I realize there is a workaround in that I could write a script to make a duplicate fastq file where I've truncated the barcodes to 6, then run it, then recover the original by read ID, but I was wondering if there is already a built-in way to do this with your tool and I am missing something in my reading of the docs.

Thanks in advance!
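
The truncation workaround mentioned above could be sketched as follows. This is an assumption-laden illustration, not a BBTools feature: `truncate_barcodes` is a hypothetical helper, and it assumes four-line FASTQ records with the barcode as the last colon-separated field of the header.

```shell
# Hypothetical helper: keep only the first LEN characters (default 6) of
# the barcode, i.e. the last colon-separated field, in every FASTQ header.
# Sequence, "+" and quality lines are passed through unchanged.
# Usage: truncate_barcodes FILE [LEN]
truncate_barcodes() {
    awk -v len="${2:-6}" '
        NR % 4 == 1 {
            i = match($0, /:[^:]*$/)   # position of the last colon
            if (i > 0) $0 = substr($0, 1, i) substr($0, i + 1, len)
        }
        { print }
    ' "$1"
}
```

Running `truncate_barcodes S1_R1.fastq 6 > S1_R1.trunc.fastq` (and the same for read 2) would turn the example header's barcode into GTGTGA. The demuxbyname.sh command from the post could then be pointed at the truncated copies, with the names file listing the 6-mers, so that mismatches are only counted within the barcode region.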


Efficient way to split FASTQ files based on Illumina indexes in the ID
