Seqanswers Leaderboard Ad

**Brian Bushnell** · 02-16-2017, 07:01 PM

The best way to do this is typically with BBDuk or Seal, which operate on kmer-matching but allow a hamming distance. They're ideal for 14bp sequences, and are very fast. For example:

bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f

This will put all the reads in match or unmatched, requiring exact matches. If you want to allow mismatches, use "hdist"; for example, this will allow 3 mismatches:

bbduk.sh in=reads.fq ref=whitelist.fa outm=match.fq out=unmatched.fq k=14 mm=f hdist=3

Alternatively, Seal will split the input file into one output file per reference sequence:

seal.sh in=reads.fq ref=whitelist.fa pattern=match_%.fq out=unmatched.fq k=14 mm=f hdist=3

You can add the "ordered" flag if you want the reads to remain in input order. Note that it is also possible to do this via alignment (e.g. with bowtie1 or with BBMap), which would produce a sam file, but it may require tweaking default parameters. With BBMap for example it would be something like:

bbmap.sh ref=whitelist.fa in=reads.fq out=mapped.sam k=7 vslow strictmaxindel=0

...but, I think Seal or BBDuk would be more convenient.

**bloosnail** · 02-17-2017, 12:34 PM

Hi Brian,

Thank you for your response, this seems to be useful for our purposes. I am wondering if BBDuk has some functionality to output the actual sequence in the .fa that each read aligned to, not just whether a read has valid alignments or not. Our specific end goal is to find the subset of whitelisted reads that is most similar to each barcode.

Additionally, we are only trying to find the subset of whitelisted barcodes with the minimum Hamming distance for each barcode, not which barcodes have a reference within a certain Hamming distance away. So I think (if it is able to output the sequence of the reference .fa that a read is aligned it) we could run BBDuk iteratively over increasing Hamming distance from 1-14, then starting from Hamming distance of 1 up to 14, check if a read has passed yet or not, and exclude it from later iterations once it has passed.

So specifically, my main question is whether BBDuk can output the the actual reference sequence that a read aligned to, or if you have some recommendation on a tool that can do that along with the other criteria listed previously? Thanks.

**Brian Bushnell** · 02-17-2017, 05:59 PM

Actually, BBDuk will not go over a hamming distance of around 7 very easily. The exact limit depends on the kmer length, number of sequences, and amount of memory, but the time and memory increase exponentially with (14*3)^(hamming distance). But then, finding a 14-mer match with a hamming distance of 14 is not very interesting anyway

Anyway, iteratively increasing hamming distance from 0 to some limit is probably the best way to go, unless there's a specific program that simply brute-force calculate all the hamming distances... I do have one like that for bar codes (where there's no computational cost for mismatches, unlike BBDuk or alignment algorithms) but it's not quite designed for this use. But if this is a common need for processing 10X data I'll plan to modify it to suit this purpose exactly (we don't use 10X data currently, though I am investigating it for the purpose of scaffolding).

Anyway - if you use Seal, you will get one file for each sequence in the reference fasta, containing all the reads matching that sequence, which sounds possibly like what you want. Alternatively, you could do this:

Code:

bbduk.sh in=reads.fq ref=whitelist.fa outm=matched0.fq out=unmatched0.fq k=14 mm=f hdist=0 rename fbm

bbduk.sh in=unmatched0.fq ref=whitelist.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f hdist=1 rename fbm

bbduk.sh in=unmatched1.fq ref=whitelist.fa outm=matched2.fq out=unmatched2.fq k=14 mm=f hdist=2 rename fbm

...etc. That will rename the reads to indicate which sequence they matched (it will add it as a suffix so the original name is also retained). By default BBDuk will not let you go over hdist=3 unless you add the flag "-da". But, there are 2 ways to increase hamming distance in BBDuk, with "hdist" and "qhdist"; the first makes mutant copies of reference kmers, and the second makes mutant copies of query kmers. So to achieve hamming distance 6, it's most efficient to use "hdist=3 qhdist=3", since each is independently exponential.

**nucacidhunter** · 02-17-2017, 10:22 PM

Cell Ranger corrects one base mismatches if it confidently can be assigned to one of whitelist barcodes. 10x software might have a setting to relax match criteria.

**bloosnail** · 02-19-2017, 01:09 PM

@Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)". Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.

@nucacidhunter, I tried e-mailing 10X and they said they do not currently have built-in functionality to relax the settings, but there is a line in the source code where one can relax the cut-off for filtering. We also may try this, thank you for the suggestion.

**Brian Bushnell** · 02-21-2017, 04:17 PM

Originally posted by bloosnail View Post

@Brian, I am currently running BBDuk and it seems to be working well. Thank you for your suggestion. I am wondering if there is reasonable increase in time if more cores are used -- from the documentation, "This uses a fixed number of threads, currently 7, which cannot be changed (except by editing the “WAYS=7” line in BBDuk’s source code and recompiling)".

To clarify, BBDuk scales fine to an arbitrary number of threads for processing the reads (using the "threads" flag, but by default it uses all available threads). It uses 7 threads only when processing the reference. In most cases processing the reference takes a trivial amount of time, but it can take a while with a high hamming distance.

Increasing WAYS will increase speed up to the number of cores you have (the number needs to be odd though). If WAYS is above your core count it will decrease speed instead. But it doesn't increase speed all that much (all of the arithmetic is replicated; only the memory random accesses are parallel) so 7 is usually fine. Once the reference is processed, WAYS has no effect on runtime for processing the reads. So it's only useful to adjust if the majority of your time is spent processing the reference.

Note that "hdist" increases the time needed for loading the reference, while "qhdist" increases the time needed to process the reads.

Also, I think there is a slight typo in the manual in the installation section (http://jgi.doe.gov/data-and-tools/bb...llation-guide/), I think the command "$ tar -xvfz BBMap_(version)" should be "$ tar -xvzf BBMap_(version).tar.gz". Somewhat trivial, but could be confusing for someone without experience on the command line.

Thanks; I'll fix that!

**bloosnail** · 02-23-2017, 06:00 PM

Hi Brian,

I am looking more closely at the output from BBDuk and I think there may be a problem. It appears that whenever a read has multiple matches, the name of the barcode that is aligned to is always the same. For example, here is a grep'd read with multiple matches on the output .fq and this is what is returned:

@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1
@NS500211:254:HGWNFBGX2:1:11101:11155:16312 barcode_number_536269=1

So what would be expected is numerous hits for the read ID, but ideally the barcode names would be different. It appears it is like this for every read with multiple matches. Would you know what the problem is? Here is the command I have been using (or some slight variation of it):

$ ../BBDuk/bbmap/bbduk.sh in=../no_CB_no_head.fq ref=
../bt2_index/barcodes.fa outm=matched1.fq out=unmatched1.fq k=14 mm=f -da hdist=1 rcomp=f rename fbm

**Brian Bushnell** · 02-23-2017, 10:39 PM

BBDuk and Seal are very similar, but different in an important way. BBDuk associates each kmer with a single reference sequence - specifically, the first one it encounters in the reference fasta. Seal associates each kmer with every reference sequence containing that kmer. The syntax is very similar, and both support read renaming. For contaminant removal or adapter-trimming, BBDuk is ideal. But in a situation where it is important to distinguish between similar things - such as isoform expression quantification, contaminant detection, and barcode matching... it's important to use Seal.

Sorry, I realize now that I should have been more clear on this initially. If I was routinely using 10X data I'm sure I'd have a protocol in place, but this is new to me. With a low hdist setting and a high hamming distance between barcodes (which was my initial expectation) the results are identical, but once you increase hdist/qhdist sufficiently, Seal should be used instead of BBDuk, when it's important to know which reference sequence a query hit.

Note that Seal still does not distinguish between 2 mismatches for reference sequence A versus 3 mismatches for reference sequence B. If you set hdist=2, the query will get tagged as matching A only. If you set hdist=3, the query will get tagged as matching both A and B with no indication that it matches A better; that information is not stored. I designed it for longer sequences where ties could be broken by the number of shared kmers; but when the sequence length is equal to the kmer length, obviously, that won't work as a tie-breaker since it's either 1 or 0.

Anyway - I think Seal is a good tool for this purpose, but tools are always best when you know their limitations!

**bloosnail** · 02-24-2017, 11:54 AM

Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

../BBDuk/bbmap/seal.sh in
=unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
=14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.

**Brian Bushnell** · 02-24-2017, 02:38 PM

Originally posted by bloosnail View Post

Thank you for your thorough and quick response Brian. I have tried Seal out and it seems to be giving the correct output -- same number of reads but the multi-mapped reads have different IDs on them. Is it possible to still split up the hdist and qhdist to sum up to the desired Hamming distance to decrease exponential runtime? I tried running something like this:

../BBDuk/bbmap/seal.sh in
=unmatched3.fq ref=../bt2_index/barcodes.fa out=matched4.fq outu=unmatched4.fq k
=14 mm=f -da hdist=2 qhdist=2 rcomp=f rename fbm overwrite

for matches on Hamming distance of 4 and it gives a lot of NullPointerExceptions and crashes.

It should not crash, so please send me all of the screen output (it all gets written to stderr). Of course it would be great if you could send me the input data as well, if it's not confidential.

**bloosnail** · 02-24-2017, 04:11 PM

Here is a link to the screen output from the command that I posted earlier: https://drive.google.com/open?id=0B_...0ZCQ3dQei1aUFU

Also, the number of input reads at the bottom should be 4,568,965 instead of 51. I'm not sure my supervisor would want me to post the data and it's also pretty big, but if the problem is not clear I can PM you a shortened version of it. Thanks for taking a look!

**Brian Bushnell** · 02-25-2017, 07:48 AM

I found and fixed the problem, and will upload the fixed version on Monday.

**Brian Bushnell** · 02-28-2017, 05:53 PM

This issue is now resolved as of BBMap v37.00!

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Today, 11:49 AM	0 responses 12 views 0 likes	Last Post by seqadmin Today, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Good way to align very, very short reads?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News