Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
PubMed: A transcriptional sketch of a primary human breast cancer by 454 deep sequenc Newsbot! Literature Watch 0 04-22-2009 05:02 AM

Thread Tools
Old 02-03-2017, 07:46 AM   #1
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,637
Default MinHash Sketch - A Tool for Rapid Sequence Comparison

MinHash Sketch is a method of rapidly comparing large strings or sets. In genomics, you can use it like this:

1) Gather all the kmers in a genome.
2) Apply a hash function to them.
3) Keep the 10000 smallest hashcodes and call this set a "sketch".

If you do this for multiple genomes, you can calculate how similar two genomes are much faster than via alignment; and specifically, the greatest advantage is that the speed of sketch comparison is unrelated to genome size. For example, if you assemble something, you can sketch it and compare it to sketches of everything in nt or RefSeq in a couple seconds. If the top hit shows that your new sketch shared 90% of its hash codes with E.coli, that means it has roughly 90% kmer identity to E.coli, and therefore, it is E.coli.

The BBMap package has an extremely efficient MinHash Sketch implementation, now accessible through 4 programs. The latest, just released in BBMap 36.92, is SendSketch:

Code: in=contigs.fa
This will make a sketch of your assembly, open an HTTP connection to JGI's taxonomy server, and send the sketch. The server will compare it to sketches of all of nt, and return the top hits. This is kind of convenient because the sketches sit around in memory all the time rather than needing to be loaded. Alternately, you can run a local server with "" if you want to use a different database.

The other tools are for processing sketches locally. To sketch and query refseq, for example:

Code: in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence
That will produce one sketch per sequence, and split them among 31 output files (for rapid loading). Alternatively, and preferably, you can produce one sketch per taxID, like this:

Code: in=refseq.fa.gz out=refseq#.sketch files=31 mode=taxa tree=tree.taxtree.gz gi=gitable.int1d.gz taxlevel=species
That will produce only one sketch per species, so for things like E.coli with thousands of variants, you won't get thousands of sketches, just one. For some purposes the subspecies level is probably better, though. tree.taxtree.gz and gitable.int1d.gz are created from NCBI's tax dump, like this:


unzip names.dmp nodes.dmp tree.taxtree.gz gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz
Now that you have created some sketches, you can query them like this:

Code: in=contigs.fa refseq*.sketch
"in=" can be a comma-delimited list of either fasta or sketch files. Also, you can turn an input fasta into one sketch per sequence rather than just a single sketch with the "mode=sequence" flag. If you want to process using taxonomic information, add "tree=taxtree.gz" as well. You alternately compare raw fastq reads if you want. KmerCountExact can produce sketches like this:

Code: in=reads.fq sketch=out.sketch mincount=5
That will make a sketch only from kmers occuring at least 5 times (to avoid error kmers). Overall, the process is probably similar speed-wise to assembling with Tadpole and feeding the assembly to SendSketch/CompareSketch:

Code: in=reads.fq out=contigs.fa in=contigs.fa
So, that's how you run Sketch. Now, what does the output mean? Here's an example, using a strain of Aspergillus:

Code: in=ref.fa

Loaded 1 sketches in    1.288 seconds.

Results for ChrIII_A_nidulans_FGSC_A4:

WKID 99.75%     KID 96.96%      matches 9696    compared 9720   taxID 227321    gSize 43368715  Aspergillus nidulans FGSC A4
WKID 0.10%      KID 0.05%       matches 5       compared 5103   taxID 331117    gSize 14931471  Aspergillus fischeri NRRL 181
WKID 0.06%      KID 0.04%       matches 4       compared 6204   taxID 1509407   gSize 18030896  Aspergillus nomius NRRL 13137
WKID 0.05%      KID 0.04%       matches 4       compared 8485   taxID 5061      gSize 39417003  Aspergillus niger
WKID 0.07%      KID 0.03%       matches 3       compared 4561   taxID 344612    gSize 13236981  Aspergillus clavatus NRRL 1
WKID 0.06%      KID 0.03%       matches 3       compared 4654   taxID 330879    gSize 13837118  Aspergillus fumigatus Af293

Total Time:     2.605 seconds.
"matches" is the number of matching kmers (well, technically hash codes) between the query and reference sketch. KID (or kmer identity) is simply the percent of the kmers that matched, while WKID is the weighted kmer identity, which has been adjusted to compensate for genome size. For example, say human chromosome 1 is 8% of the human genome. It would get a KID of 8% when compared to human, since it would only share 8% of the kmers, but the WKID would still be 100%. So, that's the column that tells you the actual sequence similarity (disregarding size).

taxID is the NCBI taxon identifier, gSize is the number of bases used to create the sketch, and "compared" basically tells you how many kmers were compared before the went out of the kmer range of the larger sketch (this is related to genome size and WKID calculation).

As you can see, there are a few kmer hits to other Aspergillus strains, but what we have here is clearly Aspergillus nidulans.

Last edited by Brian Bushnell; 02-03-2017 at 07:53 AM.
Brian Bushnell is offline   Reply With Quote
Old 02-03-2017, 11:55 AM   #2
Registered Vendor
Location: Eugene, OR

Join Date: May 2013
Posts: 414


Perfect timing... I've been doing 10x linked read assemblies of metagenomes and this will greatly speed the characterization of the resulting scaffolds.
Providing nextRAD genotyping services.
SNPsaurus is offline   Reply With Quote
Old 02-03-2017, 12:12 PM   #3
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,637

Oh, good! Currently, by the way, that (for local comparisons) currently has more configurability for displaying results than (in terms of being able to specify number of records returned per query, identity cutoffs, displaying complete lineage, etc). Eventually will catch up though. can currently do per-sequence queries from a single file, which is probably what you want to do.
Brian Bushnell is offline   Reply With Quote
Old 08-07-2017, 09:59 AM   #4
Junior Member
Location: Leicester, UK

Join Date: Aug 2017
Posts: 1

Super rad! I tried to use to quickly annotate a small assembly:

Code: mode=sequence k=31,24 in=phage_07.fna
but got the following:

ERROR: The sketch is not compatible with this server.
Settings: k=31,24 amino=false
You may need to download a newer version of BBTools; this is version 37.38

It's a fresh download of the most recent version, so not sure what to do! Super excited to try it out. In the meantime, will try local comparison!

Last edited by nate85; 08-07-2017 at 10:02 AM.
nate85 is offline   Reply With Quote
Old 08-07-2017, 02:19 PM   #5
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,637

Hi Nate,

Thanks for letting me know! I accidentally included old blacklists in 37.38. That's fixed in 37.40 which I just released. I validated that downloading it from SourceForge and running it works correctly, but please let me know if you still encounter problems.
Brian Bushnell is offline   Reply With Quote

bbmap, bbtools, blast, minhash, sketch, taxonomy

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 12:51 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO