Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
PubMed: A transcriptional sketch of a primary human breast cancer by 454 deep sequenc Newsbot! Literature Watch 0 04-22-2009 05:02 AM

Thread Tools
Old 02-03-2017, 07:46 AM   #1
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,697
Default MinHash Sketch - A Tool for Rapid Sequence Comparison

MinHash Sketch is a method of rapidly comparing large strings or sets. In genomics, you can use it like this:

1) Gather all the kmers in a genome.
2) Apply a hash function to them.
3) Keep the 10000 smallest hashcodes and call this set a "sketch".

If you do this for multiple genomes, you can calculate how similar two genomes are much faster than via alignment; and specifically, the greatest advantage is that the speed of sketch comparison is unrelated to genome size. For example, if you assemble something, you can sketch it and compare it to sketches of everything in nt or RefSeq in a couple seconds. If the top hit shows that your new sketch shared 90% of its hash codes with E.coli, that means it has roughly 90% kmer identity to E.coli, and therefore, it is E.coli.

The BBMap package has an extremely efficient MinHash Sketch implementation, now accessible through 4 programs. The latest, just released in BBMap 36.92, is SendSketch:

Code: in=contigs.fa
This will make a sketch of your assembly, open an HTTP connection to JGI's taxonomy server, and send the sketch. The server will compare it to sketches of all of nt, and return the top hits. This is kind of convenient because the sketches sit around in memory all the time rather than needing to be loaded. Alternately, you can run a local server with "" if you want to use a different database.

The other tools are for processing sketches locally. To sketch and query refseq, for example:

Code: in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence
That will produce one sketch per sequence, and split them among 31 output files (for rapid loading). Alternatively, and preferably, you can produce one sketch per taxID, like this:

Code: in=refseq.fa.gz out=refseq#.sketch files=31 mode=taxa tree=tree.taxtree.gz gi=gitable.int1d.gz taxlevel=species
That will produce only one sketch per species, so for things like E.coli with thousands of variants, you won't get thousands of sketches, just one. For some purposes the subspecies level is probably better, though. tree.taxtree.gz and gitable.int1d.gz are created from NCBI's tax dump, like this:


unzip names.dmp nodes.dmp tree.taxtree.gz gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz
Now that you have created some sketches, you can query them like this:

Code: in=contigs.fa refseq*.sketch
"in=" can be a comma-delimited list of either fasta or sketch files. Also, you can turn an input fasta into one sketch per sequence rather than just a single sketch with the "mode=sequence" flag. If you want to process using taxonomic information, add "tree=taxtree.gz" as well. You alternately compare raw fastq reads if you want. KmerCountExact can produce sketches like this:

Code: in=reads.fq sketch=out.sketch mincount=5
That will make a sketch only from kmers occuring at least 5 times (to avoid error kmers). Overall, the process is probably similar speed-wise to assembling with Tadpole and feeding the assembly to SendSketch/CompareSketch:

Code: in=reads.fq out=contigs.fa in=contigs.fa
So, that's how you run Sketch. Now, what does the output mean? Here's an example, using a strain of Aspergillus:

Code: in=ref.fa

Loaded 1 sketches in    1.288 seconds.

Results for ChrIII_A_nidulans_FGSC_A4:

WKID 99.75%     KID 96.96%      matches 9696    compared 9720   taxID 227321    gSize 43368715  Aspergillus nidulans FGSC A4
WKID 0.10%      KID 0.05%       matches 5       compared 5103   taxID 331117    gSize 14931471  Aspergillus fischeri NRRL 181
WKID 0.06%      KID 0.04%       matches 4       compared 6204   taxID 1509407   gSize 18030896  Aspergillus nomius NRRL 13137
WKID 0.05%      KID 0.04%       matches 4       compared 8485   taxID 5061      gSize 39417003  Aspergillus niger
WKID 0.07%      KID 0.03%       matches 3       compared 4561   taxID 344612    gSize 13236981  Aspergillus clavatus NRRL 1
WKID 0.06%      KID 0.03%       matches 3       compared 4654   taxID 330879    gSize 13837118  Aspergillus fumigatus Af293

Total Time:     2.605 seconds.
"matches" is the number of matching kmers (well, technically hash codes) between the query and reference sketch. KID (or kmer identity) is simply the percent of the kmers that matched, while WKID is the weighted kmer identity, which has been adjusted to compensate for genome size. For example, say human chromosome 1 is 8% of the human genome. It would get a KID of 8% when compared to human, since it would only share 8% of the kmers, but the WKID would still be 100%. So, that's the column that tells you the actual sequence similarity (disregarding size).

taxID is the NCBI taxon identifier, gSize is the number of bases used to create the sketch, and "compared" basically tells you how many kmers were compared before the went out of the kmer range of the larger sketch (this is related to genome size and WKID calculation).

As you can see, there are a few kmer hits to other Aspergillus strains, but what we have here is clearly Aspergillus nidulans.

Last edited by Brian Bushnell; 02-03-2017 at 07:53 AM.
Brian Bushnell is offline   Reply With Quote
Old 02-03-2017, 11:55 AM   #2
Registered Vendor
Location: Eugene, OR

Join Date: May 2013
Posts: 430


Perfect timing... I've been doing 10x linked read assemblies of metagenomes and this will greatly speed the characterization of the resulting scaffolds.
Providing nextRAD genotyping and PacBio sequencing services.
SNPsaurus is offline   Reply With Quote
Old 02-03-2017, 12:12 PM   #3
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,697

Oh, good! Currently, by the way, that (for local comparisons) currently has more configurability for displaying results than (in terms of being able to specify number of records returned per query, identity cutoffs, displaying complete lineage, etc). Eventually will catch up though. can currently do per-sequence queries from a single file, which is probably what you want to do.
Brian Bushnell is offline   Reply With Quote
Old 08-07-2017, 09:59 AM   #4
Junior Member
Location: Leicester, UK

Join Date: Aug 2017
Posts: 1

Super rad! I tried to use to quickly annotate a small assembly:

Code: mode=sequence k=31,24 in=phage_07.fna
but got the following:

ERROR: The sketch is not compatible with this server.
Settings: k=31,24 amino=false
You may need to download a newer version of BBTools; this is version 37.38

It's a fresh download of the most recent version, so not sure what to do! Super excited to try it out. In the meantime, will try local comparison!

Last edited by nate85; 08-07-2017 at 10:02 AM.
nate85 is offline   Reply With Quote
Old 08-07-2017, 02:19 PM   #5
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,697

Hi Nate,

Thanks for letting me know! I accidentally included old blacklists in 37.38. That's fixed in 37.40 which I just released. I validated that downloading it from SourceForge and running it works correctly, but please let me know if you still encounter problems.
Brian Bushnell is offline   Reply With Quote
Old 08-27-2017, 11:33 AM   #6
Location: Paris, France

Join Date: Apr 2017
Posts: 37


I'd like to combine a few of your tools to put together a way to identify unknown, and possibly novel, viral sequences.

I have some sequences with known hits to viruses that would map directly to the virus, and then also that will map to some viruses after being translated into different frames.

Using tadpole, translate6frames and sendsketch, I am unable to consistently find all of the hits that I know are there. These are 150bp reads.

This is what I tried in=reads.fq out=contigs.fa k=50 in=contigs.fa out=contigs_6frames.fa aaout=f in=contigs_6frames.fa address=refseq

Can you think of any way to improve this to increase sensitivity?

Thanks for all of these tools by the way, sendsketch is incredible. I don't know what I would do without BBTools.
jweger1988 is offline   Reply With Quote
Old 08-28-2017, 10:33 AM   #7
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,697

Originally Posted by jweger1988 View Post
This is what I tried in=reads.fq out=contigs.fa k=50 in=contigs.fa out=contigs_6frames.fa aaout=f in=contigs_6frames.fa address=refseq

Can you think of any way to improve this to increase sensitivity?
Actually, that should not work at all. The JGI RefSeq sketch server is using nucleotide references rather than protein references, so you should try without using translate6frames. To use an amino acid query you need to add the flag "amino", and basically use with a local reference, because none of the JGI sketch servers use amino acids currently. I may add one for nr though.

You could try something like this (will take a while as large files need to be downloaded):

First, run bbmap/pipelines/



cat viral.*.protein.faa.gz > refseq_viral.faa.gz in=refseq_viral.fa.gz out=renamed.fa.gz tree=auto table=null accession=auto taxpath=/path/to/taxonomy in=renamed.fa.gz out=viral_AA.sketch mode=taxa tree=auto accession=null gi=null ow minsize=20 prefilter autosize k=9,6 amino taxpath=/path/to/taxonomy
That will give you a local copy of a all RefSeq viral proteins, indexed on a per-taxa level. It's much simpler to do on a per-sequence level (since you don't need all the taxonomy-related stuff), but the sensitivity is much lower because proteins are so short, so I don't really recommend it. Sketch is really designed for long reference sequences. To subsequently run a comparison:

Code: in=translated6frames.faa k=9,6 amino ref=viral_AA.sketch
Running that series of commands on Lambda phage, I got:

Query: Escherichia virus Lambda Seqs: 6         Bases: 97000    gSize: 94790    SketchLen: 969  TaxID: 10710
WKID    KID     ANI     Complt  Contam  Matches Unique  noHit   TaxID   gSize   gSeqs   taxName
100.00% 15.81%  100.00% 100.00% 32.37%  147     43      482     10710   14785   74      Escherichia virus Lambda
20.35%  3.76%   80.87%  100.00% 44.31%  35      0       484     10730   17271   80      Enterobacteria phage 933W
11.92%  3.78%   75.31%  100.00% 45.37%  31      0       417     194949  26090   170     Escherichia phage Stx2 II
11.37%  3.68%   74.84%  100.00% 45.37%  29      0       402     194948  25585   167     Escherichia Stx1 converting phage
15.03%  2.83%   77.67%  100.00% 45.48%  26      0       475     489779  17432   83      Escherichia phage Min27
You might need the latest version which I just uploaded now (37.50), though, which fixes an "amino" flag parse error.

That said, the RefSeq viral database is so small, and viral assemblies are so small, that it would be a lot faster in this case to just use BLAST (either versus just viral or all of RefSeq) unless you plan on doing a lot of queries.

Thanks for all of these tools by the way, sendsketch is incredible. I don't know what I would do without BBTools.
You're welcome; I'm glad you're finding it helpful!
Brian Bushnell is offline   Reply With Quote

bbmap, bbtools, blast, minhash, sketch, taxonomy

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 02:15 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO