![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
PubMed: A transcriptional sketch of a primary human breast cancer by 454 deep sequenc | Newsbot! | Literature Watch | 0 | 04-22-2009 06:02 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
MinHash Sketch is a method of rapidly comparing large strings or sets. In genomics, you can use it like this:
1) Gather all the kmers in a genome. 2) Apply a hash function to them. 3) Keep the 10000 smallest hashcodes and call this set a "sketch". If you do this for multiple genomes, you can calculate how similar two genomes are much faster than via alignment; and specifically, the greatest advantage is that the speed of sketch comparison is unrelated to genome size. For example, if you assemble something, you can sketch it and compare it to sketches of everything in nt or RefSeq in a couple seconds. If the top hit shows that your new sketch shared 90% of its hash codes with E.coli, that means it has roughly 90% kmer identity to E.coli, and therefore, it is E.coli. The BBMap package has an extremely efficient MinHash Sketch implementation, now accessible through 4 programs. The latest, just released in BBMap 36.92, is SendSketch: Code:
sendsketch.sh in=contigs.fa The other tools are for processing sketches locally. To sketch and query refseq, for example: Code:
sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence Code:
sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=taxa tree=tree.taxtree.gz gi=gitable.int1d.gz taxlevel=species Code:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz unzip taxdmp.zip taxtree.sh names.dmp nodes.dmp tree.taxtree.gz gitable.sh gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz Code:
comparesketch.sh in=contigs.fa refseq*.sketch Code:
kmercountexact.sh in=reads.fq sketch=out.sketch mincount=5 Code:
tadpole.sh in=reads.fq out=contigs.fa sendsketch.sh in=contigs.fa Code:
sendsketch.sh in=ref.fa Loaded 1 sketches in 1.288 seconds. Results for ChrIII_A_nidulans_FGSC_A4: WKID 99.75% KID 96.96% matches 9696 compared 9720 taxID 227321 gSize 43368715 Aspergillus nidulans FGSC A4 WKID 0.10% KID 0.05% matches 5 compared 5103 taxID 331117 gSize 14931471 Aspergillus fischeri NRRL 181 WKID 0.06% KID 0.04% matches 4 compared 6204 taxID 1509407 gSize 18030896 Aspergillus nomius NRRL 13137 WKID 0.05% KID 0.04% matches 4 compared 8485 taxID 5061 gSize 39417003 Aspergillus niger WKID 0.07% KID 0.03% matches 3 compared 4561 taxID 344612 gSize 13236981 Aspergillus clavatus NRRL 1 WKID 0.06% KID 0.03% matches 3 compared 4654 taxID 330879 gSize 13837118 Aspergillus fumigatus Af293 Total Time: 2.605 seconds. taxID is the NCBI taxon identifier, gSize is the number of bases used to create the sketch, and "compared" basically tells you how many kmers were compared before the went out of the kmer range of the larger sketch (this is related to genome size and WKID calculation). As you can see, there are a few kmer hits to other Aspergillus strains, but what we have here is clearly Aspergillus nidulans. Last edited by Brian Bushnell; 02-03-2017 at 08:53 AM. |
![]() |
![]() |
![]() |
#2 |
Registered Vendor
Location: Eugene, OR Join Date: May 2013
Posts: 521
|
![]()
Insane!
Perfect timing... I've been doing 10x linked read assemblies of metagenomes and this will greatly speed the characterization of the resulting scaffolds.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com |
![]() |
![]() |
![]() |
#3 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Oh, good! Currently, by the way, that comparesketch.sh (for local comparisons) currently has more configurability for displaying results than sendsketch.sh (in terms of being able to specify number of records returned per query, identity cutoffs, displaying complete lineage, etc). Eventually sendsketch.sh will catch up though. sendsketch.sh can currently do per-sequence queries from a single file, which is probably what you want to do.
|
![]() |
![]() |
![]() |
#4 |
Junior Member
Location: Leicester, UK Join Date: Aug 2017
Posts: 1
|
![]()
Super rad! I tried to use sendsketch.sh to quickly annotate a small assembly:
Code:
sendsketch.sh mode=sequence k=31,24 in=phage_07.fna ERROR: The sketch is not compatible with this server. Settings: k=31,24 amino=false You may need to download a newer version of BBTools; this is version 37.38 It's a fresh download of the most recent version, so not sure what to do! Super excited to try it out. In the meantime, will try local comparison! Last edited by nate85; 08-07-2017 at 11:02 AM. |
![]() |
![]() |
![]() |
#5 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Hi Nate,
Thanks for letting me know! I accidentally included old blacklists in 37.38. That's fixed in 37.40 which I just released. I validated that downloading it from SourceForge and running it works correctly, but please let me know if you still encounter problems. |
![]() |
![]() |
![]() |
#6 |
Member
Location: Paris, France Join Date: Apr 2017
Posts: 37
|
![]()
Brian,
I'd like to combine a few of your tools to put together a way to identify unknown, and possibly novel, viral sequences. I have some sequences with known hits to viruses that would map directly to the virus, and then also that will map to some viruses after being translated into different frames. Using tadpole, translate6frames and sendsketch, I am unable to consistently find all of the hits that I know are there. These are 150bp reads. This is what I tried tadpole.sh in=reads.fq out=contigs.fa k=50 translate6frames.sh in=contigs.fa out=contigs_6frames.fa aaout=f sendsketch.sh in=contigs_6frames.fa address=refseq Can you think of any way to improve this to increase sensitivity? Thanks for all of these tools by the way, sendsketch is incredible. I don't know what I would do without BBTools. |
![]() |
![]() |
![]() |
#7 | ||
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]() Quote:
You could try something like this (will take a while as large files need to be downloaded): First, run bbmap/pipelines/fetchTaxonomy.sh Then: Code:
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.*.protein.faa.gz cat viral.*.protein.faa.gz > refseq_viral.faa.gz gi2taxid.sh in=refseq_viral.fa.gz out=renamed.fa.gz tree=auto table=null accession=auto taxpath=/path/to/taxonomy sketch.sh in=renamed.fa.gz out=viral_AA.sketch mode=taxa tree=auto accession=null gi=null ow minsize=20 prefilter autosize k=9,6 amino taxpath=/path/to/taxonomy Code:
comparesketch.sh in=translated6frames.faa k=9,6 amino ref=viral_AA.sketch Code:
Query: Escherichia virus Lambda Seqs: 6 Bases: 97000 gSize: 94790 SketchLen: 969 TaxID: 10710 WKID KID ANI Complt Contam Matches Unique noHit TaxID gSize gSeqs taxName 100.00% 15.81% 100.00% 100.00% 32.37% 147 43 482 10710 14785 74 Escherichia virus Lambda 20.35% 3.76% 80.87% 100.00% 44.31% 35 0 484 10730 17271 80 Enterobacteria phage 933W 11.92% 3.78% 75.31% 100.00% 45.37% 31 0 417 194949 26090 170 Escherichia phage Stx2 II 11.37% 3.68% 74.84% 100.00% 45.37% 29 0 402 194948 25585 167 Escherichia Stx1 converting phage 15.03% 2.83% 77.67% 100.00% 45.48% 26 0 475 489779 17432 83 Escherichia phage Min27 (etc) That said, the RefSeq viral database is so small, and viral assemblies are so small, that it would be a lot faster in this case to just use BLAST (either versus just viral or all of RefSeq) unless you plan on doing a lot of queries. Quote:
|
||
![]() |
![]() |
![]() |
#8 |
Member
Location: Sweden Join Date: Jan 2017
Posts: 18
|
![]()
What would be the optimal way to use these sketching tools if I want to make metagenome sample comparisons similar to what e.g. Sourmash can do? My first thought would be to sketch each sample and then do an all-vs-all comparison using comparesketch.sh, something like this example:
Code:
for sample in sample1 sample2 sample3; do sketch.sh in=${sample}_R1.fq.gz out=${sample}.sketch.gz done comparesketch.sh alltoall *sketch.gz format=3 It looks like the "name" of each sketch is called after the first read in each file, which gives very hard-to-interpret names in the output listing. Is it possible to adjust this somehow, other than editing the NM0-field in the sketch file? Last edited by boulund; 04-14-2018 at 07:38 AM. |
![]() |
![]() |
![]() |
#9 | |
Member
Location: Sweden Join Date: Jan 2017
Posts: 18
|
![]() Quote:
Code:
name0= ![]() I wrote a quick Python script that takes the output from comparesketch.sh and produces a heatmap comparison of sample similarity, and it looks all right. Good for now! Thanks for a great tool Brian! |
|
![]() |
![]() |
![]() |
#10 |
Junior Member
Location: UK Join Date: Sep 2014
Posts: 6
|
![]()
Hi Brian Bushnell,
Has the new 'screen' function of Mash2.0 been incorporated into BBtools sketch yet? It would be super useful for finding stuff in the SRA db with sendsketch! Thanks! https://mash.readthedocs.io/en/latest/ |
![]() |
![]() |
![]() |
Tags |
bbmap, bbtools, blast, minhash, sketch, taxonomy |
Thread Tools | |
|
|