SEQanswers

08-10-2017, 10:14 AM   #1
cstack
Junior Member
 
Location: Florida, US

Join Date: May 2017
Posts: 9
bbduk.sh masking shorter than k-mer size

I've come across a situation using bbduk to mask the k-mers from one genome assembly in another, where the number of masked bases in some sequences is smaller than the k value.

I am running the command like this:
Code:
# version 37.something
bbduk.sh in=mygenome.fa out=mygenome_masked.fa ref=ecoli.fasta k=15 \
    qkmask=X maskfullycovered=t maskmiddle=f
In mygenome_masked.fa (a multi-sequence FASTA file), a sizable number of sequences have a total of 0 < bases_masked < k (=15). It seems strange to have, say, 3 nucleotides masked when k=15, and I am wondering if anyone can point out which options I should use to prevent this from happening.
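For anyone wanting to reproduce the count, here is a minimal sketch of how the per-sequence totals could be tallied. It assumes the mask character is X, as in the command above; the function names `masked_counts` and `partially_masked` and the toy records are made up for illustration and are not part of BBTools.

```python
def masked_counts(fasta_lines, mask_char="X"):
    """Return {sequence_name: number_of_masked_bases} from FASTA text lines."""
    counts = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = 0
        elif name is not None:
            counts[name] += line.count(mask_char)
    return counts


def partially_masked(counts, k):
    """Keep only sequences with 0 < masked bases < k."""
    return {n: c for n, c in counts.items() if 0 < c < k}


# Toy records standing in for mygenome_masked.fa (hypothetical data).
example = [">seq1", "ACGTXXXACGT", ">seq2", "ACGTACGT", ">seq3", "X" * 20]
counts = masked_counts(example)
short = partially_masked(counts, k=15)
print(short)  # {'seq1': 3}
```

On a real file, the lines could be read with `open("mygenome_masked.fa")` and passed in the same way.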
08-10-2017, 12:40 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,637

That would be the "maskfullycovered" flag. With it set, a base is masked only if every kmer overlapping it matches a reference kmer.

For example, take this sequence you want to mask:

Code:
ACGTTGCA
And this reference:

Code:
CGTTG
The ref kmers (ignoring reverse complement, at K=3) are CGT, GTT, and TTG. They line up like:
Code:
   TTG
  GTT
 CGT
ACGTTGCA
Every base in the middle of the sequence is covered by 3 kmers, but only the first T is "fully covered", meaning all 3 of the kmers covering it are ref kmers. So it's the only one masked. Without "maskfullycovered", the entire CGTTG would be masked.
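The example above can be replayed with a small simulation. This is only a sketch of the rule as described in this post, not BBDuk's actual code: it ignores reverse complements, and it assumes bases near the ends of a sequence (which fewer than k kmers overlap) never count as fully covered. The names `kmers` and `mask` are made up for illustration.

```python
def kmers(seq, k):
    """All kmers of seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mask(query, ref, k, fully_covered, mask_char="X"):
    """Mask query bases covered by ref kmers.

    fully_covered=True  -> mask only bases covered by k matching kmers
    fully_covered=False -> mask bases covered by at least one matching kmer
    """
    ref_kmers = kmers(ref, k)
    hits = [0] * len(query)  # matching kmers covering each base
    for i in range(len(query) - k + 1):
        if query[i:i + k] in ref_kmers:
            for j in range(i, i + k):
                hits[j] += 1
    needed = k if fully_covered else 1
    return "".join(mask_char if h >= needed else b
                   for b, h in zip(query, hits))


print(mask("ACGTTGCA", "CGTTG", 3, fully_covered=True))   # ACGXTGCA
print(mask("ACGTTGCA", "CGTTG", 3, fully_covered=False))  # AXXXXXCA

# Same rule at k=15: a shared stretch of 31 bases yields 17 matching
# kmers, but only 31 - 2*15 + 2 = 3 bases are covered by all 15.
shared = ("ACG" * 11)[:31]            # 31 bases, contains no T
query = "T" * 20 + shared + "T" * 20  # T flanks cannot match any ref kmer
masked = mask(query, shared, 15, fully_covered=True)
print(masked.count("X"))  # 3
```

Under these assumptions, a shared stretch of L bases (with L >= 2k - 2) leaves L - 2k + 2 bases masked, which is how totals smaller than k can show up.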

Incidentally, it depends on what your goal is, but normally I find K=15 to be very short for masking... typically I use K=31.