View Single Post
Old 04-24-2014, 03:00 PM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default Introducing BBDuk: Adapter/Quality Trimming and Filtering

I'd like to introduce another member of the BBTools package, BBDuk. "Duk" stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools. It can do lots of different things, for example -

Adapter trimming:
bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
or
bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

(if your data is very low quality, you may wish to use more sensitive settings of hdist=2 k=21)

...where "ktrim=r" is for right-trimming (3' adapters), "ktrim=l" is for left-trimming (5' adapters), and "ktrim=N" just masks the kmers with "N". "k" specifies the kmer size to use (must be at most the length of the adapters) while "mink" allows it to use shorter kmers at the ends of the read (for example, k=11 for the last 11 bases). "hdist" means "hamming distance"; this allows one mismatch. Instead of "ref=file" you can alternately (or additionally) say "literal=ACTGGT,TTTGGTG" for those two literal strings. The BBTools package currently includes Truseq and Nextera adapters sequences in /bbmap/resources/ as truseq.fa.gz, truseq_rna.fa.gz, and nextera.fa.gz. You can specify whether or not BBDuk looks for the reverse-complement of the reference sequences as well as the forward sequence with the flag "rcomp=t" or "rcomp=f"; by default it looks for both. You can also restrict kmer operations such as adapter-trimming to only read 1 or read 2 with the "skipr1" or "skipr2" flags, or restrict them to the leftmost or rightmost X bases of a read with the restrictleft or restrictright flags. For normal paired-end fragment libraries, I recommend adding the flags "tbo", which specifies to also trim adapters based on pair overlap detection (which does not require known adapter sequences), and "tpe", which specifies to trim both reads to the same length (in the event that an adapter kmer was only detected in one of them).

Quality trimming:
bbduk.sh -Xmx1g in=reads.fq out=clean.fq qtrim=rl trimq=10
or
bbduk.sh -Xmx1g in=read1.fq in=read2.fq out1=clean1.fq out2=clean2.fq qtrim=rl trimq=10

This will quality-trim to Q10 using the Phred algorithm, which is much more accurate than naive trimming. "qtrim=rl" means it will trim the left and right sides; you can instead set "qtrim=l" or "qtrim=r".

Contaminant filtering:
bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt
or
bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=phix.fa k=31 hdist=1 stats=stats.txt

This will remove all reads that have a 31-mer match to phix (a common Illumina spikein, which is included in /bbmap/resources/), again allowing one mismatch. The "outm" stream will catch reads that matched a reference kmers. This allows you to split a set of reads based on the presence of something. "stats" will produce a report of which contaminant sequences were seen, and how many reads had them.

Kmer masking:
bbduk.sh -Xmx1g in=ecoli.fa out=ecoli_masked.fq ref=salmonella.fa k=25 ktrim=N rskip=0

This will mask all 25-mers in E.coli that are also shared by Salmonella, by converting them to N. You can change them to some other letter instead, like X.

BBDuk can handle single-ended reads, pairs in two files, or pairs interleaved in a single file (which will be autodetected based on read names, or can be overridden with the 'int=t' or 'int=f' flag). For example:

bbduk.sh -Xmx1g in=reads.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=r trimq=10

This will process the input as interleaved (which is forced since there are two output files). It will trim to Q10 on the right end only, and throw away reads shorter than 25bp after trimming. If one read is too short and the other read is OK, both will be thrown away, to preserve pair ordering. This can be changed with the "rieb" (removeIfEitherBad) flag; setting it to false will keep the reads unless both of them are too short.

BBDuk can process fasta, fastq, scarf, qual, and sam files, raw, gzipped, or bzipped. It can also handle references that are very large (even the human genome) in a reasonable amount of time and memory, if you increase the -Xmx parameter; it needs around 15 to 30 bytes per kmer. It can do operations such as quality-trimming and contaminant-filtering in the same pass, but not two different operations using kmers (such as left and right kmer trimming), although BBDuk2 can do that. BBDuk can also do various other filtering procedures such as complexity filtering, length filtering, gc-content filtering, average-quality filtering, chastity-filtering, and filtering by number of Ns.

For more details and features, you can run bbduk.sh with no parameters for the help menu, or ask a question in this thread.

Note! If your OS does not support shellscripts, you would replace
bbduk.sh -Xmx1g
with
java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF
...but leave the rest of the command the same. "C:\bbmap\current" would vary depending on where you unzipped bbmap.

The "-Xmx1g" flag tells BBDuk to use 1GB of RAM. When using the shellscript, BBDuk does not strictly need that flag and can autodetect the amount of memory available. When using a large reference, or a large value of "hdist" or "edist" (hamming distance and edit distance, both of which greatly increase the amount of memory needed), you may need to set this higher.


P.S. When processing dual files, instead of "in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq", you can simply use "in=r#.fq out=clean#.fq". The "#" symbol will be replaced by 1 and 2.

Last edited by Brian Bushnell; 10-26-2015 at 10:04 AM.
Brian Bushnell is offline   Reply With Quote