View Single Post
Old 09-24-2014, 12:40 AM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

That's a tall order, considering some of it is per-read and some is dependent on the entirety of your data. There is no tool of which I am aware that can do it all in a single command.

1) BBDuk ("qtrim + trimq" and "ktrim=r + ref" flags). It comes with Truseq and Nextera adapter files.
2) BBDuk ("maxns" flag)
3) BBDuk ("maq" flag - that stands for 'min average quality'). It's also possible to screen by %ID using BBMap, though, if you have a reference. If you want to rely on the sequencer's accuracy estimation, BBDuk's "maq" filters by overall expected error rate, so "maq=10" (phred-scaled) will eliminate reads in which at least 10% of the bases are expected to be incorrect. For 20%, the flag would be "maq=7".

#1-3 can be done in one command by BBDuk. The rest cannot.

4) It's important to note whether you have a reference. This can be done via mapping with various tools, or via matching with tools like Dedupe (which allows inexact matches but is usually less sensitive than mapping to a reference). Dedupe takes pairing into account, but it also uses a substantial amount of memory, for large libraries. Mapping-based tools require a reference, but less memory.
5) Error-correction requires consensus, and is much slower and more subjective than the other operations. There are various tools for this - I recommend BBNorm, with a command like " in=reads.fq out=corrected.fq". But there are other tools, such as Musket. Compared to other tool categories, I would be most worried about error-correction, as it is more subjective and has a greater chance of biasing your results. I do not recommend it except where necessary (such as when you have a huge amount of data, or very high substitution-type error rate, or highly variable coverage, as in amplified single-cell data). BBNorm does not use more memory or go slower as a result of more data.

Some of these functions are also possible in Trimmomatic and Cutadapt. Deduplication, if you have a reference, can also be done by samtools in conjunction with any pair-aware mapping program, using far less memory than Dedupe, though much more time.

Last edited by Brian Bushnell; 09-24-2014 at 12:59 AM.
Brian Bushnell is offline   Reply With Quote