SEQanswers

Old 04-24-2014, 03:00 PM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default Introducing BBDuk: Adapter/Quality Trimming and Filtering

I'd like to introduce another member of the BBTools package, BBDuk. "Duk" stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools. It can do lots of different things; for example:

Adapter trimming:
bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
or
bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

(if your data is very low quality, you may wish to use more sensitive settings of hdist=2 k=21)

...where "ktrim=r" is for right-trimming (3' adapters), "ktrim=l" is for left-trimming (5' adapters), and "ktrim=N" just masks the kmers with "N". "k" specifies the kmer size to use (it must be at most the length of the adapters), while "mink" allows shorter kmers to be used at the ends of the read (for example, k=11 for the last 11 bases). "hdist" means "hamming distance"; hdist=1 allows one mismatch. Instead of "ref=file" you can alternately (or additionally) say "literal=ACTGGT,TTTGGTG" for those two literal strings. The BBTools package currently includes TruSeq and Nextera adapter sequences in /bbmap/resources/ as truseq.fa.gz, truseq_rna.fa.gz, and nextera.fa.gz.

You can specify whether BBDuk looks for the reverse-complement of the reference sequences as well as the forward sequence with the flag "rcomp=t" or "rcomp=f"; by default it looks for both. You can also restrict kmer operations such as adapter-trimming to only read 1 or read 2 with the "skipr1" or "skipr2" flags, or restrict them to the leftmost or rightmost X bases of a read with the "restrictleft" or "restrictright" flags.

For normal paired-end fragment libraries, I recommend adding the flags "tbo", which trims adapters based on pair overlap detection (and does not require known adapter sequences), and "tpe", which trims both reads to the same length (in the event that an adapter kmer was only detected in one of them).
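To make the kmer-trimming mechanics concrete, here is a minimal Python sketch (an illustration of the idea, not BBDuk's actual code): the reference kmer set is expanded with all variants within "hdist" mismatches, and the read is cut at the leftmost kmer hit.

```python
def kmer_trim_right(read, adapter, k=5, hdist=0):
    """Toy illustration of ktrim=r with a hamming distance:
    trim a read at the leftmost position matching a reference kmer."""
    kmers = set()

    def add_with_mismatches(kmer, d):
        # Store the kmer plus every variant within d substitutions.
        kmers.add(kmer)
        if d == 0:
            return
        for i in range(len(kmer)):
            for b in "ACGT":
                add_with_mismatches(kmer[:i] + b + kmer[i + 1:], d - 1)

    for i in range(len(adapter) - k + 1):
        add_with_mismatches(adapter[i:i + k], hdist)

    # Scan the read left to right; the first kmer hit marks the trim
    # point, and everything from there to the 3' end is removed.
    for i in range(len(read) - k + 1):
        if read[i:i + k] in kmers:
            return read[:i]
    return read
```

With hdist=1, an adapter copy containing one sequencing error is still detected; this is also why hdist and edist increase memory use, since the stored kmer set grows substantially per allowed mismatch.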

Quality trimming:
bbduk.sh -Xmx1g in=reads.fq out=clean.fq qtrim=rl trimq=10
or
bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq qtrim=rl trimq=10

This will quality-trim to Q10 using the Phred algorithm, which is much more accurate than naive trimming. "qtrim=rl" means it will trim the left and right sides; you can instead set "qtrim=l" or "qtrim=r".

Contaminant filtering:
bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt
or
bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=phix.fa k=31 hdist=1 stats=stats.txt

This will remove all reads that have a 31-mer match to PhiX (a common Illumina spike-in, which is included in /bbmap/resources/), again allowing one mismatch. The "outm" stream will catch reads that matched reference kmers, which allows you to split a set of reads based on the presence of some sequence. "stats" will produce a report of which contaminant sequences were seen, and how many reads contained them.
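The out/outm split can be pictured with a tiny Python sketch (an illustration, not BBDuk's implementation): a read goes to the "matched" stream as soon as any of its kmers is found in the reference kmer set.

```python
def split_by_reference(reads, ref, k):
    """Route each read to 'matched' if it shares any k-mer with ref,
    otherwise to 'unmatched' (the outm / out split, sketched)."""
    ref_kmers = {ref[i:i + k] for i in range(len(ref) - k + 1)}
    matched, unmatched = [], []
    for read in reads:
        hit = any(read[i:i + k] in ref_kmers
                  for i in range(len(read) - k + 1))
        (matched if hit else unmatched).append(read)
    return matched, unmatched
```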

Kmer masking:
bbduk.sh -Xmx1g in=ecoli.fa out=ecoli_masked.fa ref=salmonella.fa k=25 ktrim=N rskip=0

This will mask all 25-mers in E. coli that are also shared with Salmonella, by converting them to N. You can change them to some other letter instead, such as X.
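The masking behavior can be sketched like this (a simplified model of ktrim=N, not BBDuk's code): every position covered by a shared kmer is replaced with N.

```python
def mask_shared_kmers(target, ref, k):
    """Replace every position of 'target' covered by a k-mer that also
    occurs in 'ref' with N (sketch of the ktrim=N masking behavior)."""
    ref_kmers = {ref[i:i + k] for i in range(len(ref) - k + 1)}
    out = list(target)
    for i in range(len(target) - k + 1):
        if target[i:i + k] in ref_kmers:   # shared kmer found here
            out[i:i + k] = "N" * k         # mask the whole window
    return "".join(out)
```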

BBDuk can handle single-ended reads, pairs in two files, or pairs interleaved in a single file (which will be autodetected based on read names, or can be overridden with the 'int=t' or 'int=f' flag). For example:

bbduk.sh -Xmx1g in=reads.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=r trimq=10

This will process the input as interleaved (which is forced since there are two output files). It will trim to Q10 on the right end only, and throw away reads shorter than 25bp after trimming. If one read is too short and the other read is OK, both will be thrown away, to preserve pair ordering. This can be changed with the "rieb" (removeIfEitherBad) flag; setting it to false will keep the reads unless both of them are too short.

BBDuk can process fasta, fastq, scarf, qual, and sam files, raw, gzipped, or bzipped. It can also handle references that are very large (even the human genome) in a reasonable amount of time and memory, if you increase the -Xmx parameter; it needs around 15 to 30 bytes per kmer. It can do operations such as quality-trimming and contaminant-filtering in the same pass, but not two different operations using kmers (such as left and right kmer trimming), although BBDuk2 can do that. BBDuk can also do various other filtering procedures such as complexity filtering, length filtering, gc-content filtering, average-quality filtering, chastity-filtering, and filtering by number of Ns.

For more details and features, you can run bbduk.sh with no parameters for the help menu, or ask a question in this thread.

Note! If your OS does not support shellscripts, you would replace
bbduk.sh -Xmx1g
with
java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF
...but leave the rest of the command the same. "C:\bbmap\current" would vary depending on where you unzipped bbmap.

The "-Xmx1g" flag tells BBDuk to use 1GB of RAM. When using the shellscript, BBDuk does not strictly need that flag and can autodetect the amount of memory available. When using a large reference, or a large value of "hdist" or "edist" (hamming distance and edit distance, both of which greatly increase the amount of memory needed), you may need to set this higher.


P.S. When processing dual files, instead of "in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq", you can simply use "in=r#.fq out=clean#.fq". The "#" symbol will be replaced by 1 and 2.

Last edited by Brian Bushnell; 10-26-2015 at 10:04 AM.
Brian Bushnell is offline   Reply With Quote
Old 04-24-2014, 03:01 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Now for some comparative performance tests. I generated some synthetic adapter-contaminated data, using this methodology:

First, I grabbed the first million reads from some bacterial project, 150bp HiSeq. This is to get real quality distributions.

Then I made a synthetic adapter file, called "gruseq.fa", by taking the truseq adapters (in /bbmap/resources/truseq.fa) and rotating the letters: A->T, C->A, G->C, T->G. Thus they should be totally unlike biological sequences or real adapter sequences, yet remain computationally equivalent.
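The rotation is a one-line base translation; note that it is not reverse-complementation, which is why the result resembles neither real adapters nor biological sequence:

```python
def rotate_bases(seq):
    """Rotate A->T, C->A, G->C, T->G, producing a sequence that is
    computationally adapter-like but unlike any real adapter."""
    return seq.translate(str.maketrans("ACGT", "TACG"))
```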

Then, I added adapters to the real reads using a special program I made (also in BBTools) called AddAdapters:

addadapters.sh in=reads.fq out=dirty.fq qout=33 ref=gruseq.fa right int=f

"int=f" causes the reads to be treated as single-ended, to simplify things. "right" means the adapters will be of the 3' type, so they will be added at a random location from 0 to 149, and may run off the 3' end of the read, but the read length stays at 150. If the adapter ends before the end of the read, random bases are used to fill the rest. Approximately 50% of the reads get adapters, and 50% don't. After an adapter is added, each of its bases may be changed to a different base, with a probability taken from the read's quality score at that position, to simulate sequencing error.
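The procedure above can be sketched as a toy model (not the actual AddAdapters code; the ~50% adapter rate would be handled by the caller, and the field meanings follow the description above):

```python
import random

def add_adapter(read, quals, adapter, rng):
    """Splice an adapter at a random position, pad with random bases to
    keep the original length, and mutate each inserted base with the
    error probability implied by its quality score."""
    p = rng.randrange(len(read))          # adapter start, 0..len-1
    tail_template = adapter + "".join(rng.choice("ACGT")
                                      for _ in range(len(read)))
    tail = []
    for base, q in zip(tail_template[:len(read) - p], quals[p:]):
        if rng.random() < 10 ** (-q / 10):   # simulate a sequencing error
            base = rng.choice([b for b in "ACGT" if b != base])
        tail.append(base)
    return read[:p] + "".join(tail), p
```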

Next, I ran 3 different programs - Trimmomatic, Cutadapt, and BBDuk, and measured their performance. This was on a 16-core, dual-socket Sandy Bridge E node with 128GB RAM, reserved for exclusive use, and only interacting with local disk, so the rest of the cluster had no impact on the tests. Hyperthreading was enabled.

Quote:
time cutadapt -m 10 -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGATACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAACTGCGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGGTCCATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGCTAATTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATATCGCTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACAATTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAATCTGATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATAGGCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACTGATCTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGTCAGGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACCAGTATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAAGGCGTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATCGATTATTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATCGGAACGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGCGATCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAAACGAAACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGAACATATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGCTTTACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGCCAAGGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGGGACCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAACGTACGTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACTCGCCTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATAGCTGTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGGAAGGGTGAGACGTGCAACGAGGAGCAGGC dirty.fq > cutadapt.fq

real 8m43.129s
user 8m39.720s
sys 0m2.090s
(523.129 seconds)

Quote:
time java -Xmx8g -jar trimmomatic-0.32.jar SE -phred33 dirty.fq trimmomatic.fq ILLUMINACLIP:gruseq.fa:2:28:10 MINLEN:10

real 0m7.418s
user 1m50.996s
sys 0m3.061s
(7.418 seconds, 110.996 cpu-seconds)


Quote:
time bbduk.sh in=dirty.fq ref=gruseq.fa ktrim=r mink=12 hdist=1 out=bbduk.fq minlen=10

real 0m1.683s
user 0m9.926s
sys 0m0.813s
(1.683 seconds, 9.926 cpu-seconds)

So, Cutadapt is extremely slow, and BBDuk takes both the speed and efficiency wins by a large margin. Of course, accuracy is more important than speed, so I graded the results. addadapters.sh already replaced each read's original name with a synthetic name indicating its original length and the length that the read should be after trimming. For example, "@0_150_11" means that the read was originally 150bp, but should be 11bp after trimming, because an adapter was added at position 11 (0-based). This allows quantification of both the number of bases correctly and incorrectly removed. Ideally, "Adapters Remaining" and "Non-Adapter Removed" should both be zero after trimming. For reference, this is what the untrimmed file looks like:
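The grading logic can be sketched from that naming convention (my reading of the header format, taken from the description above: read id, original length, correct post-trim length):

```python
def grade_read(name, trimmed_len):
    """Classify one trimmed read from a header like '0_150_11'."""
    _, original_len, correct_len = (int(x) for x in name.split("_"))
    assert trimmed_len <= original_len
    if trimmed_len == correct_len:
        return "correct"
    if trimmed_len > correct_len:
        return "adapter_remaining"      # adapter bases were kept
    return "non_adapter_removed"        # genuine bases were cut
```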

Quote:
addadapters.sh in=dirty.fq grade

Total output: 1000000 reads 150000000 bases
Perfectly Correct (% of output): 499861 reads (49.986%) 74979150 bases (49.986%)
Incorrect (% of output): 500139 reads (50.014%) 75020850 bases (50.014%)

Adapters Remaining (% of adapters): 500139 reads (100.000%) 37776278 bases (25.184%)
Non-Adapter Removed (% of valid): 0 reads (0.000%) 0 bases (0.000%)
Roughly 50% of the reads have adapters and 25% of the bases are sequence that should be removed (either adapter or random bases after the adapter).

Quote:
addadapters.sh in=cutadapt.fq grade

Total output: 984925 reads 131869936 bases
Perfectly Correct (% of output): 690175 reads (70.074%) 87863452 bases (66.629%)
Incorrect (% of output): 294750 reads (29.926%) 44006484 bases (33.371%)

Adapters Remaining (% of adapters): 275167 reads (56.728%) 19789625 bases (15.007%)
Non-Adapter Removed (% of valid): 19583 reads (1.988%) 67473 bases (0.060%)

Quote:
addadapters.sh in=trimmomatic.fq grade

Total output: 981620 reads 131483263 bases
Perfectly Correct (% of output): 630177 reads (64.198%) 85281026 bases (64.861%)
Incorrect (% of output): 351443 reads (35.802%) 46202237 bases (35.139%)

Adapters Remaining (% of adapters): 285886 reads (59.342%) 19483613 bases (14.818%)
Non-Adapter Removed (% of valid): 65557 reads (6.678%) 131182 bases (0.117%)

Quote:
addadapters.sh in=bbduk.fq grade

Total output: 966786 reads 113303242 bases
Perfectly Correct (% of output): 901541 reads (93.251%) 103689866 bases (91.515%)
Incorrect (% of output): 65245 reads (6.749%) 9613376 bases (8.485%)

Adapters Remaining (% of adapters): 65243 reads (13.973%) 1229480 bases (1.085%)
Non-Adapter Removed (% of valid): 2 reads (0.000%) 27 bases (0.000%)
BBDuk performs quite well, vastly outperforming Cutadapt and Trimmomatic on every metric. Trimmomatic and Cutadapt both do quite poorly, though of the two, Cutadapt has both a higher true-positive rate and a much lower false-positive rate than Trimmomatic, so it takes second place in accuracy. If anyone has other adapter-trimming tools they commonly use, please reply and I'll be happy to test them with the same methodology. Also, if anyone has suggestions for better parameters, please reply; I have no experience with either tool, so I'm basically using the defaults.

All of this is testable and repeatable - you can use your own data and your own adapters, or the attached "gruseq.fa.gz" file and this qual file to replicate my results. The exact numbers will depend somewhat on the quality values of the real data, and very slightly on the organism.

P.S. You can get slightly better accuracy with two passes using different values of k and hdist, like this:

Quote:

bbduk.sh -Xmx1g in=dirty.fq out=stdout.fq ref=gruseq.fa k=27 ktrim=r hdist=2 mink=16 | bbduk.sh -Xmx1g in=stdin.fq out=clean.fq ref=gruseq.fa k=23 ktrim=r hdist=0 ow mink=10

addadapters.sh in=bbduk.fq grade


Total output: 966944 reads 113088318 bases
Perfectly Correct (% of output): 911549 reads (94.271%) 104827222 bases (92.695%)
Incorrect (% of output): 55395 reads (5.729%) 8261096 bases (7.305%)

Adapters Remaining (% of adapters): 55393 reads (11.870%) 1006866 bases (0.890%)
Non-Adapter Removed (% of valid): 2 reads (0.000%) 20 bases (0.000%)
Attached Images
File Type: png AdapterTrimSpeed.png (32.8 KB, 48 views)
Attached Files
File Type: gz gruseq.fa.gz (269 Bytes, 23 views)

Last edited by Brian Bushnell; 11-19-2015 at 08:59 AM.
Brian Bushnell is offline   Reply With Quote
Old 05-20-2014, 07:08 PM   #3
punto_c
Member
 
Location: Japan, Osaka

Join Date: Oct 2013
Posts: 11
Default PE adapter trimming in single command

Hello Brian
First of all thank you for sharing your software
I am currently testing a few reads preprocessing tools (cutadapt, trimmomatic, fastx_tools and BBmap)

I can see from your post that is possible to quality trim PE reads in a single command:
bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=rl trimq=10

I was wondering if it possible to do that also for Trimming the adapters
Anyways, should I give for the quality trimming the files without the adapters?

sorry if some questions might look too simple to you, but I am not an expert in raw read processing yet

anyway, thanks a lot in advance for your kind reply

bests

Salvatore
punto_c is offline   Reply With Quote
Old 05-20-2014, 07:26 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Salvatore,

Yes, you can quality-trim and adapter-trim at the same time with BBDuk. The command line would be like this:

bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=rl trimq=10 ktrim=r k=25 mink=11 ref=truseq.fa.gz hdist=1

You can use whatever adapter sequence you want (with the "ref=" flag), and I include the Illumina truseq adapters with the BBMap package, in the "resources" directory. The command above will allow up to 1 mismatch (hdist=1) and trim as few as 11 trailing adapter bases. You can of course increase the number of mismatches and decrease the minimal number of trailing bases that are trimmed. If you have really low-quality reads with an average insert size close to the read length (and thus a high rate of adapters) you should probably set hdist=2 and mink=6 to remove more of them, but be aware that those settings will increase the false positive detection rate.

It's best to do adapter-trimming first, then quality-trimming, because if you do quality-trimming first, sometimes adapters will be partially trimmed and become too short to be recognized as adapter sequence. When you run BBDuk with both quality-trimming and adapter-trimming in the same run, it will do adapter-trimming first, then quality-trimming.
Brian Bushnell is offline   Reply With Quote
Old 05-26-2014, 11:40 PM   #6
punto_c
Member
 
Location: Japan, Osaka

Join Date: Oct 2013
Posts: 11
Default

Ciao

Thank you a lot for your nice and fast reply
I wanted to inform you that based on my first simple tests on illumina data your method performed best, followed closely by trimmomatic
the other methods tested where cutadapt and fastx but the latter are way behind bbduk and trimmomatic

ciao ciao

Salvatore
punto_c is offline   Reply With Quote
Old 05-27-2014, 08:51 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Salvatore,

Thanks for the feedback!

-Brian
Brian Bushnell is offline   Reply With Quote
Old 06-02-2014, 01:59 AM   #8
punto_c
Member
 
Location: Japan, Osaka

Join Date: Oct 2013
Posts: 11
Default

Hello Brian

sorry if I write on the wrong post, but I wanted to ask you about dedupe
I wanted to know if it possible to output simple stats about the repeated sequences
For example:
number of copies, and length (I am assuming 100% identity among the copies)
Alternatively, also a single file with all the repetitions would do
for now, if I am not wrong the option outd gives the file with only 1 copy of each of the repeated sequences

thanks a lot in advance for you help

bests

Salvatore
punto_c is offline   Reply With Quote
Old 06-02-2014, 08:30 AM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Salvatore,

"outd" will print every duplicate removed. So if some read has 5 copies, then 1 copy will go to "out" and 4 copies will go to "outd". Therefore I think that's what you're looking for.

By default, only exact duplicates and exact containments will be removed. So if you have a 200bp contig X and a 100bp contig Y such that Y is a substring of X, then Y will be removed and X will be kept. You can adjust this with the "ac" (absorb containment) flag.

Also, if the input is paired reads, they will only be removed if both reads exactly match another pair.
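As a sketch of that default behavior (exact duplicates plus, with containment absorption, exact substrings; Dedupe's real implementation is kmer-indexed and far more efficient, and pair handling is omitted here):

```python
def dedupe(seqs, absorb_containments=True):
    """Keep one copy of each unique sequence; optionally drop sequences
    that are exact substrings of a longer kept sequence."""
    kept = []
    # Longest first, so containers are kept before their substrings.
    for s in sorted(set(seqs), key=len, reverse=True):
        if absorb_containments and any(s in longer for longer in kept):
            continue   # absorbed: exact containment in a kept sequence
        kept.append(s)
    return kept
```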
Brian Bushnell is offline   Reply With Quote
Old 06-02-2014, 04:35 PM   #10
punto_c
Member
 
Location: Japan, Osaka

Join Date: Oct 2013
Posts: 11
Default

Hello Brian
Thank you for your reply
Actually I realised the copies were there just after I posted my question
anyway, it was not clear to me that only n-1 of the copies go to the file specified by outd

thanks again

Salvatore
punto_c is offline   Reply With Quote
Old 06-17-2014, 10:19 AM   #11
manuelkleiner
Junior Member
 
Location: Calgary

Join Date: Jun 2014
Posts: 5
Default Cut off the first 10 bp of each read with BBDuk

Dear Brian,
Thank you for BBDuk. Great tool!
I have been using it so far for quality and adapter trimming, but there is one additional trimming step that I would like to do with it and I am not sure if and how it is possible with BBDuk.
The first 10 bases of my reads are usually of high quality, but I observe that their GCAT-content is not as even as it should be. So I would like to cut the first 10 bases of all reads during the quality and adapter trimming. Is that possible with BBDuk?
I have been doing this with nesoni clip so far, but it would be great if I could do all the trimming in BBDuk in one or two steps.
Thank you,
Manuel Kleiner
manuelkleiner is offline   Reply With Quote
Old 06-17-2014, 12:33 PM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Manuel,

A couple of comments. First, this is currently possible with Reformat, but not with the released version of BBDuk. I'll update it later today or this week so that it will be possible to do both adapter and positional trimming in one command. Until then, you can do this:

reformat.sh in=reads.fq out=trimmed.fq ftl=10

...where "ftl" means "force trim left".

Second, it's important to make sure these bases should really be trimmed. We have been generating some Nextera libraries recently with very erratic base frequency for the first 20 bases:

[attached image: trimming_nextera.png]
The top is the base composition histogram before adapter-trimming, and the bottom is after (this shows read 1 from 0-150 and read 2 from 152-302); note how the right part of the read looks much better after adapter-trimming. But the first 20 bases look terrible! However, I mapped the adapter-trimmed reads to the assembly with BBMap using the 'mhist' flag, which generates a histogram of match/substitution/insertion/deletion rates by read position:

[attached image: nextera_trimmed_mhist.png]
The error rate is a little higher for the first few bases, but still well under 1%, so we are not going to trim the first 20bp off of those reads, as was initially proposed. The reads are accurate even though the base composition is highly biased, because the fragmentation was not random (this uses some kind of enzyme). Generally, before you trim off bases because of a skewed base composition histogram, I suggest mapping to see if there actually is a higher error rate there.

For reference, the command to generate those histograms:

bbmap.sh in=reads.fq ref=assembly.fa mhist=mhist.txt bhist=bhist.txt nodisk

-Brian
Attached Images
File Type: png trimming_nextera.png (69.3 KB, 1585 views)
File Type: png nextera_trimmed_mhist.png (24.0 KB, 1560 views)
Brian Bushnell is offline   Reply With Quote
Old 06-17-2014, 01:33 PM   #13
manuelkleiner
Junior Member
 
Location: Calgary

Join Date: Jun 2014
Posts: 5
Default

Hi Brian,
Thank you for your fast reply and for explaining how I can check if these first 10 bases should really be trimmed. Very elegant way to test this.
Best,
Manuel
manuelkleiner is offline   Reply With Quote
Old 06-20-2014, 12:59 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Manuel,

The latest release of BBTools (32.32) now includes the "ftl" and "ftr" flags in BBDuk, so you can force-trim the leftmost 10 bases at the same time as adapter-trimming.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 06-20-2014, 01:10 PM   #15
manuelkleiner
Junior Member
 
Location: Calgary

Join Date: Jun 2014
Posts: 5
Default

Cool thank you!
manuelkleiner is offline   Reply With Quote
Old 07-03-2014, 07:53 AM   #16
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20
Default

Hi Brian,

Can you elaborate on what you mean by quality trimming being done using the Phred algorithm, and how this is more accurate than naive trimming?

Thank you
salamay is offline   Reply With Quote
Old 07-03-2014, 08:51 AM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Imagine a read with this quality profile:

40, 40, 40, 40, 2, 2, 2, 2, 40, 2

What I would term "naive trimming" to Q10 would trim only the last base with quality 2, and stop because the next base has Q40. This would leave 4 internal bases with Q2, which is not desirable.

The Phred algorithm would trim the last 6 bases, because their average quality (calculated by summing the error probabilities) is 2.79, which is below 10. Trimming regions with average quality below a threshold gives the optimal result in terms of the ratio of retained bases to the expected number of errors.
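As a sketch of the difference, here are both approaches in Python (a Mott-style trim in probability space as I understand the description above, not BBDuk's actual code):

```python
def naive_trim_point(quals, trimq):
    """Strip trailing bases only while each is individually below trimq."""
    n = len(quals)
    while n and quals[n - 1] < trimq:
        n -= 1
    return n  # keep quals[:n]

def phred_trim_point(quals, trimq):
    """Cut where the removed suffix has the largest total error excess
    over the threshold, summing error probabilities (10^(-Q/10))."""
    thresh = 10 ** (-trimq / 10)
    best, cut, acc = 0.0, len(quals), 0.0
    for i in range(len(quals) - 1, -1, -1):
        acc += 10 ** (-quals[i] / 10) - thresh
        if acc > best:          # best place seen so far to cut
            best, cut = acc, i
    return cut  # keep quals[:cut]
```

For the profile above with trimq=10, naive trimming keeps 9 bases while the Phred-style trim keeps only the first 4; a symmetric left-to-right pass would give the qtrim=l behavior.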
Brian Bushnell is offline   Reply With Quote
Old 07-03-2014, 09:48 AM   #18
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20
Default

What would bbduk do if for instance the right and left ends were high quality but a region in the middle of the read was low quality? For instance:

40, 40, 40, 40, 2, 2, 2, 40, 40, 40

Your explanation was excellent but I am just curious about this special case (if it ever even occurs).
salamay is offline   Reply With Quote
Old 07-03-2014, 10:02 AM   #19
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

salamay,

In that case, it would trim all but the first 4 bases. On the other hand, in this case:

40, 40, 40, 2, 2, 2, 40, 40, 40, 40

...it would trim all but the LAST four bases, because there are more high-quality bases on the right end than on the left end. The result after trimming is always guaranteed to be the largest possible region with an average quality above the threshold, such that no flanking sub-region has an average quality below the threshold. Illumina reads typically have the lowest quality on the ends and higher quality in the middle, but you can get low quality in the middle if there's a laser failure or air bubbles in the middle of the run. A single bad base in the middle won't cause half the read to be trimmed unless the average quality threshold is set pretty high, but multiple bad bases in the middle will cause either the first or last half of the read to be trimmed to get rid of them.
Brian Bushnell is offline   Reply With Quote
Old 07-03-2014, 10:04 AM   #20
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20
Default

That is extremely informative Brian. Thank you very much, I am very pleased with the bb suite of programs.
salamay is offline   Reply With Quote