Seqanswers Leaderboard Ad

**Brian Bushnell** · 02-27-2015, 11:11 AM

Ah! You have "out1m" rather than "outm1". Hmmm, maybe I should make that also acceptable.

Incidentally, in most case you can just write this: "outm=matched#.fq" which will be translated to "outm=matched1.fq outm2=matched2.fq", to shorten the command lines. You can do that for input also.

**GenoMax** · 02-27-2015, 11:29 AM

I was thinking out1= and out1m= would be the logical pair. Should have tried the other way.

Can you add the info above to the original post on page 1 when you have a chance (I refer to the examples there when I am looking for something new and many probably look at that first post too)?

**Brian Bushnell** · 02-27-2015, 12:07 PM

OK, I updated it.

**jazz710** · 03-11-2015, 07:52 AM

I'd like to make a brief suggestion, thought it's obviously not important in the grand scheme.

I thought it would be nice if there was a batch setting for BBDuk. Currently, I have 48 fastq files all with nice name (Sample1.fq, Sample2.fq, etc).

It would be great if I could run something like:

bbduk.sh -Xmx1g batchdir=/fastqDir in=<>.fq out=<>.clean.fq ref=adapters.fa ktrim=r k=28 mink=12 hdist=1

Basically, if your FASTQ files organized and each will be treated identically, allow BBDuk to batch process them and change the input and output names automatically.

Great tool!

**GenoMax** · 03-11-2015, 08:02 AM

Just making a practical suggestion (not directly related to your request).

Rather than sequentially processing your files via a single command, if you have access to a cluster write a simple shell/job submit script to do 48 parallel submissions so you can be more efficient.

I won't be surprised if BBTools has such a setting already (that Brian may not have elaborated on so far

).

**Brian Bushnell** · 03-11-2015, 08:42 AM

No such setting yet

But I will consider adding it when I have time; it would be pretty useful for me, too.

**ysnapus** · 03-24-2015, 09:58 AM

Originally posted by Brian Bushnell View Post

I typically use k=25, hdist=1 for adapter trimming and removal of other short synthetic contaminant sequences, just because some of them are very short (under 30bp). For long contaminant sequences (phiX, ecoli, etc) I always use k=31 to maximize the specificity; with 100bp reads you should get a match SOMEWHERE, especially if you allow a mismatch. Note that the "mm" or "maskmiddle" flag (which defaults to true for filtering) is roughly equivalent to allowing an additional +1 hamming distance for free.

Hi Brian,

A couple of quick questions. Since adapters might only have small fragments attached to the ends of the reads, shouldn't we use a smaller mink? The bbduk.sh help lists a default mink=6. Also, for adapter trimming should we use some hdist2? (Say hdist2=1)?

**Brian Bushnell** · 03-24-2015, 10:23 AM

Originally posted by ysnapus View Post

Hi Brian,

A couple of quick questions. Since adapters might only have small fragments attached to the ends of the reads, shouldn't we use a smaller mink? The bbduk.sh help lists a default mink=6. Also, for adapter trimming should we use some hdist2? (Say hdist2=1)?

The current version has default mink listed as 0 (meaning disabled); it's trimq that's 6 by default.

Also, this is not really explained very well in the help, so let me state this clearly:

If you only set "hdist", that value will also be applied to "hdist2".
If you specify both "hdist" and "hdist2", they will be set independently.

So:

"hdist=2" will use 2 for both long and short kmers; "hdist=2 hdist2=1" will use 2 for long kmers and 1 for short kmers. Therefore, you don't need to set hdist2 unless you want it to be different from hdist.

The only reason to have "mink=11" rather than lower is to prevent false-positive trimming. At mink=6 and hdist=1, you have roughly 0.5% chance of false-positive trimming if you are using exactly one adapter sequence, which is still pretty low. It just depends on your tolerance for adapter sequence - if it's important to get rid of as much as possible, then a lower value of mink is better.

Normally, if the reads are a paired fragment library, I add the "tbo tpe" flags. "tbo" does a good job of getting the last few bases of adapter sequence that are too short to detect using kmer matching, down to even 1bp of adapter sequence.

**nepossiver** · 04-17-2015, 04:52 PM

I've started to experiment with BBDuk, and I would like some help understanding the output. Initially I understood "KTrimmed" as the results of kmer adapter removal, but after playing a bit it seems "KTrimmed" sums up kmer adapter removal and other trims, at least in some situations.

These tests are for 250bp MiSeq paired reads. BBDuk with minlen=51:

Code:

bbduk.sh -Xmx4g threads=4 in1=R1.fastq in2=R2.fastq out1=trimR1.fq out2=trimR2.fq outsingle=trimS.fq k=25 mink=12 hdist=1 ktrim=r ref=truseq.fa.gz,nextera.fa.gz minlen=51 tpe tbo

BBDuk version 34.86
Set threads to 4
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=4030m, used=86m

Added 161221 kmers; time: 	0.185 seconds.
Memory: free=3794m, used=322m

Input is being processed as paired
Started output streams:	0.009 seconds.
Processing time:   		28.745 seconds.

Input:                  	2967156 reads 		542374420 bases.
KTrimmed:               	132538 reads (4.47%) 	6891444 bases (1.27%)
Trimmed by overlap:     	9464 reads (0.32%) 	342404 bases (0.06%)
Result:                 	2859656 reads (96.38%) 	535140572 bases (98.67%)

Time:   			28.945 seconds.
Reads Processed:       2967k 	102.51k reads/sec
Bases Processed:        542m 	18.74m bases/sec

Then with minlen=201:

Code:

bbduk.sh -Xmx4g threads=4 in1=R1.fastq in2=R2.fastq out1=trimR1.fq out2=trimR2.fq outsingle=trimS.fq k=25 mink=12 hdist=1 ktrim=r ref=truseq.fa.gz,nextera.fa.gz minlen=201 tpe tbo

BBDuk version 34.86
Set threads to 4
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=4030m, used=86m

Added 161221 kmers; time: 	0.185 seconds.
Memory: free=3794m, used=322m

Input is being processed as paired
Started output streams:	0.011 seconds.
Processing time:   		23.263 seconds.

Input:                  	2967156 reads 		542374420 bases.
KTrimmed:               	1506815 reads (50.78%) 	184149438 bases (33.95%)
Trimmed by overlap:     	8290 reads (0.28%) 	273347 bases (0.05%)
Result:                 	1463616 reads (49.33%) 	357740553 bases (65.96%)

Time:   			23.465 seconds.
Reads Processed:       2967k 	126.45k reads/sec
Bases Processed:        542m 	23.11m bases/sec

Depending on minlen, "KTrimmed" reports very different numbers, but adapter contamination should be identical, am I correct? Wouldn't make more sense report reads removed for being too short under "QTrimmed", or maybe something as "LTrimmed"?

Finally, a warning to those unaware, and on the rare situation of using a computer with IBM Java (such as SUSE Enterprise): BBDuk (and GATK, Maven and some others) break under this Java, one has to install Oracle Java or OpenJava in order for them to work.

thanks for the fantastic tools, nep

**Brian Bushnell** · 04-19-2015, 10:15 PM

BBDuk tracks which operation lead to particular bases being removed. So, if you have no minlength, a 250bp read trimmed via kmers to 190bp will yield 60bp ktrimmed; or, if it had no matching kmers but had low quality so that it was quality-trimmed to 190bp, it would yield 60bp qtrimmed.

But let's say you set "minlen=200". If the length drops below the threshold, all bases will be considered trimmed, and the entire read will be associated with the reason it was eliminated - so a low-quality read will increment qtrimmed by +250, while a read with adapter bases at position 190 would increment ktrimmed by +250. Furthermore, in the default mode (which can be overridden with the "rieb=f" [removeifeitherbad=false] flag), if either read is trimmed below the threshold, both reads are discarded - so if minlen=200 and read 2 was trimmed to 190bp, in a 2x250bp run, both would be discarded. So, 500bp would be added to either ktrimmed or qtrimmed, whichever triggered the removal.

I prefer this over having an "ltrimmed" line for bases that were removed because the reads were too short, because in that case, you can't tell what triggered the removal. Of course, each way has advantages - if you are only doing a single operation, "ltrimmed" would be preferable - but I usually run BBDuk with multiple operations per pass.

**nepossiver** · 04-24-2015, 02:02 PM

Originally posted by Brian Bushnell View Post

I prefer this over having an "ltrimmed" line for bases that were removed because the reads were too short, because in that case, you can't tell what triggered the removal. Of course, each way has advantages - if you are only doing a single operation, "ltrimmed" would be preferable - but I usually run BBDuk with multiple operations per pass.

Thanks for the response, I understand your reasoning. I guess I could call bbduk twice to do a "two-pass" trimming, first trimming by minlen only, then adapters+minlen. As BBTools are so fast it does not cost much anyway.

By the way, is tbo the same method as described in Front Genet. 2014; 5: 5?

**Brian Bushnell** · 04-24-2015, 02:24 PM

Yes, it looks like the concept is the same, though the implementation is different as BBDuk uses BBMerge's overlapper code.

**IonTom** · 05-14-2015, 02:03 AM

Dear Brian,

i just started using BBduk and am really happy with it.

There is just a question concerning RNASeq from the NextSeq500.

Which setting would you recommend for quality trimming, as the Phred scores seem quite high independent of the sample quality ? I have been trying trimq 15 so far.

Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?

Many thanks for your help

**GenoMax** · 05-14-2015, 03:51 AM

Originally posted by IonTom View Post

Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?

repair.sh from BBMap can separate the unpaired reads.

Code:

$ repair.sh in1=read_1.q in2=read_2.fq out1=fixed_1.fq out2=fixed_2.fq outsingle=single.fq

**Brian Bushnell** · 05-14-2015, 09:28 AM

Originally posted by IonTom View Post

There is just a question concerning RNASeq from the NextSeq500.

Which setting would you recommend for quality trimming, as the Phred scores seem quite high independent of the sample quality ? I have been trying trimq 15 so far.

It is difficult to effectively quality-trim NextSeq data because the quality scores are both binned and inaccurate (for V1; V2 is still binned, but more accurate). For V1 chemistry, on the raw reads, I would recommend 15, and that should be adequate. But it depends on what you are doing with the data. BBMap, for example, does not generally need any quality-trimming unless the data is really terrible.

If you want optimal quality-trimming for NextSeq data, and are willing to spend a little more time, and you have a reference transcriptome (for RNA-seq), I would suggest first recalibrating the quality scores, like this:

Code:

bbmap.sh ref=transcriptome.fa in=reads.fq out=mapped.sam reads=8m maxindel=200

calctruequality.sh in=mapped.sam

bbduk.sh in=reads.fq out=trimmed.fq recal qtrim=r trimq=15

In this case the quality will be recalibrated by BBDuk prior to trimming. If you choose to recalibrate quality, I highly recommend using the latest version of BBTools (34.94) as it has had some extensive improvements for recalibration recently, specifically for NextSeq.

Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?

Many thanks for your help

For paired data, your input reads should be properly ordered prior to processing with BBDuk, and they will still be properly ordered after processing - pairs will always be kept together. In situations where one read is trimmed shorter than minlength and the other is not, both will be discarded. You can save the longer of the two reads by setting the "outs" stream, which will catch singletons.

If your pairing was broken prior to running BBDuk, then follow Genomax's suggestion and repair them first.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News