SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Adapter trimming figo1019 RNA Sequencing 2 07-17-2018 05:00 AM
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 132 04-18-2017 02:04 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 05:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 08:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 01:53 PM

Reply
 
Thread Tools
Old 12-25-2014, 06:44 AM   #101
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

Quote:
Originally Posted by michaellim View Post
Hi Genomax,

Sorry, but I don't understand what is "/path/to/", if my files are all in C: drive Windows, will it be

Code:
$ java -ea -Xmx2g in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2 out1=??? out2=??? ref=C:\bbmap\resources\truseq.fa.gz ktrim=r
??? For out1 and out2 do I need to create some kind of new file?
??? I don't know what to put for k, mink and hdist, should k be 33 because the adapter sequences are 33 bases? not sure about mink and hdist.

Thank you. Merry Christmas
Assuming your BBMap directory is at c:\ level the following command gives you a general idea of how to structure your own.

out1 and out2 represent two new files names that you provide that will contain the adapter trimmed (cleaned) reads from R1 and R2 read files.

Code:
$ java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2.fastq out1=c:\sequencing\ST131_1_R1_cleaned.fastq out2=c:\sequencing\ST131_1_R2_cleaned.fastq ref=C:\bbmap\resources\truseq.fa.gz ktrim=r
If is possible that your reads are already adapter trimmed so no reads may get thrown out/trimmed after running BBDuk tool. Look at the stats produced at the end to see what happens.

Look at the contents of bbduk.sh file (open it in wordpad) to understand the meaning of optional parameters. You may or may not want/need to provide additional parameters.

Last edited by GenoMax; 12-25-2014 at 07:15 AM.
GenoMax is offline   Reply With Quote
Old 02-12-2015, 05:50 PM   #102
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by sdriscoll View Post
Hey Brian,

bbduk and seal are certainly great programs and since I have downloaded these tools I've been using them pretty much daily. While most kmer tools ignore the idea of allowing a mismatch (since obviously this complicates hash lookups) I like that you have included this however using it imposes such a severe spike in memory usage that it becomes basically unusable. Case in point - using seal to produce hits to genes in a full transcriptome while allowing a mismatch per kmer. Is there a reason that you put the burden of additional kmers on the reference and not on the read? It seems to me that there would be a performance hit but I would much rather have that than the massive amount of memory needed to produce a kmer index of a full transcriptome allowing even one mismatch. So during matching the read could be kmer'd into all possible single mismatch variants and matched to the normal 0-mismatch kmer index. Or does that bring in additional issues that are not easily resolved?
I just released a new version of BBTools (34.48) which adds this capability. You can use it with the "qhdist" flag (queryhammingdistance). It doesn't really slow down very much as long the vast majority of the reads are expected to have kmer matches; but if the majority of reads don't have kmer matches, there will be a ~3*K-fold slowdown. But even then the speed is still acceptable for most purposes.

Here's a sensitivity analysis. I generated 100k 100bp reads from E.coli, and gave them each a random number of SNPs from 0 to 20 (probably 15 on average). Then I tested the percentage that were correctly identified as E.coli by BBDuk:

bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f
Time: 0.8 seconds
Filtered: 33.51%

bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t
Time: 0.8 seconds
Filtered: 39.83%

bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f hdist=1
Time: 30 seconds (29 of which were load time)
Filtered: 48.62%

bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f qhdist=1
Time: 10 seconds
Filtered: 48.62%

bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t qhdist=1
Time: 10 seconds
Filtered: 53.51%

bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t qhdist=1 hdist=1
Time: 42 seconds (29 of which were load time)
Filtered: 63.06%

So "mm=t" is not really as good as hdist=1 if there is a really high error rate, but it's free.

P.S. Incidentally, BBMap mapped around 55% on defaults and 66% with high sensitivity (k=12 minratio=0.1).

Last edited by Brian Bushnell; 02-23-2015 at 08:13 PM.
Brian Bushnell is offline   Reply With Quote
Old 02-27-2015, 10:39 AM   #103
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

@Brian: With paired end reads out1m= and out2m= don't seem to work (only outm=). Is that by design?

Last edited by GenoMax; 02-27-2015 at 11:31 AM. Reason: Corrected
GenoMax is offline   Reply With Quote
Old 02-27-2015, 10:49 AM   #104
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi GenoMax,

I just tried it, and it seems to work fine for me. Are you getting an error message, or is one of the files simply not being produced? Also, what version are you running?
Brian Bushnell is offline   Reply With Quote
Old 02-27-2015, 10:58 AM   #105
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi GenoMax,

I just tried it, and it seems to work fine for me. Are you getting an error message, or is one of the files simply not being produced? Also, what version are you running?
I am using v. 34.33.

Code:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter out1m=xxxx_AACCAGCT_L001_R1_001_match.fastq.gz
	at jgi.BBDukF.<init>(BBDukF.java:400)
	at jgi.BBDukF.main(BBDukF.java:62)
GenoMax is offline   Reply With Quote
Old 02-27-2015, 11:11 AM   #106
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Ah! You have "out1m" rather than "outm1". Hmmm, maybe I should make that also acceptable.

Incidentally, in most case you can just write this: "outm=matched#.fq" which will be translated to "outm=matched1.fq outm2=matched2.fq", to shorten the command lines. You can do that for input also.
Brian Bushnell is offline   Reply With Quote
Old 02-27-2015, 11:29 AM   #107
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

I was thinking out1= and out1m= would be the logical pair. Should have tried the other way.

Can you add the info above to the original post on page 1 when you have a chance (I refer to the examples there when I am looking for something new and many probably look at that first post too)?
GenoMax is offline   Reply With Quote
Old 02-27-2015, 12:07 PM   #108
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

OK, I updated it.
Brian Bushnell is offline   Reply With Quote
Old 03-11-2015, 08:52 AM   #109
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

I'd like to make a brief suggestion, thought it's obviously not important in the grand scheme.

I thought it would be nice if there was a batch setting for BBDuk. Currently, I have 48 fastq files all with nice name (Sample1.fq, Sample2.fq, etc).

It would be great if I could run something like:

bbduk.sh -Xmx1g batchdir=/fastqDir in=<>.fq out=<>.clean.fq ref=adapters.fa ktrim=r k=28 mink=12 hdist=1

Basically, if your FASTQ files organized and each will be treated identically, allow BBDuk to batch process them and change the input and output names automatically.

Great tool!
jazz710 is offline   Reply With Quote
Old 03-11-2015, 09:02 AM   #110
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

Just making a practical suggestion (not directly related to your request).

Rather than sequentially processing your files via a single command, if you have access to a cluster write a simple shell/job submit script to do 48 parallel submissions so you can be more efficient.

I won't be surprised if BBTools has such a setting already (that Brian may not have elaborated on so far ).
GenoMax is offline   Reply With Quote
Old 03-11-2015, 09:42 AM   #111
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

No such setting yet But I will consider adding it when I have time; it would be pretty useful for me, too.
Brian Bushnell is offline   Reply With Quote
Old 03-24-2015, 10:58 AM   #112
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
I typically use k=25, hdist=1 for adapter trimming and removal of other short synthetic contaminant sequences, just because some of them are very short (under 30bp). For long contaminant sequences (phiX, ecoli, etc) I always use k=31 to maximize the specificity; with 100bp reads you should get a match SOMEWHERE, especially if you allow a mismatch. Note that the "mm" or "maskmiddle" flag (which defaults to true for filtering) is roughly equivalent to allowing an additional +1 hamming distance for free.
Hi Brian,

A couple of quick questions. Since adapters might only have small fragments attached to the ends of the reads, shouldn't we use a smaller mink? The bbduk.sh help lists a default mink=6. Also, for adapter trimming should we use some hdist2? (Say hdist2=1)?
ysnapus is offline   Reply With Quote
Old 03-24-2015, 11:23 AM   #113
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by ysnapus View Post
Hi Brian,

A couple of quick questions. Since adapters might only have small fragments attached to the ends of the reads, shouldn't we use a smaller mink? The bbduk.sh help lists a default mink=6. Also, for adapter trimming should we use some hdist2? (Say hdist2=1)?
The current version has default mink listed as 0 (meaning disabled); it's trimq that's 6 by default.

Also, this is not really explained very well in the help, so let me state this clearly:

If you only set "hdist", that value will also be applied to "hdist2".
If you specify both "hdist" and "hdist2", they will be set independently.

So:

"hdist=2" will use 2 for both long and short kmers; "hdist=2 hdist2=1" will use 2 for long kmers and 1 for short kmers. Therefore, you don't need to set hdist2 unless you want it to be different from hdist.

The only reason to have "mink=11" rather than lower is to prevent false-positive trimming. At mink=6 and hdist=1, you have roughly 0.5% chance of false-positive trimming if you are using exactly one adapter sequence, which is still pretty low. It just depends on your tolerance for adapter sequence - if it's important to get rid of as much as possible, then a lower value of mink is better.

Normally, if the reads are a paired fragment library, I add the "tbo tpe" flags. "tbo" does a good job of getting the last few bases of adapter sequence that are too short to detect using kmer matching, down to even 1bp of adapter sequence.
Brian Bushnell is offline   Reply With Quote
Old 04-17-2015, 05:52 PM   #114
nepossiver
Junior Member
 
Location: São Paulo

Join Date: Oct 2012
Posts: 8
Default

I've started to experiment with BBDuk, and I would like some help understanding the output. Initially I understood "KTrimmed" as the results of kmer adapter removal, but after playing a bit it seems "KTrimmed" sums up kmer adapter removal and other trims, at least in some situations.

These tests are for 250bp MiSeq paired reads. BBDuk with minlen=51:

Code:
bbduk.sh -Xmx4g threads=4 in1=R1.fastq in2=R2.fastq out1=trimR1.fq out2=trimR2.fq outsingle=trimS.fq k=25 mink=12 hdist=1 ktrim=r ref=truseq.fa.gz,nextera.fa.gz minlen=51 tpe tbo

BBDuk version 34.86
Set threads to 4
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=4030m, used=86m

Added 161221 kmers; time: 	0.185 seconds.
Memory: free=3794m, used=322m

Input is being processed as paired
Started output streams:	0.009 seconds.
Processing time:   		28.745 seconds.

Input:                  	2967156 reads 		542374420 bases.
KTrimmed:               	132538 reads (4.47%) 	6891444 bases (1.27%)
Trimmed by overlap:     	9464 reads (0.32%) 	342404 bases (0.06%)
Result:                 	2859656 reads (96.38%) 	535140572 bases (98.67%)

Time:   			28.945 seconds.
Reads Processed:       2967k 	102.51k reads/sec
Bases Processed:        542m 	18.74m bases/sec
Then with minlen=201:

Code:
bbduk.sh -Xmx4g threads=4 in1=R1.fastq in2=R2.fastq out1=trimR1.fq out2=trimR2.fq outsingle=trimS.fq k=25 mink=12 hdist=1 ktrim=r ref=truseq.fa.gz,nextera.fa.gz minlen=201 tpe tbo

BBDuk version 34.86
Set threads to 4
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=4030m, used=86m

Added 161221 kmers; time: 	0.185 seconds.
Memory: free=3794m, used=322m

Input is being processed as paired
Started output streams:	0.011 seconds.
Processing time:   		23.263 seconds.

Input:                  	2967156 reads 		542374420 bases.
KTrimmed:               	1506815 reads (50.78%) 	184149438 bases (33.95%)
Trimmed by overlap:     	8290 reads (0.28%) 	273347 bases (0.05%)
Result:                 	1463616 reads (49.33%) 	357740553 bases (65.96%)

Time:   			23.465 seconds.
Reads Processed:       2967k 	126.45k reads/sec
Bases Processed:        542m 	23.11m bases/sec
Depending on minlen, "KTrimmed" reports very different numbers, but adapter contamination should be identical, am I correct? Wouldn't make more sense report reads removed for being too short under "QTrimmed", or maybe something as "LTrimmed"?

Finally, a warning to those unaware, and on the rare situation of using a computer with IBM Java (such as SUSE Enterprise): BBDuk (and GATK, Maven and some others) break under this Java, one has to install Oracle Java or OpenJava in order for them to work.

thanks for the fantastic tools, nep
nepossiver is offline   Reply With Quote
Old 04-19-2015, 11:15 PM   #115
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBDuk tracks which operation lead to particular bases being removed. So, if you have no minlength, a 250bp read trimmed via kmers to 190bp will yield 60bp ktrimmed; or, if it had no matching kmers but had low quality so that it was quality-trimmed to 190bp, it would yield 60bp qtrimmed.

But let's say you set "minlen=200". If the length drops below the threshold, all bases will be considered trimmed, and the entire read will be associated with the reason it was eliminated - so a low-quality read will increment qtrimmed by +250, while a read with adapter bases at position 190 would increment ktrimmed by +250. Furthermore, in the default mode (which can be overridden with the "rieb=f" [removeifeitherbad=false] flag), if either read is trimmed below the threshold, both reads are discarded - so if minlen=200 and read 2 was trimmed to 190bp, in a 2x250bp run, both would be discarded. So, 500bp would be added to either ktrimmed or qtrimmed, whichever triggered the removal.

I prefer this over having an "ltrimmed" line for bases that were removed because the reads were too short, because in that case, you can't tell what triggered the removal. Of course, each way has advantages - if you are only doing a single operation, "ltrimmed" would be preferable - but I usually run BBDuk with multiple operations per pass.
Brian Bushnell is offline   Reply With Quote
Old 04-24-2015, 03:02 PM   #116
nepossiver
Junior Member
 
Location: São Paulo

Join Date: Oct 2012
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
I prefer this over having an "ltrimmed" line for bases that were removed because the reads were too short, because in that case, you can't tell what triggered the removal. Of course, each way has advantages - if you are only doing a single operation, "ltrimmed" would be preferable - but I usually run BBDuk with multiple operations per pass.
Thanks for the response, I understand your reasoning. I guess I could call bbduk twice to do a "two-pass" trimming, first trimming by minlen only, then adapters+minlen. As BBTools are so fast it does not cost much anyway.

By the way, is tbo the same method as described in Front Genet. 2014; 5: 5?
nepossiver is offline   Reply With Quote
Old 04-24-2015, 03:24 PM   #117
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yes, it looks like the concept is the same, though the implementation is different as BBDuk uses BBMerge's overlapper code.
Brian Bushnell is offline   Reply With Quote
Old 05-14-2015, 03:03 AM   #118
IonTom
Member
 
Location: Germany

Join Date: Apr 2014
Posts: 32
Default

Dear Brian,

i just started using BBduk and am really happy with it.

There is just a question concerning RNASeq from the NextSeq500.

Which setting would you recommend for quality trimming, as the Phred scores seem quite high independent of the sample quality ? I have been trying trimq 15 so far.

Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?

Many thanks for your help

Last edited by IonTom; 05-14-2015 at 03:08 AM.
IonTom is offline   Reply With Quote
Old 05-14-2015, 04:51 AM   #119
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

Quote:
Originally Posted by IonTom View Post
Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?
repair.sh from BBMap can separate the unpaired reads.

Code:
$ repair.sh in1=read_1.q in2=read_2.fq out1=fixed_1.fq out2=fixed_2.fq outsingle=single.fq
GenoMax is offline   Reply With Quote
Old 05-14-2015, 10:28 AM   #120
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by IonTom View Post
There is just a question concerning RNASeq from the NextSeq500.

Which setting would you recommend for quality trimming, as the Phred scores seem quite high independent of the sample quality ? I have been trying trimq 15 so far.
It is difficult to effectively quality-trim NextSeq data because the quality scores are both binned and inaccurate (for V1; V2 is still binned, but more accurate). For V1 chemistry, on the raw reads, I would recommend 15, and that should be adequate. But it depends on what you are doing with the data. BBMap, for example, does not generally need any quality-trimming unless the data is really terrible.

If you want optimal quality-trimming for NextSeq data, and are willing to spend a little more time, and you have a reference transcriptome (for RNA-seq), I would suggest first recalibrating the quality scores, like this:

Code:
bbmap.sh ref=transcriptome.fa in=reads.fq out=mapped.sam reads=8m maxindel=200

calctruequality.sh in=mapped.sam

bbduk.sh in=reads.fq out=trimmed.fq recal qtrim=r trimq=15
In this case the quality will be recalibrated by BBDuk prior to trimming. If you choose to recalibrate quality, I highly recommend using the latest version of BBTools (34.94) as it has had some extensive improvements for recalibration recently, specifically for NextSeq.

Quote:
Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?

Many thanks for your help
For paired data, your input reads should be properly ordered prior to processing with BBDuk, and they will still be properly ordered after processing - pairs will always be kept together. In situations where one read is trimmed shorter than minlength and the other is not, both will be discarded. You can save the longer of the two reads by setting the "outs" stream, which will catch singletons.

If your pairing was broken prior to running BBDuk, then follow Genomax's suggestion and repair them first.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:19 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO