SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 132 04-18-2017 01:04 AM
Adapter trimming figo1019 RNA Sequencing 1 04-07-2014 10:58 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 04:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 07:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 12:53 PM

Reply
 
Thread Tools
Old 06-30-2017, 12:30 AM   #241
KseniaA
Junior Member
 
Location: Europe

Join Date: Jun 2017
Posts: 3
Default

Thank you GenoMax,

This really helped!

Ksenia
KseniaA is offline   Reply With Quote
Old 06-30-2017, 04:58 AM   #242
boulund
Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 14
Default

Hi,

Is it possible to include some kind of filename wildcard in config files somehow?
I'd like to output base quality, stats, and some other histogram data for each invocation of bbmap/bbduk, which makes the command line quite long. I was hoping it would be possible to e.g. include lines like this (or with some other wildcard identifier) in a config file which is included on the command line instead (with config=config.txt):

statsfile={}.statsfile.txt
bqhist={}.bqhist.txt

to output differently named stats files and bqhist files for each input file.

I'm just curious if someone else has ever wanted to do this. Of course, I could just write a small wrapper script to launch bbmap/bbduk with all the settings I would keep in a config file. That's probably simpler in the end anyway....

Last edited by boulund; 06-30-2017 at 05:04 AM.
boulund is offline   Reply With Quote
Old 06-30-2017, 07:57 AM   #243
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,495
Default

You can use a standard unix loop to iterate over all files, grab the file basenames and use them in individual bbmap tools.
GenoMax is offline   Reply With Quote
Old 06-30-2017, 08:18 AM   #244
boulund
Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 14
Default

Quote:
Originally Posted by GenoMax View Post
You can use a standard unix loop to iterate over all files, grab the file basenames and use them in individual bbmap tools.
Yes, naturally. Thanks for the help, but I think I'm not making myself entirely clear.

I was envisioning using a config file to contain all the standard parameters I always want to use, including the output and all stats/histogram files, but still assign in1= and in2= on the command line, like so:

Code:
bbduk.sh in1=file_1.fastq in2=file_2.fastq config=config.txt
and have it produce output files something like specified in config.txt:

Code:
out1={}.trimmed.fastq
out2={}.trimmed.fastq
statsfile={}.stats.txt
bqhist={}.bqhist.txt
This is just a small, scaled down, example. I actually want to produce essentially all the different stats and histogram outputs, which is why it's so tedious to write on the command line.

Of course, like I previously said, it's easy to just create a simple wrapper to do this for me, or use GNU parallel, or a bash for loop. I guess I'm just a bit lazy...
boulund is offline   Reply With Quote
Old 06-30-2017, 09:03 AM   #245
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Actually! I kind of dislike config files because I find them annoying. So nothing in BBTools requires them. But it does in fact support them, because sometimes they are convenient... particularly when your command line includes whitespace, or is more than however many characters your shell currently supports.

For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file.

I'm not currently planning to implement wildcards in BBTools, in general. They are a huge pain... I realize they're convenient, and certainly, my goal is to make BBTools as convenient as possible. However, it is not a good idea to sacrifice usability and readability (of the code) for a single odd use-case request. There are some programs (like comparesketch.sh) that will interpret everything on the command line without an "=" symbol as a reference sequence/sketch, so you can say "comparesketch.sh in=foo.fa *.sketch". But in general, they don't.

That said, some programs in BBTools support wildcards, where it makes sense (to me), for paired Ilumina files. They will replace "#" with a number. The goal is generally to make it easier to work with paired Illumina files.

-Brian

Last edited by Brian Bushnell; 06-30-2017 at 09:23 AM.
Brian Bushnell is offline   Reply With Quote
Old 06-30-2017, 09:33 AM   #246
boulund
Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 14
Default

Quote:
Originally Posted by Brian Bushnell View Post
Actually! I kind of dislike config files because I find them annoying. So nothing in BBTools requires them. But it does in fact support them, because sometimes they are convenient... particularly when your command line includes whitespace, or is more than however many characters your shell currently supports.

For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file.

I'm not currently planning to implement wildcards in BBTools, in general. They are a huge pain... I realize they're convenient, and certainly, my goal is to make BBTools as convenient as possible. However, it is not a good idea to sacrifice usability and readability (of the code) for a single odd use-case request. There are some programs (like comparesketch.sh) that will interpret everything on the command line without an "=" symbol as a reference sequence/sketch, so you can say "comparesketch.sh in=foo.fa *.sketch". But in general, they don't.

That said, some programs in BBTools support wildcards, where it makes sense (to me), for paired Ilumina files. They will replace "#" with a number. The goal is generally to make it easier to work with paired Illumina files.

-Brian
Hi, thanks for your reply!

I understand your sentiment. I actually agree with you on the annoyance on config files for the most part, but just stumbled upon the config-file features of BBTools recently and was just curious to see what strange corner cases I could wring out of it .

Thanks for the exhaustive reply, it really answers all my questions completely. I'm a big fan of all the BBTools, use them almost daily!
boulund is offline   Reply With Quote
Old 07-14-2017, 07:41 AM   #247
netasha
Junior Member
 
Location: USA

Join Date: Jul 2011
Posts: 8
Default

Hi Brian,

I have questions on the parameter setting for bbduk2.sh.

I found that when I set both K-mer trimming and minlength and maxlength, for the input reads, if they don't contain the expected K-mer on either end of the read, however if they are falling in the correct size range, they will be kept for downstream analysis. In other words, the k-mer and length criterion are not "and" but "or". Is that how you design the tools? What if I want to meet both criterion? Only do it sequentially?

Thanks in advance!
netasha is offline   Reply With Quote
Old 07-14-2017, 09:36 AM   #248
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?
Brian Bushnell is offline   Reply With Quote
Old 07-14-2017, 09:43 AM   #249
netasha
Junior Member
 
Location: USA

Join Date: Jul 2011
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?
Thanks for your reply. Good to know that has been fixed! The version I used is BBDuk2 version 36.71.

A similar question is:
If I set up the following three parameters:
1) hammingdistance=$lprimerMM k=$llk lliteral=$lp
2) trimq=${QUAL} qtrim=rl
3) mlf=1.0

What will be kept in outm are those reads which either meet both 1) and 3) or meet both 2) and 3), not as what I expected: 1) and 2) and 3).

I would like to hear your thoughts and get updated as well.

Thank you!
netasha is offline   Reply With Quote
Old 07-14-2017, 09:55 AM   #250
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Oh, I should mention... I don't really touch BBDuk2's code any more. I'm planning on deprecating it, since it's a lot of effort to keep two almost-identical programs in sync, and BBDuk2's only advantage was doing left and right kmer-trimming at the same time, which is a really rare use-case and can be done in 2 passes with BBDuk anyway. So please use BBDuk instead. I'll delete bbduk2.sh in the next release.

As for your question, "outm" receives everything that fails the filter criteria. In this case, anything that matches a kmer (and thus gets left-trimmed) OR gets quality-trimmed should go to outm.
Brian Bushnell is offline   Reply With Quote
Old 07-14-2017, 11:02 AM   #251
netasha
Junior Member
 
Location: USA

Join Date: Jul 2011
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
Oh, I should mention... I don't really touch BBDuk2's code any more. I'm planning on deprecating it, since it's a lot of effort to keep two almost-identical programs in sync, and BBDuk2's only advantage was doing left and right kmer-trimming at the same time, which is a really rare use-case and can be done in 2 passes with BBDuk anyway. So please use BBDuk instead. I'll delete bbduk2.sh in the next release.

As for your question, "outm" receives everything that fails the filter criteria. In this case, anything that matches a kmer (and thus gets left-trimmed) OR gets quality-trimmed should go to outm.
Sounds reasonable. Then I will stay with bbduk.

As for quality-trimming, if I ask for bases below q20 from both left and right end will be trimmed, how do you define left end and right end for a 75 nt read?
netasha is offline   Reply With Quote
Old 07-14-2017, 11:25 AM   #252
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by netasha View Post
Sounds reasonable. Then I will stay with bbduk.

As for quality-trimming, if I ask for bases below q20 from both left and right end will be trimmed, how do you define left end and right end for a 75 nt read?
1) Generally, I consider Q20 too high for quality trimming and think it is probably detrimental to most analyses (not because removing low-quality sequence is bad, but because with Illumina sequence affects quality, and thus quality-trimming incurs sequence-bias). I tend to use ~Q12 as a maximum. But it depends on what you're doing and the quality of your data. Also, I no longer recommend setting "qtrim=rl", but rather "qtrim=r" (for reads that have not been recalibrated), because Illumina erroneously gives low quality scores to the first few bases. They are typically very high quality, but for whatever reason Illumina likes to assign them low quality scores. So you should not trim them (unless you recalibrate the quality scores so that they are accurate).

2) I define the left and right ends as what you see in a fastq file. So, the left end is 5' and the right end is 3'. If you have a 75bp read and set "qtrim=rl trimq=20", then on both ends, regions with average quality below 20 will be removed. It's a bit more nuanced than that - specifically, it maximizes the length of the central region with average quality above 20 that cannot be extended to include additional regions with average quality above 20, where average means logarithmic average after converting to error probability. This gives optimal results, if the quality scores are correct. But typically, setting "qtrim=rl trimq=20", you will see reads with bases under Q20 cut off from each end since Illumina quality scores generally form a hump-like pattern.

Last edited by Brian Bushnell; 07-14-2017 at 11:27 AM.
Brian Bushnell is offline   Reply With Quote
Old 07-14-2017, 01:08 PM   #253
netasha
Junior Member
 
Location: USA

Join Date: Jul 2011
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
1) Generally, I consider Q20 too high for quality trimming and think it is probably detrimental to most analyses (not because removing low-quality sequence is bad, but because with Illumina sequence affects quality, and thus quality-trimming incurs sequence-bias). I tend to use ~Q12 as a maximum. But it depends on what you're doing and the quality of your data. Also, I no longer recommend setting "qtrim=rl", but rather "qtrim=r" (for reads that have not been recalibrated), because Illumina erroneously gives low quality scores to the first few bases. They are typically very high quality, but for whatever reason Illumina likes to assign them low quality scores. So you should not trim them (unless you recalibrate the quality scores so that they are accurate).

2) I define the left and right ends as what you see in a fastq file. So, the left end is 5' and the right end is 3'. If you have a 75bp read and set "qtrim=rl trimq=20", then on both ends, regions with average quality below 20 will be removed. It's a bit more nuanced than that - specifically, it maximizes the length of the central region with average quality above 20 that cannot be extended to include additional regions with average quality above 20, where average means logarithmic average after converting to error probability. This gives optimal results, if the quality scores are correct. But typically, setting "qtrim=rl trimq=20", you will see reads with bases under Q20 cut off from each end since Illumina quality scores generally form a hump-like pattern.
Thanks for your detailed explanation.
Those are really good suggestions and I will definitely take it! Thank you so much!
netasha is offline   Reply With Quote
Old 07-15-2017, 07:44 AM   #254
netasha
Junior Member
 
Location: USA

Join Date: Jul 2011
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?
Hi Brian,

I have updated the bbtools to the most recent version and switched to bbduk.sh.

I was trying to use the mink to ask for a subset of k-mer for trimming. However, once I added the setting of mink, I found the minlength and maxlength failed to work.

After quality trimming, it might affect the primer I was trying to trim. I believe mink can save those reads with a small number of bases trimmed.

Do you have any suggestions?

Thanks!
netasha is offline   Reply With Quote
Old 07-20-2017, 06:41 AM   #255
Stimpsky
Junior Member
 
Location: Singapore

Join Date: Nov 2009
Posts: 2
Default

Hi,

I am new to bbduk and I have been playing with a simple test dataset to better understand bbduk options and the order of the individual processing steps.
If I use a test fastq file:
Code:
@test1
AAACCCTTTATATAACCTT
+
IIIIIIIIIIIIIIIIIII
with the sequence ATAT and two hypothetical adapters,
AAACCCTTT and AACCTT on the left and right side respectively, I would assume that

Code:
./bbduk.sh in=test.fq out=stdout.fq literal=AACCTT ktrim=r k=6
will result in AAACCCTTTATAT, which is what I observe:

Code:
@test1
AAACCCTTTATAT
+
IIIIIIIIIIIII

Input:                  	1 reads 		19 bases.
KTrimmed:               	1 reads (100.00%) 	6 bases (31.58%)
Total Removed:          	0 reads (0.00%) 	6 bases (31.58%)
Result:                 	1 reads (100.00%) 	13 bases (68.42%)
If I use a slightly longer version of the right adapter:
Code:
@test1
AAACCCTTTATATAAACCCTTT
+
IIIIIIIIIIIIIIIIIIIIII
this
Code:
./bbduk.sh in=test.fq out=stdout.fq literal=AAACCCTTT ktrim=r k=9
does not result in the expected AAACCCTTTATAT, but removes all bases:

Code:
Input:                  	1 reads 		22 bases.
KTrimmed:               	1 reads (100.00%) 	22 bases (100.00%)
Total Removed:          	1 reads (100.00%) 	22 bases (100.00%)
Result:                 	0 reads (0.00%) 	0 bases (0.00%)
and outm reports the following:
Code:
@test1
A
+
I
If I use ktrim=n instead of ktrim=r, I get the expected result i.e. NNNNNNNNNATATNNNNNNNNN. Am I doing something wrong?

/I am using BBDuk version 37.36 on CentOS 7
Stimpsky is offline   Reply With Quote
Old 07-20-2017, 06:49 AM   #256
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,495
Default

Because your literal adapter is also matching the beginning of the read entire sequence to the right is removed (ktrim=r). If you change the sequence at the beginning by one base you will see that the initial part is retained as you expect.

Last edited by GenoMax; 07-20-2017 at 06:59 AM.
GenoMax is offline   Reply With Quote
Old 07-20-2017, 07:28 AM   #257
Stimpsky
Junior Member
 
Location: Singapore

Join Date: Nov 2009
Posts: 2
Default

Ah OK, that explains it, nevermind. I thought ktrim=r would go from right to left and stop at the first hit. Thanks for the quick reply!
Stimpsky is offline   Reply With Quote
Old 08-11-2017, 03:34 AM   #258
cuencam
Junior Member
 
Location: Zurich

Join Date: Aug 2017
Posts: 5
Default

Hi Brian,
I would like to switch from solexaqa to BBduk for my read trim and filtering option, however since we are working with bacterial strains typification we would like to have the option to only keep trimmed reads where no individual base has a quality lower than a defined threshold (instead of average region quality). Could this be done with BBduk?

Thanks!
cuencam is offline   Reply With Quote
Old 08-14-2017, 10:25 AM   #259
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi cuencam,

It is currently not possible to do this, other than discarding all reads that have any undefined (quality 0) bases with "maxns=0". I never saw a reason to discard all reads with a single base below a specified cutoff. It would be simple enough to add (or implement via a custom script), but can you explain why you're doing it? The "average region quality" method is the best at maximizing coverage while minimizing the total number of errors.

Edit - anyway, it was quick to add, so it will be in the next release as the "mbq" ("minbasequality") flag.

Last edited by Brian Bushnell; 08-14-2017 at 10:44 AM.
Brian Bushnell is offline   Reply With Quote
Old 08-15-2017, 12:29 AM   #260
cuencam
Junior Member
 
Location: Zurich

Join Date: Aug 2017
Posts: 5
Default

Hi Brian,
Thanks for such a quick response, and implementing this so fast!
Our interest is that during SNV calling in low coverage regions or low abundance taxa (in metagenomics) base quality can be more important than coverage. This way we can assess properly different alleles and avoid the creation of artifacts
Cheers!

Edit - Do you have an estimated next release date?

Last edited by cuencam; 08-15-2017 at 12:53 AM. Reason: Adding a question
cuencam is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:49 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO