Seqanswers Leaderboard Ad

**KseniaA** · 06-30-2017, 12:30 AM

Thank you GenoMax,

This really helped!

Ksenia

**boulund** · 06-30-2017, 04:58 AM

Hi,

Is it possible to include some kind of filename wildcard in config files somehow?
I'd like to output base quality, stats, and some other histogram data for each invocation of bbmap/bbduk, which makes the command line quite long. I was hoping it would be possible to e.g. include lines like this (or with some other wildcard identifier) in a config file which is included on the command line instead (with config=config.txt):

statsfile={}.statsfile.txt
bqhist={}.bqhist.txt

to output differently named stats files and bqhist files for each input file.

I'm just curious if someone else has ever wanted to do this. Of course, I could just write a small wrapper script to launch bbmap/bbduk with all the settings I would keep in a config file. That's probably simpler in the end anyway....

**GenoMax** · 06-30-2017, 07:57 AM

You can use a standard unix loop to iterate over all files, grab the file basenames and use them in individual bbmap tools.

**boulund** · 06-30-2017, 08:18 AM

Originally posted by GenoMax View Post

You can use a standard unix loop to iterate over all files, grab the file basenames and use them in individual bbmap tools.

Yes, naturally. Thanks for the help, but I think I'm not making myself entirely clear.

I was envisioning using a config file to contain all the standard parameters I always want to use, including the output and all stats/histogram files, but still assign in1= and in2= on the command line, like so:

Code:

bbduk.sh in1=file_1.fastq in2=file_2.fastq config=config.txt

and have it produce output files something like specified in config.txt:

Code:

out1={}.trimmed.fastq
out2={}.trimmed.fastq
statsfile={}.stats.txt
bqhist={}.bqhist.txt

This is just a small, scaled down, example. I actually want to produce essentially all the different stats and histogram outputs, which is why it's so tedious to write on the command line.

Of course, like I previously said, it's easy to just create a simple wrapper to do this for me, or use GNU parallel, or a bash for loop. I guess I'm just a bit lazy...

**Brian Bushnell** · 06-30-2017, 09:03 AM

Actually! I kind of dislike config files because I find them annoying. So nothing in BBTools requires them. But it does in fact support them, because sometimes they are convenient... particularly when your command line includes whitespace, or is more than however many characters your shell currently supports.

For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file.

I'm not currently planning to implement wildcards in BBTools, in general. They are a huge pain... I realize they're convenient, and certainly, my goal is to make BBTools as convenient as possible. However, it is not a good idea to sacrifice usability and readability (of the code) for a single odd use-case request. There are some programs (like comparesketch.sh) that will interpret everything on the command line without an "=" symbol as a reference sequence/sketch, so you can say "comparesketch.sh in=foo.fa *.sketch". But in general, they don't.

That said, some programs in BBTools support wildcards, where it makes sense (to me), for paired Ilumina files. They will replace "#" with a number. The goal is generally to make it easier to work with paired Illumina files.

-Brian

**boulund** · 06-30-2017, 09:33 AM

Originally posted by Brian Bushnell View Post

Actually! I kind of dislike config files because I find them annoying. So nothing in BBTools requires them. But it does in fact support them, because sometimes they are convenient... particularly when your command line includes whitespace, or is more than however many characters your shell currently supports.

For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file.

I'm not currently planning to implement wildcards in BBTools, in general. They are a huge pain... I realize they're convenient, and certainly, my goal is to make BBTools as convenient as possible. However, it is not a good idea to sacrifice usability and readability (of the code) for a single odd use-case request. There are some programs (like comparesketch.sh) that will interpret everything on the command line without an "=" symbol as a reference sequence/sketch, so you can say "comparesketch.sh in=foo.fa *.sketch". But in general, they don't.

That said, some programs in BBTools support wildcards, where it makes sense (to me), for paired Ilumina files. They will replace "#" with a number. The goal is generally to make it easier to work with paired Illumina files.

-Brian

Hi, thanks for your reply!

I understand your sentiment. I actually agree with you on the annoyance on config files for the most part, but just stumbled upon the config-file features of BBTools recently and was just curious to see what strange corner cases I could wring out of it

.

Thanks for the exhaustive reply, it really answers all my questions completely. I'm a big fan of all the BBTools, use them almost daily!

**netasha** · 07-14-2017, 07:41 AM

Hi Brian,

I have questions on the parameter setting for bbduk2.sh.

I found that when I set both K-mer trimming and minlength and maxlength, for the input reads, if they don't contain the expected K-mer on either end of the read, however if they are falling in the correct size range, they will be kept for downstream analysis. In other words, the k-mer and length criterion are not "and" but "or". Is that how you design the tools? What if I want to meet both criterion? Only do it sequentially?

Thanks in advance!

**Brian Bushnell** · 07-14-2017, 09:36 AM

There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?

**netasha** · 07-14-2017, 09:43 AM

Originally posted by Brian Bushnell View Post

There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?

Thanks for your reply. Good to know that has been fixed! The version I used is BBDuk2 version 36.71.

A similar question is:
If I set up the following three parameters:
1) hammingdistance=$lprimerMM k=$llk lliteral=$lp
2) trimq=${QUAL} qtrim=rl
3) mlf=1.0

What will be kept in outm are those reads which either meet both 1) and 3) or meet both 2) and 3), not as what I expected: 1) and 2) and 3).

I would like to hear your thoughts and get updated as well.

Thank you!

**Brian Bushnell** · 07-14-2017, 09:55 AM

Oh, I should mention... I don't really touch BBDuk2's code any more. I'm planning on deprecating it, since it's a lot of effort to keep two almost-identical programs in sync, and BBDuk2's only advantage was doing left and right kmer-trimming at the same time, which is a really rare use-case and can be done in 2 passes with BBDuk anyway. So please use BBDuk instead. I'll delete bbduk2.sh in the next release.

As for your question, "outm" receives everything that fails the filter criteria. In this case, anything that matches a kmer (and thus gets left-trimmed) OR gets quality-trimmed should go to outm.

**netasha** · 07-14-2017, 11:02 AM

Originally posted by Brian Bushnell View Post

Oh, I should mention... I don't really touch BBDuk2's code any more. I'm planning on deprecating it, since it's a lot of effort to keep two almost-identical programs in sync, and BBDuk2's only advantage was doing left and right kmer-trimming at the same time, which is a really rare use-case and can be done in 2 passes with BBDuk anyway. So please use BBDuk instead. I'll delete bbduk2.sh in the next release.

As for your question, "outm" receives everything that fails the filter criteria. In this case, anything that matches a kmer (and thus gets left-trimmed) OR gets quality-trimmed should go to outm.

Sounds reasonable. Then I will stay with bbduk.

As for quality-trimming, if I ask for bases below q20 from both left and right end will be trimmed, how do you define left end and right end for a 75 nt read?

**Brian Bushnell** · 07-14-2017, 11:25 AM

Originally posted by netasha View Post

Sounds reasonable. Then I will stay with bbduk.

As for quality-trimming, if I ask for bases below q20 from both left and right end will be trimmed, how do you define left end and right end for a 75 nt read?

1) Generally, I consider Q20 too high for quality trimming and think it is probably detrimental to most analyses (not because removing low-quality sequence is bad, but because with Illumina sequence affects quality, and thus quality-trimming incurs sequence-bias). I tend to use ~Q12 as a maximum. But it depends on what you're doing and the quality of your data. Also, I no longer recommend setting "qtrim=rl", but rather "qtrim=r" (for reads that have not been recalibrated), because Illumina erroneously gives low quality scores to the first few bases. They are typically very high quality, but for whatever reason Illumina likes to assign them low quality scores. So you should not trim them (unless you recalibrate the quality scores so that they are accurate).

2) I define the left and right ends as what you see in a fastq file. So, the left end is 5' and the right end is 3'. If you have a 75bp read and set "qtrim=rl trimq=20", then on both ends, regions with average quality below 20 will be removed. It's a bit more nuanced than that - specifically, it maximizes the length of the central region with average quality above 20 that cannot be extended to include additional regions with average quality above 20, where average means logarithmic average after converting to error probability. This gives optimal results, if the quality scores are correct. But typically, setting "qtrim=rl trimq=20", you will see reads with bases under Q20 cut off from each end since Illumina quality scores generally form a hump-like pattern.

**netasha** · 07-14-2017, 01:08 PM

Originally posted by Brian Bushnell View Post

1) Generally, I consider Q20 too high for quality trimming and think it is probably detrimental to most analyses (not because removing low-quality sequence is bad, but because with Illumina sequence affects quality, and thus quality-trimming incurs sequence-bias). I tend to use ~Q12 as a maximum. But it depends on what you're doing and the quality of your data. Also, I no longer recommend setting "qtrim=rl", but rather "qtrim=r" (for reads that have not been recalibrated), because Illumina erroneously gives low quality scores to the first few bases. They are typically very high quality, but for whatever reason Illumina likes to assign them low quality scores. So you should not trim them (unless you recalibrate the quality scores so that they are accurate).

2) I define the left and right ends as what you see in a fastq file. So, the left end is 5' and the right end is 3'. If you have a 75bp read and set "qtrim=rl trimq=20", then on both ends, regions with average quality below 20 will be removed. It's a bit more nuanced than that - specifically, it maximizes the length of the central region with average quality above 20 that cannot be extended to include additional regions with average quality above 20, where average means logarithmic average after converting to error probability. This gives optimal results, if the quality scores are correct. But typically, setting "qtrim=rl trimq=20", you will see reads with bases under Q20 cut off from each end since Illumina quality scores generally form a hump-like pattern.

Thanks for your detailed explanation.
Those are really good suggestions and I will definitely take it! Thank you so much!

**netasha** · 07-15-2017, 07:44 AM

Originally posted by Brian Bushnell View Post

There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?

Hi Brian,

I have updated the bbtools to the most recent version and switched to bbduk.sh.

I was trying to use the mink to ask for a subset of k-mer for trimming. However, once I added the setting of mink, I found the minlength and maxlength failed to work.

After quality trimming, it might affect the primer I was trying to trim. I believe mink can save those reads with a small number of bases trimmed.

Do you have any suggestions?

Thanks!

**Stimpsky** · 07-20-2017, 06:41 AM

Hi,

I am new to bbduk and I have been playing with a simple test dataset to better understand bbduk options and the order of the individual processing steps.
If I use a test fastq file:

Code:

@test1
[COLOR="Red"]AAACCCTTT[/COLOR]ATAT[COLOR="Red"]AACCTT[/COLOR]
+
IIIIIIIIIIIIIIIIIII

with the sequence ATAT and two hypothetical adapters,
AAACCCTTT and AACCTT on the left and right side respectively, I would assume that

Code:

./bbduk.sh in=test.fq out=stdout.fq literal=AACCTT ktrim=r k=6

will result in AAACCCTTTATAT, which is what I observe:

Code:

@test1
AAACCCTTTATAT
+
IIIIIIIIIIIII

Input:                  	1 reads 		19 bases.
KTrimmed:               	1 reads (100.00%) 	6 bases (31.58%)
Total Removed:          	0 reads (0.00%) 	6 bases (31.58%)
Result:                 	1 reads (100.00%) 	13 bases (68.42%)

If I use a slightly longer version of the right adapter:

Code:

@test1
[COLOR="Red"]AAACCCTTT[/COLOR]ATAT[COLOR="Red"]AAACCCTTT[/COLOR]
+
IIIIIIIIIIIIIIIIIIIIII

this

Code:

./bbduk.sh in=test.fq out=stdout.fq literal=AAACCCTTT ktrim=r k=9

does not result in the expected AAACCCTTTATAT, but removes all bases:

Code:

Input:                  	1 reads 		22 bases.
KTrimmed:               	1 reads (100.00%) 	22 bases (100.00%)
Total Removed:          	1 reads (100.00%) 	22 bases (100.00%)
Result:                 	0 reads (0.00%) 	0 bases (0.00%)

and outm reports the following:

Code:

@test1
A
+
I

If I use ktrim=n instead of ktrim=r, I get the expected result i.e. NNNNNNNNNATATNNNNNNNNN. Am I doing something wrong?

/I am using BBDuk version 37.36 on CentOS 7

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News