Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Thank you GenoMax,

    This really helped!

    Ksenia

    Comment


    • Hi,

      Is it possible to include some kind of filename wildcard in config files somehow?
      I'd like to output base quality, stats, and some other histogram data for each invocation of bbmap/bbduk, which makes the command line quite long. I was hoping it would be possible to e.g. include lines like this (or with some other wildcard identifier) in a config file which is included on the command line instead (with config=config.txt):

      statsfile={}.statsfile.txt
      bqhist={}.bqhist.txt

      to output differently named stats files and bqhist files for each input file.

      I'm just curious if someone else has ever wanted to do this. Of course, I could just write a small wrapper script to launch bbmap/bbduk with all the settings I would keep in a config file. That's probably simpler in the end anyway....
      Last edited by boulund; 06-30-2017, 05:04 AM.

      Comment


      • You can use a standard unix loop to iterate over all files, grab the file basenames and use them in individual bbmap tools.

        Comment


        • Originally posted by GenoMax View Post
          You can use a standard unix loop to iterate over all files, grab the file basenames and use them in individual bbmap tools.
          Yes, naturally. Thanks for the help, but I think I'm not making myself entirely clear.

          I was envisioning using a config file to contain all the standard parameters I always want to use, including the output and all stats/histogram files, but still assign in1= and in2= on the command line, like so:

          Code:
          bbduk.sh in1=file_1.fastq in2=file_2.fastq config=config.txt
          and have it produce output files something like specified in config.txt:

          Code:
          out1={}.trimmed.fastq
          out2={}.trimmed.fastq
          statsfile={}.stats.txt
          bqhist={}.bqhist.txt
          This is just a small, scaled down, example. I actually want to produce essentially all the different stats and histogram outputs, which is why it's so tedious to write on the command line.

          Of course, like I previously said, it's easy to just create a simple wrapper to do this for me, or use GNU parallel, or a bash for loop. I guess I'm just a bit lazy...

          Comment


          • Actually! I kind of dislike config files because I find them annoying. So nothing in BBTools requires them. But it does in fact support them, because sometimes they are convenient... particularly when your command line includes whitespace, or is more than however many characters your shell currently supports.

            For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file.

            I'm not currently planning to implement wildcards in BBTools, in general. They are a huge pain... I realize they're convenient, and certainly, my goal is to make BBTools as convenient as possible. However, it is not a good idea to sacrifice usability and readability (of the code) for a single odd use-case request. There are some programs (like comparesketch.sh) that will interpret everything on the command line without an "=" symbol as a reference sequence/sketch, so you can say "comparesketch.sh in=foo.fa *.sketch". But in general, they don't.

            That said, some programs in BBTools support wildcards, where it makes sense (to me), for paired Ilumina files. They will replace "#" with a number. The goal is generally to make it easier to work with paired Illumina files.

            -Brian
            Last edited by Brian Bushnell; 06-30-2017, 09:23 AM.

            Comment


            • Originally posted by Brian Bushnell View Post
              Actually! I kind of dislike config files because I find them annoying. So nothing in BBTools requires them. But it does in fact support them, because sometimes they are convenient... particularly when your command line includes whitespace, or is more than however many characters your shell currently supports.

              For most programs in BBTools, you can add the parameter "config=foo.txt". Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file.

              I'm not currently planning to implement wildcards in BBTools, in general. They are a huge pain... I realize they're convenient, and certainly, my goal is to make BBTools as convenient as possible. However, it is not a good idea to sacrifice usability and readability (of the code) for a single odd use-case request. There are some programs (like comparesketch.sh) that will interpret everything on the command line without an "=" symbol as a reference sequence/sketch, so you can say "comparesketch.sh in=foo.fa *.sketch". But in general, they don't.

              That said, some programs in BBTools support wildcards, where it makes sense (to me), for paired Ilumina files. They will replace "#" with a number. The goal is generally to make it easier to work with paired Illumina files.

              -Brian
              Hi, thanks for your reply!

              I understand your sentiment. I actually agree with you on the annoyance on config files for the most part, but just stumbled upon the config-file features of BBTools recently and was just curious to see what strange corner cases I could wring out of it .

              Thanks for the exhaustive reply, it really answers all my questions completely. I'm a big fan of all the BBTools, use them almost daily!

              Comment


              • Hi Brian,

                I have questions on the parameter setting for bbduk2.sh.

                I found that when I set both K-mer trimming and minlength and maxlength, for the input reads, if they don't contain the expected K-mer on either end of the read, however if they are falling in the correct size range, they will be kept for downstream analysis. In other words, the k-mer and length criterion are not "and" but "or". Is that how you design the tools? What if I want to meet both criterion? Only do it sequentially?

                Thanks in advance!

                Comment


                • There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?

                  Comment


                  • Originally posted by Brian Bushnell View Post
                    There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?
                    Thanks for your reply. Good to know that has been fixed! The version I used is BBDuk2 version 36.71.

                    A similar question is:
                    If I set up the following three parameters:
                    1) hammingdistance=$lprimerMM k=$llk lliteral=$lp
                    2) trimq=${QUAL} qtrim=rl
                    3) mlf=1.0

                    What will be kept in outm are those reads which either meet both 1) and 3) or meet both 2) and 3), not as what I expected: 1) and 2) and 3).

                    I would like to hear your thoughts and get updated as well.

                    Thank you!

                    Comment


                    • Oh, I should mention... I don't really touch BBDuk2's code any more. I'm planning on deprecating it, since it's a lot of effort to keep two almost-identical programs in sync, and BBDuk2's only advantage was doing left and right kmer-trimming at the same time, which is a really rare use-case and can be done in 2 passes with BBDuk anyway. So please use BBDuk instead. I'll delete bbduk2.sh in the next release.

                      As for your question, "outm" receives everything that fails the filter criteria. In this case, anything that matches a kmer (and thus gets left-trimmed) OR gets quality-trimmed should go to outm.

                      Comment


                      • Originally posted by Brian Bushnell View Post
                        Oh, I should mention... I don't really touch BBDuk2's code any more. I'm planning on deprecating it, since it's a lot of effort to keep two almost-identical programs in sync, and BBDuk2's only advantage was doing left and right kmer-trimming at the same time, which is a really rare use-case and can be done in 2 passes with BBDuk anyway. So please use BBDuk instead. I'll delete bbduk2.sh in the next release.

                        As for your question, "outm" receives everything that fails the filter criteria. In this case, anything that matches a kmer (and thus gets left-trimmed) OR gets quality-trimmed should go to outm.
                        Sounds reasonable. Then I will stay with bbduk.

                        As for quality-trimming, if I ask for bases below q20 from both left and right end will be trimmed, how do you define left end and right end for a 75 nt read?

                        Comment


                        • Originally posted by netasha View Post
                          Sounds reasonable. Then I will stay with bbduk.

                          As for quality-trimming, if I ask for bases below q20 from both left and right end will be trimmed, how do you define left end and right end for a 75 nt read?
                          1) Generally, I consider Q20 too high for quality trimming and think it is probably detrimental to most analyses (not because removing low-quality sequence is bad, but because with Illumina sequence affects quality, and thus quality-trimming incurs sequence-bias). I tend to use ~Q12 as a maximum. But it depends on what you're doing and the quality of your data. Also, I no longer recommend setting "qtrim=rl", but rather "qtrim=r" (for reads that have not been recalibrated), because Illumina erroneously gives low quality scores to the first few bases. They are typically very high quality, but for whatever reason Illumina likes to assign them low quality scores. So you should not trim them (unless you recalibrate the quality scores so that they are accurate).

                          2) I define the left and right ends as what you see in a fastq file. So, the left end is 5' and the right end is 3'. If you have a 75bp read and set "qtrim=rl trimq=20", then on both ends, regions with average quality below 20 will be removed. It's a bit more nuanced than that - specifically, it maximizes the length of the central region with average quality above 20 that cannot be extended to include additional regions with average quality above 20, where average means logarithmic average after converting to error probability. This gives optimal results, if the quality scores are correct. But typically, setting "qtrim=rl trimq=20", you will see reads with bases under Q20 cut off from each end since Illumina quality scores generally form a hump-like pattern.
                          Last edited by Brian Bushnell; 07-14-2017, 11:27 AM.

                          Comment


                          • Originally posted by Brian Bushnell View Post
                            1) Generally, I consider Q20 too high for quality trimming and think it is probably detrimental to most analyses (not because removing low-quality sequence is bad, but because with Illumina sequence affects quality, and thus quality-trimming incurs sequence-bias). I tend to use ~Q12 as a maximum. But it depends on what you're doing and the quality of your data. Also, I no longer recommend setting "qtrim=rl", but rather "qtrim=r" (for reads that have not been recalibrated), because Illumina erroneously gives low quality scores to the first few bases. They are typically very high quality, but for whatever reason Illumina likes to assign them low quality scores. So you should not trim them (unless you recalibrate the quality scores so that they are accurate).

                            2) I define the left and right ends as what you see in a fastq file. So, the left end is 5' and the right end is 3'. If you have a 75bp read and set "qtrim=rl trimq=20", then on both ends, regions with average quality below 20 will be removed. It's a bit more nuanced than that - specifically, it maximizes the length of the central region with average quality above 20 that cannot be extended to include additional regions with average quality above 20, where average means logarithmic average after converting to error probability. This gives optimal results, if the quality scores are correct. But typically, setting "qtrim=rl trimq=20", you will see reads with bases under Q20 cut off from each end since Illumina quality scores generally form a hump-like pattern.
                            Thanks for your detailed explanation.
                            Those are really good suggestions and I will definitely take it! Thank you so much!

                            Comment


                            • Originally posted by Brian Bushnell View Post
                              There was a bug in a previous version of BBDuk where minlength/maxlength were only checked if trimming occurred. That should be fixed in the latest version. What version are you using?
                              Hi Brian,

                              I have updated the bbtools to the most recent version and switched to bbduk.sh.

                              I was trying to use the mink to ask for a subset of k-mer for trimming. However, once I added the setting of mink, I found the minlength and maxlength failed to work.

                              After quality trimming, it might affect the primer I was trying to trim. I believe mink can save those reads with a small number of bases trimmed.

                              Do you have any suggestions?

                              Thanks!

                              Comment


                              • Hi,

                                I am new to bbduk and I have been playing with a simple test dataset to better understand bbduk options and the order of the individual processing steps.
                                If I use a test fastq file:
                                Code:
                                @test1
                                [COLOR="Red"]AAACCCTTT[/COLOR]ATAT[COLOR="Red"]AACCTT[/COLOR]
                                +
                                IIIIIIIIIIIIIIIIIII
                                with the sequence ATAT and two hypothetical adapters,
                                AAACCCTTT and AACCTT on the left and right side respectively, I would assume that

                                Code:
                                ./bbduk.sh in=test.fq out=stdout.fq literal=AACCTT ktrim=r k=6
                                will result in AAACCCTTTATAT, which is what I observe:

                                Code:
                                @test1
                                AAACCCTTTATAT
                                +
                                IIIIIIIIIIIII
                                
                                Input:                  	1 reads 		19 bases.
                                KTrimmed:               	1 reads (100.00%) 	6 bases (31.58%)
                                Total Removed:          	0 reads (0.00%) 	6 bases (31.58%)
                                Result:                 	1 reads (100.00%) 	13 bases (68.42%)
                                If I use a slightly longer version of the right adapter:
                                Code:
                                @test1
                                [COLOR="Red"]AAACCCTTT[/COLOR]ATAT[COLOR="Red"]AAACCCTTT[/COLOR]
                                +
                                IIIIIIIIIIIIIIIIIIIIII
                                this
                                Code:
                                ./bbduk.sh in=test.fq out=stdout.fq literal=AAACCCTTT ktrim=r k=9
                                does not result in the expected AAACCCTTTATAT, but removes all bases:

                                Code:
                                Input:                  	1 reads 		22 bases.
                                KTrimmed:               	1 reads (100.00%) 	22 bases (100.00%)
                                Total Removed:          	1 reads (100.00%) 	22 bases (100.00%)
                                Result:                 	0 reads (0.00%) 	0 bases (0.00%)
                                and outm reports the following:
                                Code:
                                @test1
                                A
                                +
                                I
                                If I use ktrim=n instead of ktrim=r, I get the expected result i.e. NNNNNNNNNATATNNNNNNNNN. Am I doing something wrong?

                                /I am using BBDuk version 37.36 on CentOS 7

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                71 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                80 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X