Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Ah! You have "out1m" rather than "outm1". Hmmm, maybe I should make that also acceptable.

    Incidentally, in most case you can just write this: "outm=matched#.fq" which will be translated to "outm=matched1.fq outm2=matched2.fq", to shorten the command lines. You can do that for input also.

    Comment


    • I was thinking out1= and out1m= would be the logical pair. Should have tried the other way.

      Can you add the info above to the original post on page 1 when you have a chance (I refer to the examples there when I am looking for something new and many probably look at that first post too)?

      Comment


      • OK, I updated it.

        Comment


        • I'd like to make a brief suggestion, thought it's obviously not important in the grand scheme.

          I thought it would be nice if there was a batch setting for BBDuk. Currently, I have 48 fastq files all with nice name (Sample1.fq, Sample2.fq, etc).

          It would be great if I could run something like:

          bbduk.sh -Xmx1g batchdir=/fastqDir in=<>.fq out=<>.clean.fq ref=adapters.fa ktrim=r k=28 mink=12 hdist=1

          Basically, if your FASTQ files organized and each will be treated identically, allow BBDuk to batch process them and change the input and output names automatically.

          Great tool!

          Comment


          • Just making a practical suggestion (not directly related to your request).

            Rather than sequentially processing your files via a single command, if you have access to a cluster write a simple shell/job submit script to do 48 parallel submissions so you can be more efficient.

            I won't be surprised if BBTools has such a setting already (that Brian may not have elaborated on so far ).

            Comment


            • No such setting yet But I will consider adding it when I have time; it would be pretty useful for me, too.

              Comment


              • Originally posted by Brian Bushnell View Post
                I typically use k=25, hdist=1 for adapter trimming and removal of other short synthetic contaminant sequences, just because some of them are very short (under 30bp). For long contaminant sequences (phiX, ecoli, etc) I always use k=31 to maximize the specificity; with 100bp reads you should get a match SOMEWHERE, especially if you allow a mismatch. Note that the "mm" or "maskmiddle" flag (which defaults to true for filtering) is roughly equivalent to allowing an additional +1 hamming distance for free.
                Hi Brian,

                A couple of quick questions. Since adapters might only have small fragments attached to the ends of the reads, shouldn't we use a smaller mink? The bbduk.sh help lists a default mink=6. Also, for adapter trimming should we use some hdist2? (Say hdist2=1)?

                Comment


                • Originally posted by ysnapus View Post
                  Hi Brian,

                  A couple of quick questions. Since adapters might only have small fragments attached to the ends of the reads, shouldn't we use a smaller mink? The bbduk.sh help lists a default mink=6. Also, for adapter trimming should we use some hdist2? (Say hdist2=1)?
                  The current version has default mink listed as 0 (meaning disabled); it's trimq that's 6 by default.

                  Also, this is not really explained very well in the help, so let me state this clearly:

                  If you only set "hdist", that value will also be applied to "hdist2".
                  If you specify both "hdist" and "hdist2", they will be set independently.

                  So:

                  "hdist=2" will use 2 for both long and short kmers; "hdist=2 hdist2=1" will use 2 for long kmers and 1 for short kmers. Therefore, you don't need to set hdist2 unless you want it to be different from hdist.

                  The only reason to have "mink=11" rather than lower is to prevent false-positive trimming. At mink=6 and hdist=1, you have roughly 0.5% chance of false-positive trimming if you are using exactly one adapter sequence, which is still pretty low. It just depends on your tolerance for adapter sequence - if it's important to get rid of as much as possible, then a lower value of mink is better.

                  Normally, if the reads are a paired fragment library, I add the "tbo tpe" flags. "tbo" does a good job of getting the last few bases of adapter sequence that are too short to detect using kmer matching, down to even 1bp of adapter sequence.

                  Comment


                  • I've started to experiment with BBDuk, and I would like some help understanding the output. Initially I understood "KTrimmed" as the results of kmer adapter removal, but after playing a bit it seems "KTrimmed" sums up kmer adapter removal and other trims, at least in some situations.

                    These tests are for 250bp MiSeq paired reads. BBDuk with minlen=51:

                    Code:
                    bbduk.sh -Xmx4g threads=4 in1=R1.fastq in2=R2.fastq out1=trimR1.fq out2=trimR2.fq outsingle=trimS.fq k=25 mink=12 hdist=1 ktrim=r ref=truseq.fa.gz,nextera.fa.gz minlen=51 tpe tbo
                    
                    BBDuk version 34.86
                    Set threads to 4
                    maskMiddle was disabled because useShortKmers=true
                    Initial:
                    Memory: free=4030m, used=86m
                    
                    Added 161221 kmers; time: 	0.185 seconds.
                    Memory: free=3794m, used=322m
                    
                    Input is being processed as paired
                    Started output streams:	0.009 seconds.
                    Processing time:   		28.745 seconds.
                    
                    Input:                  	2967156 reads 		542374420 bases.
                    KTrimmed:               	132538 reads (4.47%) 	6891444 bases (1.27%)
                    Trimmed by overlap:     	9464 reads (0.32%) 	342404 bases (0.06%)
                    Result:                 	2859656 reads (96.38%) 	535140572 bases (98.67%)
                    
                    Time:   			28.945 seconds.
                    Reads Processed:       2967k 	102.51k reads/sec
                    Bases Processed:        542m 	18.74m bases/sec
                    Then with minlen=201:

                    Code:
                    bbduk.sh -Xmx4g threads=4 in1=R1.fastq in2=R2.fastq out1=trimR1.fq out2=trimR2.fq outsingle=trimS.fq k=25 mink=12 hdist=1 ktrim=r ref=truseq.fa.gz,nextera.fa.gz minlen=201 tpe tbo
                    
                    BBDuk version 34.86
                    Set threads to 4
                    maskMiddle was disabled because useShortKmers=true
                    Initial:
                    Memory: free=4030m, used=86m
                    
                    Added 161221 kmers; time: 	0.185 seconds.
                    Memory: free=3794m, used=322m
                    
                    Input is being processed as paired
                    Started output streams:	0.011 seconds.
                    Processing time:   		23.263 seconds.
                    
                    Input:                  	2967156 reads 		542374420 bases.
                    KTrimmed:               	1506815 reads (50.78%) 	184149438 bases (33.95%)
                    Trimmed by overlap:     	8290 reads (0.28%) 	273347 bases (0.05%)
                    Result:                 	1463616 reads (49.33%) 	357740553 bases (65.96%)
                    
                    Time:   			23.465 seconds.
                    Reads Processed:       2967k 	126.45k reads/sec
                    Bases Processed:        542m 	23.11m bases/sec
                    Depending on minlen, "KTrimmed" reports very different numbers, but adapter contamination should be identical, am I correct? Wouldn't make more sense report reads removed for being too short under "QTrimmed", or maybe something as "LTrimmed"?

                    Finally, a warning to those unaware, and on the rare situation of using a computer with IBM Java (such as SUSE Enterprise): BBDuk (and GATK, Maven and some others) break under this Java, one has to install Oracle Java or OpenJava in order for them to work.

                    thanks for the fantastic tools, nep

                    Comment


                    • BBDuk tracks which operation lead to particular bases being removed. So, if you have no minlength, a 250bp read trimmed via kmers to 190bp will yield 60bp ktrimmed; or, if it had no matching kmers but had low quality so that it was quality-trimmed to 190bp, it would yield 60bp qtrimmed.

                      But let's say you set "minlen=200". If the length drops below the threshold, all bases will be considered trimmed, and the entire read will be associated with the reason it was eliminated - so a low-quality read will increment qtrimmed by +250, while a read with adapter bases at position 190 would increment ktrimmed by +250. Furthermore, in the default mode (which can be overridden with the "rieb=f" [removeifeitherbad=false] flag), if either read is trimmed below the threshold, both reads are discarded - so if minlen=200 and read 2 was trimmed to 190bp, in a 2x250bp run, both would be discarded. So, 500bp would be added to either ktrimmed or qtrimmed, whichever triggered the removal.

                      I prefer this over having an "ltrimmed" line for bases that were removed because the reads were too short, because in that case, you can't tell what triggered the removal. Of course, each way has advantages - if you are only doing a single operation, "ltrimmed" would be preferable - but I usually run BBDuk with multiple operations per pass.

                      Comment


                      • Originally posted by Brian Bushnell View Post
                        I prefer this over having an "ltrimmed" line for bases that were removed because the reads were too short, because in that case, you can't tell what triggered the removal. Of course, each way has advantages - if you are only doing a single operation, "ltrimmed" would be preferable - but I usually run BBDuk with multiple operations per pass.
                        Thanks for the response, I understand your reasoning. I guess I could call bbduk twice to do a "two-pass" trimming, first trimming by minlen only, then adapters+minlen. As BBTools are so fast it does not cost much anyway.

                        By the way, is tbo the same method as described in Front Genet. 2014; 5: 5?

                        Comment


                        • Yes, it looks like the concept is the same, though the implementation is different as BBDuk uses BBMerge's overlapper code.

                          Comment


                          • Dear Brian,

                            i just started using BBduk and am really happy with it.

                            There is just a question concerning RNASeq from the NextSeq500.

                            Which setting would you recommend for quality trimming, as the Phred scores seem quite high independent of the sample quality ? I have been trying trimq 15 so far.

                            Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?

                            Many thanks for your help
                            Last edited by IonTom; 05-14-2015, 02:08 AM.

                            Comment


                            • Originally posted by IonTom View Post
                              Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?
                              repair.sh from BBMap can separate the unpaired reads.

                              Code:
                              $ repair.sh in1=read_1.q in2=read_2.fq out1=fixed_1.fq out2=fixed_2.fq outsingle=single.fq

                              Comment


                              • Originally posted by IonTom View Post
                                There is just a question concerning RNASeq from the NextSeq500.

                                Which setting would you recommend for quality trimming, as the Phred scores seem quite high independent of the sample quality ? I have been trying trimq 15 so far.
                                It is difficult to effectively quality-trim NextSeq data because the quality scores are both binned and inaccurate (for V1; V2 is still binned, but more accurate). For V1 chemistry, on the raw reads, I would recommend 15, and that should be adequate. But it depends on what you are doing with the data. BBMap, for example, does not generally need any quality-trimming unless the data is really terrible.

                                If you want optimal quality-trimming for NextSeq data, and are willing to spend a little more time, and you have a reference transcriptome (for RNA-seq), I would suggest first recalibrating the quality scores, like this:

                                Code:
                                bbmap.sh ref=transcriptome.fa in=reads.fq out=mapped.sam reads=8m maxindel=200
                                
                                calctruequality.sh in=mapped.sam
                                
                                bbduk.sh in=reads.fq out=trimmed.fq recal qtrim=r trimq=15
                                In this case the quality will be recalibrated by BBDuk prior to trimming. If you choose to recalibrate quality, I highly recommend using the latest version of BBTools (34.94) as it has had some extensive improvements for recalibration recently, specifically for NextSeq.

                                Also one other question. Does BBduk filter unpaired reads ? Or is there a way to do this ?

                                Many thanks for your help
                                For paired data, your input reads should be properly ordered prior to processing with BBDuk, and they will still be properly ordered after processing - pairs will always be kept together. In situations where one read is trimmed shorter than minlength and the other is not, both will be discarded. You can save the longer of the two reads by setting the "outs" stream, which will catch singletons.

                                If your pairing was broken prior to running BBDuk, then follow Genomax's suggestion and repair them first.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                22 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X