Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do you specify error rate in BBduk adapter trimming?

    Hello,

    I'm using BBduk to trim adapter sequences from my reads. From the help manual I see the editdistance and hammingdistance options, but they set a fixed number of mismatches, independent of the length of the match. Is it possible to specify an error rate (like 10%) that allows errors in the match depending on the length of the match?

    This feature is available in cutadapt, but I'd rather do it with bbduk if possible to shorten my pipeline.

  • #2
    For Illumina reads, you don't need to worry about the edit distance, just hamming distance. BBDuk supports a variable hamming distance and variable kmer length, so the error rate would be the kmer length divided by number of mismatches. So, k=25 with hdist=2 would allow an 8% error rate. As would edist=2, if you need to allow edits.

    With the "tbo" flag (for paired reads) the effect error rate allowed in the adapter portion is much higher (100%, technically), and with the "tpe" flag again a 100% error rate is allowed in 1 of 2 paired reads, so the error rate settings are not directly comparable since the methods are different.

    Comment


    • #3
      Thanks Brian. I have a question about edit distance: if I allow say editdistance=2, would it catch potentially informative bases after the found adapter? For example, if I have the adapter "ADAPTER" and the sequences:

      ADAPTERactg
      ADAPTcERactg
      cADtAPTERactg

      would they become:

      tg
      ctg
      actg

      or does the edit distance only apply before the last base in the adapter? As you can tell, in the first and second examples, we lost part of the regular sequence. The last example is what I'd like to use the edit distance for.

      Comment


      • #4
        Assuming you are doing left-trimming, then in the above example, yes, those results are correct. You would get tg, ctg, and actg respectively.

        Comment


        • #5
          Thanks for the help.

          Comment


          • #6
            I have a question regarding adapter trimming and whether you have further recommendations for adapter removal?

            I have truseq indexed adapters and MiSeq reads (2*300)

            I used ktrim=r k=21 mink=11 hdist=2 tpe tbo with the following results.

            Added 5659609 kmers; time: 2.105 seconds.
            Memory: max=115011m, free=104210m, used=10801m

            Input is being processed as paired
            Started output streams: 0.022 seconds.
            Processing time: 30.272 seconds.

            Input: 5927766 reads 1784257566 bases.
            KTrimmed: 1190226 reads (20.08%) 56818182 bases (3.18%)
            Trimmed by overlap: 47342 reads (0.80%) 301726 bases (0.02%)
            Result: 5927544 reads (100.00%) 1727137658 bases (96.80%)

            Fastqc denotes that adapters have been removed, but kmer content is still fairly high. If I set k=15, approximately 50% of the reads are trimmed. This seems more appropriate, but I am also concerned that I am removing too much. Any thoughts?

            Thanks in advance~

            Comment


            • #7
              What kind of library/experiment is this? Also, what does the quality profile look like? Posting the FastQC results would be helpful, for example. Running BBMerge for an insert size histogram could also be useful, to see how many adapter sequences you should expect:
              Code:
              bbmerge.sh reads=1m in1=r1.fq in2=r2.fq ihist=ihist.txt vloose mininsert=15 outa=adapters.fa
              Only pairs with insert size shorter than 300bp should contain adapters. If you run that, please post the screen output and both of the output files (adapters.fa and ihist.txt).

              2x300 runs often have extremely low quality, particularly in low-diversity libraries like 16S amplicons. It's harder to detect adapters with a very high error rate, so you may need more aggressive settings, but K=15 and hdist=2 with the full set of Illumina adapter sequences will probably get you a lot of false positives.

              Comment


              • #8
                Hi,

                Thanks for the quick reply. This is a denovo whole genome sequencing project of an invertebrate run with TruSeq adapters and a MiSeq Reagant kit 3. This species was multiplexed with 8 others. The indexed barcode for this species is TGACCA

                Here is the outcome of the bbmerge
                Pairs: 1000000
                Joined: 539856 53.986%
                Ambiguous: 419335 41.934%
                No Solution: 40700 4.070%
                Too Short: 109 0.011%

                Avg Insert: 329.2
                Standard Deviation: 87.7
                Mode: 317

                Insert range: 15 - 595
                90th percentile: 441
                75th percentile: 384
                50th percentile: 326
                25th percentile: 275
                10th percentile: 233

                Adapters
                >Read1_adapter
                AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAANCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAA
                >Read2_adapter
                N

                Ihist is attached as well as QC output before trimming and adapter removing. Please let me know if I can provide anything else.
                Attached Files

                Comment


                • #9
                  Oh, that's not good. You have a lot of adapter sequence, but unfortunately the read quality dropped to zero at the end and is generally terrible. That will make the data very hard to use for a quality assembly. You'll definitely need to do extensive quality-trimming.

                  If this data was generated for by a paid facility, they should replace it for you at no cost; and if it was generated internally, Illumina should replace the reagents at no cost, because it fails their specifications (assuming nothing went wrong in library creation). If you really need to use it, though, try quality-trimming to Q15 or so (qtrim=rl trimq=15). You can do more aggressive adapter trimming if you want, with say "hdist=3 hdist2=2" to yield an overall command of:

                  Code:
                  bbduk.sh (files) ktrim=r k=21 mink=11 hdist=3 tpe tbo qtrim=r trimq=15
                  For the adapter sequences, I suggest you download the latest version of BBTools (36.14) and run BBMerge again; the version you are using is a little older and has a bug that was preventing read 2's adapter-sequence from being determined. Using the actual adapter sequences of your reads is better for specificity than using all of the possible adapter sequences that come bundled with BBMap. Normally this does not matter, but when you go all the way to hdist=3 specificity becomes more important.

                  Most of the adapter sequences will be removed via quality-trimming anyway, though; the reason they are not recognized is because the quality is low, so after the low-quality sequence is removed, only high-quality adapters would be present, which would be recognized.

                  Comment


                  • #10
                    Thanks for the information. The average quality score of Q>30 for each sample was 59-70%. I have seen much worse than these data, so I thought this was OK! Thanks for the insight, I will definitely follow up.

                    Comment


                    • #11
                      Oh-And would you recommend quality trimming before adapter removal?

                      Comment


                      • #12
                        At the same time. When you tell BBDuk to do both in one pass, it does quality-trimming after adapter-trimming, which is optimal. If you do quality-trimming first (with left-trim, as in qtrim=rl) you can't use the tbo and tpe flags anymore because their assumptions are violated.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        9 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        49 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        67 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X