![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Introducing BBDuk: Adapter/Quality Trimming and Filtering | Brian Bushnell | Bioinformatics | 335 | 10-29-2020 03:23 AM |
Adapter trimming | figo1019 | RNA Sequencing | 2 | 07-17-2018 05:00 AM |
Adapter trimming and trimming by quality question | alisrpp | Bioinformatics | 5 | 04-08-2013 05:55 PM |
adapter trimming - help | a_mt | Bioinformatics | 6 | 11-12-2012 08:36 PM |
3' Adapter Trimming | caddymob | Bioinformatics | 0 | 05-27-2009 01:53 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Chesterfield, MO Join Date: Aug 2015
Posts: 52
|
![]()
Hello,
I'm using BBduk to trim adapter sequences from my reads. From the help manual I see the editdistance and hammingdistance options, but they set a fixed number of mismatches, independent of the length of the match. Is it possible to specify an error rate (like 10%) that allows errors in the match depending on the length of the match? This feature is available in cutadapt, but I'd rather do it with bbduk if possible to shorten my pipeline. |
![]() |
![]() |
![]() |
#2 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
For Illumina reads, you don't need to worry about the edit distance, just hamming distance. BBDuk supports a variable hamming distance and variable kmer length, so the error rate would be the kmer length divided by number of mismatches. So, k=25 with hdist=2 would allow an 8% error rate. As would edist=2, if you need to allow edits.
With the "tbo" flag (for paired reads) the effect error rate allowed in the adapter portion is much higher (100%, technically), and with the "tpe" flag again a 100% error rate is allowed in 1 of 2 paired reads, so the error rate settings are not directly comparable since the methods are different. |
![]() |
![]() |
![]() |
#3 |
Member
Location: Chesterfield, MO Join Date: Aug 2015
Posts: 52
|
![]()
Thanks Brian. I have a question about edit distance: if I allow say editdistance=2, would it catch potentially informative bases after the found adapter? For example, if I have the adapter "ADAPTER" and the sequences:
ADAPTERactg ADAPTcERactg cADtAPTERactg would they become: tg ctg actg or does the edit distance only apply before the last base in the adapter? As you can tell, in the first and second examples, we lost part of the regular sequence. The last example is what I'd like to use the edit distance for. |
![]() |
![]() |
![]() |
#4 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Assuming you are doing left-trimming, then in the above example, yes, those results are correct. You would get tg, ctg, and actg respectively.
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Chesterfield, MO Join Date: Aug 2015
Posts: 52
|
![]()
Thanks for the help.
|
![]() |
![]() |
![]() |
#6 |
Junior Member
Location: Philadelphia Join Date: Jun 2013
Posts: 5
|
![]()
I have a question regarding adapter trimming and whether you have further recommendations for adapter removal?
I have truseq indexed adapters and MiSeq reads (2*300) I used ktrim=r k=21 mink=11 hdist=2 tpe tbo with the following results. Added 5659609 kmers; time: 2.105 seconds. Memory: max=115011m, free=104210m, used=10801m Input is being processed as paired Started output streams: 0.022 seconds. Processing time: 30.272 seconds. Input: 5927766 reads 1784257566 bases. KTrimmed: 1190226 reads (20.08%) 56818182 bases (3.18%) Trimmed by overlap: 47342 reads (0.80%) 301726 bases (0.02%) Result: 5927544 reads (100.00%) 1727137658 bases (96.80%) Fastqc denotes that adapters have been removed, but kmer content is still fairly high. If I set k=15, approximately 50% of the reads are trimmed. This seems more appropriate, but I am also concerned that I am removing too much. Any thoughts? Thanks in advance~ |
![]() |
![]() |
![]() |
#7 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
What kind of library/experiment is this? Also, what does the quality profile look like? Posting the FastQC results would be helpful, for example. Running BBMerge for an insert size histogram could also be useful, to see how many adapter sequences you should expect:
Code:
bbmerge.sh reads=1m in1=r1.fq in2=r2.fq ihist=ihist.txt vloose mininsert=15 outa=adapters.fa 2x300 runs often have extremely low quality, particularly in low-diversity libraries like 16S amplicons. It's harder to detect adapters with a very high error rate, so you may need more aggressive settings, but K=15 and hdist=2 with the full set of Illumina adapter sequences will probably get you a lot of false positives. |
![]() |
![]() |
![]() |
#8 |
Junior Member
Location: Philadelphia Join Date: Jun 2013
Posts: 5
|
![]()
Hi,
Thanks for the quick reply. This is a denovo whole genome sequencing project of an invertebrate run with TruSeq adapters and a MiSeq Reagant kit 3. This species was multiplexed with 8 others. The indexed barcode for this species is TGACCA Here is the outcome of the bbmerge Pairs: 1000000 Joined: 539856 53.986% Ambiguous: 419335 41.934% No Solution: 40700 4.070% Too Short: 109 0.011% Avg Insert: 329.2 Standard Deviation: 87.7 Mode: 317 Insert range: 15 - 595 90th percentile: 441 75th percentile: 384 50th percentile: 326 25th percentile: 275 10th percentile: 233 Adapters >Read1_adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAANCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAA >Read2_adapter N Ihist is attached as well as QC output before trimming and adapter removing. Please let me know if I can provide anything else. |
![]() |
![]() |
![]() |
#9 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Oh, that's not good. You have a lot of adapter sequence, but unfortunately the read quality dropped to zero at the end and is generally terrible. That will make the data very hard to use for a quality assembly. You'll definitely need to do extensive quality-trimming.
If this data was generated for by a paid facility, they should replace it for you at no cost; and if it was generated internally, Illumina should replace the reagents at no cost, because it fails their specifications (assuming nothing went wrong in library creation). If you really need to use it, though, try quality-trimming to Q15 or so (qtrim=rl trimq=15). You can do more aggressive adapter trimming if you want, with say "hdist=3 hdist2=2" to yield an overall command of: Code:
bbduk.sh (files) ktrim=r k=21 mink=11 hdist=3 tpe tbo qtrim=r trimq=15 Most of the adapter sequences will be removed via quality-trimming anyway, though; the reason they are not recognized is because the quality is low, so after the low-quality sequence is removed, only high-quality adapters would be present, which would be recognized. |
![]() |
![]() |
![]() |
#10 |
Junior Member
Location: Philadelphia Join Date: Jun 2013
Posts: 5
|
![]()
Thanks for the information. The average quality score of Q>30 for each sample was 59-70%. I have seen much worse than these data, so I thought this was OK! Thanks for the insight, I will definitely follow up.
|
![]() |
![]() |
![]() |
#11 |
Junior Member
Location: Philadelphia Join Date: Jun 2013
Posts: 5
|
![]()
Oh-And would you recommend quality trimming before adapter removal?
|
![]() |
![]() |
![]() |
#12 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
At the same time. When you tell BBDuk to do both in one pass, it does quality-trimming after adapter-trimming, which is optimal. If you do quality-trimming first (with left-trim, as in qtrim=rl) you can't use the tbo and tpe flags anymore because their assumptions are violated.
|
![]() |
![]() |
![]() |
Tags |
adapter, bbduk, trimming |
Thread Tools | |
|
|