SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Introducing BBDuk: Adapter/Quality Trimming and Filtering Brian Bushnell Bioinformatics 308 11-13-2018 07:19 AM
Adapter trimming figo1019 RNA Sequencing 2 07-17-2018 05:00 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 05:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 08:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 01:53 PM

Reply
 
Thread Tools
Old 12-13-2015, 08:04 PM   #1
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default How do you specify error rate in BBduk adapter trimming?

Hello,

I'm using BBduk to trim adapter sequences from my reads. From the help manual I see the editdistance and hammingdistance options, but they set a fixed number of mismatches, independent of the length of the match. Is it possible to specify an error rate (like 10%) that allows errors in the match depending on the length of the match?

This feature is available in cutadapt, but I'd rather do it with bbduk if possible to shorten my pipeline.
antifolate is offline   Reply With Quote
Old 12-14-2015, 11:22 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

For Illumina reads, you don't need to worry about the edit distance, just hamming distance. BBDuk supports a variable hamming distance and variable kmer length, so the error rate would be the kmer length divided by number of mismatches. So, k=25 with hdist=2 would allow an 8% error rate. As would edist=2, if you need to allow edits.

With the "tbo" flag (for paired reads) the effect error rate allowed in the adapter portion is much higher (100%, technically), and with the "tpe" flag again a 100% error rate is allowed in 1 of 2 paired reads, so the error rate settings are not directly comparable since the methods are different.
Brian Bushnell is offline   Reply With Quote
Old 12-14-2015, 01:01 PM   #3
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default

Thanks Brian. I have a question about edit distance: if I allow say editdistance=2, would it catch potentially informative bases after the found adapter? For example, if I have the adapter "ADAPTER" and the sequences:

ADAPTERactg
ADAPTcERactg
cADtAPTERactg

would they become:

tg
ctg
actg

or does the edit distance only apply before the last base in the adapter? As you can tell, in the first and second examples, we lost part of the regular sequence. The last example is what I'd like to use the edit distance for.
antifolate is offline   Reply With Quote
Old 12-14-2015, 01:31 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Assuming you are doing left-trimming, then in the above example, yes, those results are correct. You would get tg, ctg, and actg respectively.
Brian Bushnell is offline   Reply With Quote
Old 12-14-2015, 01:32 PM   #5
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default

Thanks for the help.
antifolate is offline   Reply With Quote
Old 07-07-2016, 11:37 AM   #6
quattrinia
Junior Member
 
Location: Philadelphia

Join Date: Jun 2013
Posts: 5
Default

I have a question regarding adapter trimming and whether you have further recommendations for adapter removal?

I have truseq indexed adapters and MiSeq reads (2*300)

I used ktrim=r k=21 mink=11 hdist=2 tpe tbo with the following results.

Added 5659609 kmers; time: 2.105 seconds.
Memory: max=115011m, free=104210m, used=10801m

Input is being processed as paired
Started output streams: 0.022 seconds.
Processing time: 30.272 seconds.

Input: 5927766 reads 1784257566 bases.
KTrimmed: 1190226 reads (20.08%) 56818182 bases (3.18%)
Trimmed by overlap: 47342 reads (0.80%) 301726 bases (0.02%)
Result: 5927544 reads (100.00%) 1727137658 bases (96.80%)

Fastqc denotes that adapters have been removed, but kmer content is still fairly high. If I set k=15, approximately 50% of the reads are trimmed. This seems more appropriate, but I am also concerned that I am removing too much. Any thoughts?

Thanks in advance~
quattrinia is offline   Reply With Quote
Old 07-07-2016, 11:45 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

What kind of library/experiment is this? Also, what does the quality profile look like? Posting the FastQC results would be helpful, for example. Running BBMerge for an insert size histogram could also be useful, to see how many adapter sequences you should expect:
Code:
bbmerge.sh reads=1m in1=r1.fq in2=r2.fq ihist=ihist.txt vloose mininsert=15 outa=adapters.fa
Only pairs with insert size shorter than 300bp should contain adapters. If you run that, please post the screen output and both of the output files (adapters.fa and ihist.txt).

2x300 runs often have extremely low quality, particularly in low-diversity libraries like 16S amplicons. It's harder to detect adapters with a very high error rate, so you may need more aggressive settings, but K=15 and hdist=2 with the full set of Illumina adapter sequences will probably get you a lot of false positives.
Brian Bushnell is offline   Reply With Quote
Old 07-07-2016, 02:52 PM   #8
quattrinia
Junior Member
 
Location: Philadelphia

Join Date: Jun 2013
Posts: 5
Default

Hi,

Thanks for the quick reply. This is a denovo whole genome sequencing project of an invertebrate run with TruSeq adapters and a MiSeq Reagant kit 3. This species was multiplexed with 8 others. The indexed barcode for this species is TGACCA

Here is the outcome of the bbmerge
Pairs: 1000000
Joined: 539856 53.986%
Ambiguous: 419335 41.934%
No Solution: 40700 4.070%
Too Short: 109 0.011%

Avg Insert: 329.2
Standard Deviation: 87.7
Mode: 317

Insert range: 15 - 595
90th percentile: 441
75th percentile: 384
50th percentile: 326
25th percentile: 275
10th percentile: 233

Adapters
>Read1_adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAANCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAA
>Read2_adapter
N

Ihist is attached as well as QC output before trimming and adapter removing. Please let me know if I can provide anything else.
Attached Images
File Type: png kmer_profiles.png (50.2 KB, 6 views)
File Type: png adapter_content.png (11.5 KB, 6 views)
File Type: png per_base_quality_R2.png (13.6 KB, 12 views)
File Type: png per_base_quality.png (12.5 KB, 8 views)
Attached Files
File Type: txt cer_ihist.txt (4.7 KB, 4 views)
quattrinia is offline   Reply With Quote
Old 07-07-2016, 03:19 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh, that's not good. You have a lot of adapter sequence, but unfortunately the read quality dropped to zero at the end and is generally terrible. That will make the data very hard to use for a quality assembly. You'll definitely need to do extensive quality-trimming.

If this data was generated for by a paid facility, they should replace it for you at no cost; and if it was generated internally, Illumina should replace the reagents at no cost, because it fails their specifications (assuming nothing went wrong in library creation). If you really need to use it, though, try quality-trimming to Q15 or so (qtrim=rl trimq=15). You can do more aggressive adapter trimming if you want, with say "hdist=3 hdist2=2" to yield an overall command of:

Code:
bbduk.sh (files) ktrim=r k=21 mink=11 hdist=3 tpe tbo qtrim=r trimq=15
For the adapter sequences, I suggest you download the latest version of BBTools (36.14) and run BBMerge again; the version you are using is a little older and has a bug that was preventing read 2's adapter-sequence from being determined. Using the actual adapter sequences of your reads is better for specificity than using all of the possible adapter sequences that come bundled with BBMap. Normally this does not matter, but when you go all the way to hdist=3 specificity becomes more important.

Most of the adapter sequences will be removed via quality-trimming anyway, though; the reason they are not recognized is because the quality is low, so after the low-quality sequence is removed, only high-quality adapters would be present, which would be recognized.
Brian Bushnell is offline   Reply With Quote
Old 07-07-2016, 03:38 PM   #10
quattrinia
Junior Member
 
Location: Philadelphia

Join Date: Jun 2013
Posts: 5
Default

Thanks for the information. The average quality score of Q>30 for each sample was 59-70%. I have seen much worse than these data, so I thought this was OK! Thanks for the insight, I will definitely follow up.
quattrinia is offline   Reply With Quote
Old 07-07-2016, 03:56 PM   #11
quattrinia
Junior Member
 
Location: Philadelphia

Join Date: Jun 2013
Posts: 5
Default

Oh-And would you recommend quality trimming before adapter removal?
quattrinia is offline   Reply With Quote
Old 07-07-2016, 04:00 PM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

At the same time. When you tell BBDuk to do both in one pass, it does quality-trimming after adapter-trimming, which is optimal. If you do quality-trimming first (with left-trim, as in qtrim=rl) you can't use the tbo and tpe flags anymore because their assumptions are violated.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
adapter, bbduk, trimming

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:25 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO