12-29-2011, 04:28 AM   #1
Location: London, UK

Join Date: Nov 2011
Posts: 12
cutadapt: guidance on rationale for --overlap=LENGTH values


I'm getting up to speed with Marcel Martin's cutadapt for removing adapter sequences from Illumina libraries.

I'd appreciate some expert input on what values might be reasonable for the -O (--overlap) parameter. As explained in the excellent help pages, this parameter defines the minimum overlap between the input adapter sequence and the read sequence:
  -O LENGTH, --overlap=LENGTH
       Minimum overlap length. If the overlap between the read and the
       adapter is shorter than LENGTH, the read is not modified. This
       reduces the no. of bases trimmed purely due to short random
       adapter matches (default: 3).
My initial run has used the default. I get trim data like this:

Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC', length 33, was trimmed 205113 times.

Histogram of adapter lengths
length  count
3       95613
4       12912
5       5869
6       4173
7       3809
8       3323
9       3183
10      2934
11      2938
12      2464
13      2213
14      2041
15      1803
16      1650
17      1489
18      1334
19      1269
20      1147
21      1031
22      909
23      859
24      819
25      701
26      656
27      536
28      528
29      488
30      386
31      368
32      351
33      47317
I can see that the default is not sensible here.

At length 3, 95613 reads were trimmed; a good proportion of these matches must be spurious (i.e. due to chance).
At length 33, 47317 reads were trimmed. These are far more likely to be genuine adapter matches.

Somewhere between these boundaries (3 .. 33) there must be a 'sensible' value for --overlap. How do I identify it, and using what rationale? I could just opt for 33, the length of the adapter. But that would discount the possibility that matches of length 32 are also genuine (this library has an abrupt jump from 351 reads at length 32 to 47317 at length 33, but not all my libraries look like this).
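One way to frame the choice is to tally, for each candidate -O value, how many of the reported trimming events would still take place. The following is a minimal Python sketch using the histogram above. The names `HISTOGRAM` and `events_kept` are made up for illustration, and the sketch assumes that raising the minimum overlap simply drops the shorter matches and leaves the longer ones unchanged.

```python
# Histogram copied from the cutadapt report above: match length -> count.
HISTOGRAM = {
    3: 95613, 4: 12912, 5: 5869, 6: 4173, 7: 3809, 8: 3323, 9: 3183,
    10: 2934, 11: 2938, 12: 2464, 13: 2213, 14: 2041, 15: 1803, 16: 1650,
    17: 1489, 18: 1334, 19: 1269, 20: 1147, 21: 1031, 22: 909, 23: 859,
    24: 819, 25: 701, 26: 656, 27: 536, 28: 528, 29: 488, 30: 386,
    31: 368, 32: 351, 33: 47317,
}


def events_kept(histogram, min_overlap):
    """Count trimming events whose match length is at least min_overlap."""
    return sum(count for length, count in histogram.items()
               if length >= min_overlap)


total = events_kept(HISTOGRAM, 3)  # everything reported with the default -O 3
for min_overlap in (3, 4, 10, 20, 33):
    kept = events_kept(HISTOGRAM, min_overlap)
    print(f"-O {min_overlap:2d}: {kept:6d} trimming events "
          f"({kept / total:.1%} of the default run)")
```

The total at -O 3 comes to 205113, which agrees with the "was trimmed 205113 times" line in the report.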

How might I go about this?

mgg
01-05-2012, 09:23 AM   #2
Location: Stockholm

Join Date: Aug 2009
Posts: 75

My strategy so far has been not to worry too much about the bases that get lost due to random matches. It depends on your data, but although 95613 looks large, you lose “only” 95613 × 3 bp, which may not be that bad.
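To put a number on that, here is a small Python sketch of the arithmetic. The helper name `bases_trimmed` is made up for illustration; the counts are taken from the histogram in the first post.

```python
def bases_trimmed(match_length: int, count: int) -> int:
    """Bases removed by `count` trimming events of a given match length."""
    return match_length * count


short_matches = bases_trimmed(3, 95613)   # the length-3 bin
full_matches = bases_trimmed(33, 47317)   # the full-length adapter bin

print(f"length  3: {short_matches} bp trimmed")
print(f"length 33: {full_matches} bp trimmed")
```

So the length-3 bin accounts for 286839 bp, compared with 1561461 bp for the full-length matches.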

However, the “count” column in your histogram decreases almost monotonically from length 3 to 32. This is different from what I see in my data. One explanation is that your adapter almost never appears partially – it's either fully there or not at all, and all matches from length 3 to 32 are, in fact, spurious. In that case, you can safely set --overlap to 33.
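For reference, a hedged Python sketch of what such a run could look like when the command line is assembled from a script. The file names `reads.fastq` and `trimmed.fastq` and the helper `cutadapt_command` are hypothetical, and it assumes the adapter was supplied as a 3' adapter with -a, which the first post does not state.

```python
import shlex

ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def cutadapt_command(adapter, min_overlap, input_fastq, output_fastq):
    """Build (but do not run) a cutadapt command line as an argument list."""
    return ["cutadapt", "-a", adapter, "-O", str(min_overlap),
            "-o", output_fastq, input_fastq]


# Require the full adapter length (33) as the minimum overlap.
cmd = cutadapt_command(ADAPTER, len(ADAPTER), "reads.fastq", "trimmed.fastq")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.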

I'll probably change the output that cutadapt prints to make this all a bit clearer. It might help to print the number of bases removed and to give an estimate of how many of those were removed due to chance alone.
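A rough version of such an estimate can be sketched in Python. This is only a back-of-the-envelope model, not what cutadapt computes: it assumes the four bases are equally likely and independent, that only exact matches count, and that a read ends in the first k bases of the adapter with probability 0.25^k. The library size `N_READS` is a hypothetical placeholder, since the thread does not give the real number of reads.

```python
def expected_chance_matches(n_reads: int, match_length: int) -> float:
    """Expected number of reads matching `match_length` adapter bases by chance.

    Assumes uniform, independent bases and exact matching only.
    """
    return n_reads * 0.25 ** match_length


N_READS = 6_400_000  # hypothetical library size, not taken from the thread

for k in (3, 4, 5, 10):
    expected = expected_chance_matches(N_READS, k)
    print(f"length {k:2d}: about {expected:.1f} chance matches expected")
```

Under this model the expected count drops fourfold with every extra base, so comparing it against the observed histogram gives a feel for the length at which chance matches stop dominating.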
mmartin