Hi,
I'm getting up to speed with Marcel Martin's cutadapt for removing adapter sequences from Illumina libraries.
I'd appreciate some expert input on what values are reasonable for the -O (--overlap) parameter. As explained in the excellent help pages, this parameter defines the minimum overlap between the adapter sequence and the read sequence:
Code:
-O LENGTH, --overlap=LENGTH
Minimum overlap length. If the overlap between the read and the adapter is
shorter than LENGTH, the read is not modified. This reduces the no. of bases
trimmed purely due to short random adapter matches (default: 3).
My initial run used the default (-O 3). The trimming report looks like this:
Code:
Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC', length 33, was trimmed 205113 times.
Histogram of adapter lengths
length count
3 95613
4 12912
5 5869
6 4173
7 3809
8 3323
9 3183
10 2934
11 2938
12 2464
13 2213
14 2041
15 1803
16 1650
17 1489
18 1334
19 1269
20 1147
21 1031
22 909
23 859
24 819
25 701
26 656
27 536
28 528
29 488
30 386
31 368
32 351
33 47317
I can see that the default is not sensible for this library.
At length 3, 95613 reads were trimmed; a good proportion of these matches must be spurious (i.e. due to chance).
At length 33 (the full adapter), 47317 reads were trimmed. These are far less likely to be spurious.
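To put a rough number on "by chance", here is a back-of-envelope sketch of my own (not anything taken from cutadapt). It assumes all four bases are equally likely, that only exact matches count, and that a match of length k at the 3' end of a read therefore occurs by chance with probability (1/4)^k. The read total below is a hypothetical placeholder, since the report excerpt above does not show it, and both function names are made up for illustration. Because cutadapt also tolerates some mismatches, the real chance rate at longer overlaps will be somewhat higher than this estimate.

```python
def expected_chance_matches(n_reads: int, overlap: int) -> float:
    """Expected number of reads whose last `overlap` bases match the start
    of the adapter purely by chance (uniform bases, exact matches only)."""
    return n_reads * 0.25 ** overlap


def smallest_overlap(n_reads: int, max_chance_hits: float, adapter_len: int = 33) -> int:
    """Smallest overlap for which the expected number of chance matches
    is at or below `max_chance_hits`; capped at the adapter length."""
    for k in range(1, adapter_len + 1):
        if expected_chance_matches(n_reads, k) <= max_chance_hits:
            return k
    return adapter_len


if __name__ == "__main__":
    n_reads = 6_000_000  # hypothetical library size, NOT taken from the report
    for k in range(3, 13):
        print(k, round(expected_chance_matches(n_reads, k), 1))
    print("overlap for <= 1 expected chance hit:", smallest_overlap(n_reads, 1.0))
```

Comparing these expected counts with the observed histogram gives a feel for where the observed counts start to exceed what chance alone would produce, which is one possible rationale for picking -O.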
Somewhere between these boundaries (3 .. 33) there must be a 'sensible' value for --overlap. How do I identify it, and using what rationale? I could just opt for 33, the full length of the adapter. But that would discount the possibility that matches of length 32 are also genuine (this library jumps abruptly from 351 reads at length 32 to 47317 at length 33, but not all my libraries look like this).
How might I go about this?
TIA
mgg