I have communicated with Tony Bolger about this bug. He gave permission to post on SeqAnswers.
Basically I was doing trimming on some miRNAs but, incorrectly, using the TruSeq adapters. This uncovered a bug (or feature) of Trimmomatic for which Tony needs some advice.
As an example, my original sequence is:
When I use the TruSeq adapter as follows:
Then with a mismatch of 2 and score of 15 I get as the trimmed sequence.
Which is correct but is only due to happenstance. Note that the TruSeq adapter does not having the TGGAAT... sequence where the trimming begins. However the adapter and the original sequence do have a long match starting at the GAACTCC... part.
I'll let Tony explain:
Basically I was doing trimming on some miRNAs but, incorrectly, using the TruSeq adapters. This uncovered a bug (or feature) of Trimmomatic for which Tony needs some advice.
As an example, my original sequence is:
Code:
>original_2989:2125 TCTCATTCCATACATCGTCTGATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGTTTCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAA
Code:
>TruSeq_Adapter_Index_21-GTTTCG GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGATCTCGTATGCCGTCTTCTGCTTG
Code:
>truseq_trimmed TCTCATTCCATACATCGTCTGA
I'll let Tony explain:
OK - so what's happening is that, for some (arguably incorrect) reasons,
trimmomatic is considering the alignment sufficiently good despite the
relatively large number of mismatches in the first part of the adapter.
Normally, assuming mismatches are relatively reliable bases (i.e with
good Q scores) and distributed across the aligning region, they
massively reduce the alignment score, thus bringing it below the
threshold.
However, given your example, where all the mismatches towards one end of
the adapter, but the adapter itself is relatively long, the 'good match'
region is alone sufficient to pass the threshold - in effect, a local
alignment.
This is arguably incorrect for illumina clipping, since you would
normally expect at least the start of the adapter to be present, so
maybe it would make more sense to require the hit to consider all bases
from the start, thus causing early mismatches to have a more drastic
effect on the alignment score.
Another approach would be to leave it as a local alignment, but perform
the trimming from the start of the local alignment region, rather than
from the assumed start of adapter.
Another variant is to combine these - first consider from the start of
the adapter to the best aligning region, and trim if it's above the
threshold, otherwise try the local alignment with local trimming.
None of these changes are particularly difficult to make, so if you (and
others) have a strong preference on it, I can change it.
trimmomatic is considering the alignment sufficiently good despite the
relatively large number of mismatches in the first part of the adapter.
Normally, assuming mismatches are relatively reliable bases (i.e with
good Q scores) and distributed across the aligning region, they
massively reduce the alignment score, thus bringing it below the
threshold.
However, given your example, where all the mismatches towards one end of
the adapter, but the adapter itself is relatively long, the 'good match'
region is alone sufficient to pass the threshold - in effect, a local
alignment.
This is arguably incorrect for illumina clipping, since you would
normally expect at least the start of the adapter to be present, so
maybe it would make more sense to require the hit to consider all bases
from the start, thus causing early mismatches to have a more drastic
effect on the alignment score.
Another approach would be to leave it as a local alignment, but perform
the trimming from the start of the local alignment region, rather than
from the assumed start of adapter.
Another variant is to combine these - first consider from the start of
the adapter to the best aligning region, and trim if it's above the
threshold, otherwise try the local alignment with local trimming.
None of these changes are particularly difficult to make, so if you (and
others) have a strong preference on it, I can change it.
Comment