  • Trimmomatic -- Tony needs advice

    I have communicated with Tony Bolger about this bug. He gave permission to post on SeqAnswers.

    Basically I was doing trimming on some miRNAs but, incorrectly, using the TruSeq adapters. This uncovered a bug (or feature) of Trimmomatic for which Tony needs some advice.

    As an example, my original sequence is:
    Code:
    >original_2989:2125
    TCTCATTCCATACATCGTCTGATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGTTTCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAA
    When I use the TruSeq adapter as follows:
    Code:
    >TruSeq_Adapter_Index_21-GTTTCG
    GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGATCTCGTATGCCGTCTTCTGCTTG
    Then, with a mismatch setting of 2 and a score threshold of 15, I get the following trimmed sequence:
    Code:
    >truseq_trimmed
    TCTCATTCCATACATCGTCTGA
    This is correct, but only by happenstance. Note that the TruSeq adapter does not contain the TGGAAT... sequence where the trimming begins. However, the adapter and the original sequence do share a long match starting at the GAACTCC... part.
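
    To make the coincidence concrete, here is a minimal Python sketch (not Trimmomatic code; the function name is made up for illustration) that finds the longest exact run shared by the read and the adapter above, and the adapter start position that this run implies.

```python
# Sketch only: locate the longest exact run shared by the read and the adapter.
READ = ("TCTCATTCCATACATCGTCTGATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGTTTCG"
        "ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAA")
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGATCTCGTATGCCGTCTTCTGCTTG"


def longest_shared_run(read, adapter):
    """Return (read_start, adapter_start, length) of the longest common substring."""
    best = (0, 0, 0)
    # prev[j] = length of the common run ending at read[i-1] and adapter[j-1]
    prev = [0] * (len(adapter) + 1)
    for i in range(1, len(read) + 1):
        curr = [0] * (len(adapter) + 1)
        for j in range(1, len(adapter) + 1):
            if read[i - 1] == adapter[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


read_start, adapter_start, length = longest_shared_run(READ, ADAPTER)
# If the adapter were present from its first base, it would start here in the read.
implied_adapter_start = read_start - adapter_start

print("shared run starts with:", READ[read_start:read_start + 7])
print("read position:", read_start, "adapter position:", adapter_start, "length:", length)
print("read kept when trimming at the implied adapter start:", READ[:implied_adapter_start])
```

    The shared run begins at GAACTCC..., at 0-based position 42 of the read and position 20 of the adapter, so the implied adapter start is position 22 of the read. That happens to be exactly where TGGAAT... begins, which is why the trimmed result looks right.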

    I'll let Tony explain:

    OK - so what's happening is that, for some (arguably incorrect) reasons,
    trimmomatic is considering the alignment sufficiently good despite the
    relatively large number of mismatches in the first part of the adapter.

    Normally, assuming mismatches are relatively reliable bases (i.e. with
    good Q scores) and distributed across the aligning region, they
    massively reduce the alignment score, thus bringing it below the
    threshold.

    However, given your example, where all the mismatches are towards one end of
    the adapter, but the adapter itself is relatively long, the 'good match'
    region is alone sufficient to pass the threshold - in effect, a local
    alignment.

    This is arguably incorrect for Illumina clipping, since you would
    normally expect at least the start of the adapter to be present, so
    maybe it would make more sense to require the hit to consider all bases
    from the start, thus causing early mismatches to have a more drastic
    effect on the alignment score.

    Another approach would be to leave it as a local alignment, but perform
    the trimming from the start of the local alignment region, rather than
    from the assumed start of adapter.

    Another variant is to combine these - first consider from the start of
    the adapter to the best aligning region, and trim if it's above the
    threshold, otherwise try the local alignment with local trimming.

    None of these changes are particularly difficult to make, so if you (and
    others) have a strong preference on it, I can change it.
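
    The options Tony lists can be compared on the example read with a small Python sketch. Everything here is an assumption for illustration, not Trimmomatic's actual code: the scoring is a toy scheme (a bonus of log10(4) per match, a penalty of Q/10 per mismatch with every base assumed to be Q30, since the post gives no quality values), the function names are invented, and "from the start of the adapter to the best aligning region" is read as "up to the end of the best region".

```python
import math

READ = ("TCTCATTCCATACATCGTCTGATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGTTTCG"
        "ATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAA")
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGATCTCGTATGCCGTCTTCTGCTTG"

MATCH = math.log10(4)   # assumed per-match bonus (about 0.602)
MISMATCH = 30 / 10      # assumed per-mismatch penalty for a Q30 base
THRESHOLD = 15          # score threshold used in the post


def column_scores(read, adapter, offset):
    """Per-base scores with adapter[0] placed at read[offset]."""
    n = min(len(adapter), len(read) - offset)
    return [MATCH if read[offset + k] == adapter[k] else -MISMATCH for k in range(n)]


def best_local(scores):
    """Kadane's algorithm: (score, start, end) of the best contiguous region."""
    best = (0.0, 0, 0)
    run, run_start = 0.0, 0
    for k, s in enumerate(scores):
        if run <= 0:
            run, run_start = 0.0, k
        run += s
        if run > best[0]:
            best = (run, run_start, k + 1)
    return best


def best_offset(read, adapter):
    """Adapter placement with the highest local score."""
    return max(range(len(read)),
               key=lambda off: best_local(column_scores(read, adapter, off))[0])


def trim_current(read, adapter, offset):
    """Behaviour described in the post: local score, trim at the assumed adapter start."""
    score, _, _ = best_local(column_scores(read, adapter, offset))
    return offset if score >= THRESHOLD else None


def trim_anchored(read, adapter, offset):
    """Option 1: score every base from the adapter start to the end of the best region."""
    scores = column_scores(read, adapter, offset)
    _, _, end = best_local(scores)
    return offset if sum(scores[:end]) >= THRESHOLD else None


def trim_local(read, adapter, offset):
    """Option 2: local score, trim from where the local match begins."""
    score, start, _ = best_local(column_scores(read, adapter, offset))
    return offset + start if score >= THRESHOLD else None


def trim_combined(read, adapter, offset):
    """Option 3: try the anchored rule first, then fall back to local trimming."""
    pos = trim_anchored(read, adapter, offset)
    return pos if pos is not None else trim_local(read, adapter, offset)


offset = best_offset(READ, ADAPTER)
for rule in (trim_current, trim_anchored, trim_local, trim_combined):
    pos = rule(READ, ADAPTER, offset)
    kept = READ if pos is None else READ[:pos]
    print(f"{rule.__name__:14s} trim at {pos}: {kept}")
```

    Under these assumed constants, the anchored rule rejects the hit because the 18 early mismatches outweigh the matches, while the local and combined rules trim at position 42 and so keep the TGGAAT... bases that the TruSeq adapter does not cover. Different quality values or penalties would shift the scores, so treat the numbers as illustrative only.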

  • #2
    Well, no one has responded (at least publicly) to Tony's request for advice on how to change Trimmomatic. So, to get my thoughts on record and to bring this post back to the fore, here is my idea.

    All three methods are good choices, and making the choice of method a command-line parameter would be useful. However, if I had to choose one method, it would be the local alignment with local trimming: that is, trim only from the point where the adapter bases actually match the bases being trimmed.
