  • Adapter_and_kmer_trimming

    Hi everyone,

    I am using publicly available 51 bp paired-end RNA-seq data, and I have some questions about quality trimming the data before passing them to TopHat2 for mapping.

    I do not know which adapters were used, so I ran FastQC and then trim_galore to remove the default Illumina adapter "AGATCGGAAGAGC" and one overrepresented sequence, "CTTTGTGTTTGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT".

    It is really important to remove as much adapter contamination as possible, because my analysis is about discovering variants that may correspond to RNA editing rather than studying gene expression.

    So my questions are:

    1) FastQC still reports 3 kmers, with counts ranging from 500 to 1800, that can be found within Illumina adapters. They are reported in the middle of the reads (positions 20, 34 and 41). Each one is found in a different adapter and in an RNA PCR Index primer.

    Should I use trim_galore to remove these kmers from my reads?

    kmers in the FastQC report that are found in Illumina adapters, marked in red boxes:

    [Image: SRR1524292_1_val_1_fastqc_kmers.png]

    2) I have already used trim_galore to remove these kmers, and trimmed the reads to improve the Per base sequence content.

    However, the kmer GTACGTA still appears in my FastQC report, and this kmer can be found in the TruSeq Adapter, Index 22. This adapter begins with GATCGGAAGAGC and should have been removed during the first step, trim_galore --illumina.

    Should this kmer be removed as well?

    In general, is it possible for kmers from the reads to match Illumina adapters purely by chance?

    3) After applying trim_galore --illumina, the Per base sequence content at the 3' end of the reads starts to diverge, and it gets worse every time I remove another sequence.

    Is this because the reads now have different lengths, since some are trimmed more than others? (Read length goes from a fixed 51 to a range of 20-51.)

    Should I trim the 3' end of the reads in this case?

    Data before trimming: [Image: Per_base_sequence_content_SRR1524292_1.png]

    Data after trim_galore --illumina: [Image: Per_base_sequence_content_SRR1524292_1_after_trim_galore--illumina.png]

    Data after trim_galore a) --illumina, b) kmers and c) overrepresented sequence: [Image: Per_base_sequence_content_after_removing_--illumina_kmers_overrepresented_seq.png]

    4) My last question: is TopHat2 going to have a problem aligning paired-end reads with lengths ranging from 20 to 32?

  • #2
    I would recommend trying BBDuk from the BBMap package. @Brian ships all the common adapters you are likely to run into in the "resources" directory of the BBMap download, so they are all scanned for at the same time without you having to supply them one by one.

    I would not worry about the kmers (unless you see an issue after alignment), since they may be a real part of the data.
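To put a rough number on the "found in adapters by chance" question, here is a minimal Python sketch. It is only an illustration, not part of FastQC or any trimming tool: the function names are made up for this example, the adapter in the usage lines is a toy sequence (not a real Illumina adapter), and the probability assumes uniformly random bases and independent windows, which real transcripts do not satisfy.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def adapters_containing(kmer, adapters):
    """Names of the adapters that contain the k-mer on either strand.

    `adapters` is a dict of name -> sequence that you supply yourself,
    e.g. parsed from an adapter FASTA file.
    """
    kmer = kmer.upper()
    rc = reverse_complement(kmer)
    return [
        name
        for name, seq in adapters.items()
        if kmer in seq.upper() or rc in seq.upper()
    ]


def chance_of_kmer(k, seq_len):
    """Approximate probability that one specific k-mer occurs at least once
    in a random sequence of length seq_len (one strand only).

    Assumes every base is equally likely and treats the overlapping
    windows as independent, so this is a back-of-the-envelope figure.
    """
    windows = seq_len - k + 1
    if windows <= 0:
        return 0.0
    return 1.0 - (1.0 - 0.25 ** k) ** windows


# Toy usage: a made-up adapter that happens to contain GTACGTA.
toy_adapters = {"toy_adapter": "GATCGGAAGAGCGTACGTA"}
print(adapters_containing("GTACGTA", toy_adapters))
print(chance_of_kmer(7, 63))  # one 7-mer vs. one 63 bp sequence
```

The per-adapter chance for a single 7-mer is small, but it grows with the number of k-mers reported, the number of adapter and primer sequences searched, and with checking both strands, so an occasional hit is not by itself proof of contamination.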

    Since you are using TopHat, you could also try BBMap (it is splice-aware) as an alternative aligner.


    • #3
      Thank you GenoMax, I will be sure to check out BBMap.

      "I would not worry about the kmers (unless you see an issue after alignment) since they may be real part of the data."

      So your opinion is that the presence of at least some of these kmers in Illumina adapters is random? Or that it will pose no problem?

      • #4
        TrimGalore appears to be trimming the ends of the reads too aggressively, apparently trimming whenever only a few bp at the very end match the adapter, or something similar. BBDuk's recommended setting of "mink=11" avoids this by requiring a match of at least 11 bp at the end of the read. The histogram of the raw data showed no evidence of adapter contamination, but I still recommend trimming, since there is always some. Just not with such aggressive settings, as they will introduce bias.
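The effect of the minimum end overlap can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not how TrimGalore or BBDuk are implemented: `trim_3prime` and `fraction_trimmed` are hypothetical helpers written for this example, matching is exact (no mismatches allowed), and the reads are uniformly random sequences that contain no adapter at all, so every trimmed read is a false positive. The adapter is the "AGATCGGAAGAGC" sequence from the first post.

```python
import random

ADAPTER = "AGATCGGAAGAGC"  # default Illumina adapter quoted in post #1


def trim_3prime(read, adapter, min_overlap):
    """Remove a 3' adapter using exact matching.

    The read is cut at the first position where the rest of the read
    matches the start of the adapter, provided at least `min_overlap`
    bases match. The adapter is allowed to run off the end of the read.
    """
    for start in range(len(read)):
        tail = read[start:]
        n = min(len(tail), len(adapter))
        if n < min_overlap:
            break  # remaining tails are too short to count as a match
        if tail[:n] == adapter[:n]:
            return read[:start]
    return read


def fraction_trimmed(min_overlap, n_reads=2000, read_len=51, seed=0):
    """Fraction of adapter-free random reads that still get trimmed."""
    rng = random.Random(seed)
    trimmed = 0
    for _ in range(n_reads):
        read = "".join(rng.choice("ACGT") for _ in range(read_len))
        if trim_3prime(read, ADAPTER, min_overlap) != read:
            trimmed += 1
    return trimmed / n_reads


# Compare a 1 bp minimum overlap with an 11 bp minimum overlap.
print(fraction_trimmed(1))
print(fraction_trimmed(11))
```

With a 1 bp minimum, any read that merely ends in "A" loses its last base, so roughly a quarter or more of clean reads are shortened and the base composition at the 3' end is skewed. With an 11 bp minimum, chance matches become very rare. As far as I know, Trim Galore's `--stringency` option controls this overlap and defaults to 1, but check the documentation for the version you run.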
