  • Need some clarification regarding adapter trimming

    Dear Bioinformaticians,

    I'm sorry to admit that after so many posts I've seen regarding the topic, I still have some doubts on how to do it correctly.

    I have sequenced some unicellular parasite genomes on an Illumina MiSeq and plan to map the data to a reference with BWA-MEM, remove duplicates with Picard, and finally call SNPs.
    As a first step, I ran quality control with FastQC and found that some of the samples have adapter contamination of up to ~0.14%. I believe this is because some fragments are too short, so the machine reads through the insert into the adapter. I was planning to use Trimmomatic 0.33 for adapter trimming, but I noticed that the sequences in the TruSeq3-PE-2.fa file are 34 nucleotides long, while in the FastQC report the segments are 50 nt, in some cases including the index.
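
The read-through idea in the paragraph above can be illustrated with a small Python sketch. It is a toy model, not how the instrument works internally: the function name `sequenced_read` and the 40 nt insert are made up for illustration, and the adapter is the TruSeq index 2 entry quoted later in this thread.

```python
# Toy model of adapter read-through (illustrative only).
# ADAPTER is the TruSeq_Adapter_Index_2 sequence quoted in post #2.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG"


def sequenced_read(insert: str, adapter: str, read_len: int) -> str:
    """Return the first `read_len` bases of insert followed by adapter.

    If the insert is at least `read_len` long, the read is pure insert.
    If it is shorter, the remaining cycles run into the adapter.
    The result is shorter than `read_len` if insert + adapter run out.
    """
    return (insert + adapter)[:read_len]


short_insert = "ACGT" * 10   # hypothetical 40 nt fragment
long_insert = "ACGT" * 20    # hypothetical 80 nt fragment

read_short = sequenced_read(short_insert, ADAPTER, 50)
read_long = sequenced_read(long_insert, ADAPTER, 50)

# In the short-insert read the adapter starts where the insert ends.
adapter_start = read_short.find(ADAPTER[:10])
```

In this model the adapter appears at the 3' end of the read, starting at the position equal to the insert length, and only when the insert is shorter than the read length.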

    I wonder whether it would be more correct to create a file with the specific 50 nt sequences to use as a guide for adapter trimming, or whether I should add them to the existing FASTA file.
    What is probably confusing me is that I initially thought the adapter would be at the end of each affected read in the FASTQ file, while it is actually at the beginning of the read. Can you please advise me?

    Thanks for your help,
    Max

  • #2
    Illumina adapters contain a "core" sequence that is common to all of them. Most trimming programs look for matches to this sequence and then trim the rest of the read based on that match.

    I recommend that you take a look at the BBMap suite. You will find BBDuk (scanner/trimmer) and BBMap (aligner) very easy to use. @Brian includes the sequences of all common adapters (adapters.fa) in the "resources" directory of the BBMap distribution.

    Code:
    >TruSeq_Adapter_Index_2
    GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
    >TruSeq_Adapter_Index_3
    GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
    (The first 33 nt of each sequence, GATCGGAAGAGCACACGTCTGAACTCCAGTCAC, were highlighted in the original post as the common core.)
    Last edited by GenoMax; 06-02-2016, 07:15 AM.
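
As a rough illustration of the core-matching idea in this post, here is a minimal Python sketch. It uses exact string matching only, whereas real trimmers such as BBDuk or Trimmomatic tolerate mismatches and use quality information, so treat it as a teaching aid rather than a replacement. The function name `trim_at_core`, the `min_overlap` parameter, and the example insert are hypothetical.

```python
# Shared 33 nt prefix of the two TruSeq index adapters quoted above.
CORE = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def trim_at_core(read: str, core: str = CORE, min_overlap: int = 8) -> str:
    """Return the part of `read` to the left of the adapter core.

    Exact matching only. If the full core is not found, look for a
    partial adapter at the 3' end: the longest read suffix (at least
    `min_overlap` bases) that equals a prefix of the core.
    """
    pos = read.find(core)
    if pos != -1:
        return read[:pos]
    for k in range(min(len(core) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(core[:k]):
            return read[:-k]
    return read


insert = "ACGTTGCAAGGCTTAACCGGTATC"  # made-up genomic sequence

# Same insert followed by adapters carrying two different indexes.
read_index2 = insert + CORE + "CGATGT" + "ATCTCG"
read_index3 = insert + CORE + "TTAGGC" + "ATCTCG"
# Read that ends partway into the adapter.
read_partial = insert + CORE[:12]

trimmed = [trim_at_core(r) for r in (read_index2, read_index3, read_partial)]
```

Because the match is made on the shared core, both indexed reads are trimmed to the same insert without the index sequences being listed anywhere.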


    • #3
      Thank you, GenoMax. I just downloaded BBMap; adapters.fa does indeed contain the adapters with the individual indexes attached. I will use it on my data.

      I am still puzzled by the adapter trimming step in general, though. I tried Trimmomatic on a very small subset (only one sequence had adapter read-through). I used extremely low quality and length cutoffs to make sure that only adapter trimming would be done. The read pair with the adapter was simply dropped, not trimmed. Is this normal?

      I still do not understand why the adapter is at the beginning of the read rather than at the end. Looking at the Illumina videos, I have the impression reads are actually written right to left, but I may be wrong.


      • #4
        Illumina reads are processed left-to-right. If adapter sequence is present at the beginning (left end) of the read, the read pair is an adapter-dimer with no genetic sequence at all, and should be discarded. If adapter sequence is present somewhere else in the read, the portion to the left of the adapter sequence should be retained, and the rest should be trimmed.
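
The rule in this post can be written down as a short Python sketch. The function name `handle_read` and the `min_insert` parameter are hypothetical, matching is exact only, and the adapter is the core sequence from post #2. Under this rule a read that starts with adapter has nothing left after trimming, which may be why the trimmer in post #3 dropped the pair instead of trimming it.

```python
from typing import Optional

# Core adapter sequence quoted in post #2.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def handle_read(read: str, adapter: str = ADAPTER,
                min_insert: int = 1) -> Optional[str]:
    """Apply the left-to-right rule to a single read.

    - no adapter found: keep the read unchanged
    - adapter found with fewer than `min_insert` bases before it:
      treat as adapter dimer and discard (return None)
    - adapter found further in: keep only the bases to its left
    """
    pos = read.find(adapter)
    if pos == -1:
        return read
    if pos < min_insert:
        return None
    return read[:pos]


dimer = ADAPTER + "CGATGT"                    # adapter at the very start
read_through = "ACGTACGTTTGACC" + ADAPTER     # insert, then adapter
clean = "ACGTACGTTTGACCAGT"                   # no adapter at all

results = [handle_read(r) for r in (dimer, read_through, clean)]
```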


        • #5
          Thank you, Brian_Bushnell, this makes sense. I am trying to get a deeper understanding of what I am doing, rather than using the programs as black boxes.
          So, basically, if I want to trim the adapter from read-throughs, I don't really need to add individual indexes (i.e. the barcodes I used for multiplexing). This is because anything on the right side of the insert sequence will be trimmed off anyway. Am I right?


          • #6
            That is correct.

            If you have adapters to the left (at the beginning of the read) of your real sequence, then something is wrong. If you have paired-end data, remember to trim both reads together.
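
To show what trimming both reads together means in practice, here is a small Python sketch that keeps the two mates in step: a pair is kept only if both trimmed mates are long enough, so the R1 and R2 outputs never go out of sync. The names `trim_read` and `trim_pairs`, the `min_len` default, and the example reads are hypothetical. As a simplification, the same 12 nt stand-in adapter (the start of the core from post #2) is searched for in both mates.

```python
from typing import Callable, Iterable, Iterator, Tuple

# First 12 nt of the core from post #2, used for both mates
# as a simplifying assumption.
STAND_IN_ADAPTER = "GATCGGAAGAGC"


def trim_read(read: str, adapter: str = STAND_IN_ADAPTER) -> str:
    """Cut the read at the first exact adapter match, if any."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


def trim_pairs(pairs: Iterable[Tuple[str, str]],
               min_len: int = 20,
               trim: Callable[[str], str] = trim_read
               ) -> Iterator[Tuple[str, str]]:
    """Yield trimmed (r1, r2) pairs; drop both mates if either is too short."""
    for r1, r2 in pairs:
        t1, t2 = trim(r1), trim(r2)
        if len(t1) >= min_len and len(t2) >= min_len:
            yield t1, t2


pairs = [
    ("ACGT" * 8, "TTGCA" * 6),                                   # clean pair
    (STAND_IN_ADAPTER + "ACGT" * 5, "TTGCA" * 6),                # R1 is a dimer
    ("ACGT" * 6 + STAND_IN_ADAPTER, "TGCA" * 6 + STAND_IN_ADAPTER),  # read-through
]

kept = list(trim_pairs(pairs))
```

Processing the mates as a unit is what keeps read order identical in the two output files, which paired-end aligners generally expect.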

            Edit, regarding "(i.e. the barcodes I used for multiplexing)": I just want to make sure you are not referring to inline barcodes in your last post. Illumina barcodes/tags are read separately and are never present in the actual read (R1/R2).
            Last edited by GenoMax; 06-02-2016, 10:33 AM.
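
Since the index is read separately, it is usually reported in the FASTQ header rather than in the read sequence. The Python sketch below pulls it out of a header, assuming the common Illumina layout in which a space is followed by `read:filtered:control:index`. The function name `index_from_header` and the example header are made up, and headers produced by other pipelines or settings may not carry the index at all.

```python
from typing import Optional


def index_from_header(header: str) -> Optional[str]:
    """Return the index field of an Illumina-style FASTQ header, or None.

    Assumes the layout '@<read id> <read>:<filtered>:<control>:<index>'.
    """
    parts = header.rstrip().split(" ", 1)
    if len(parts) != 2:
        return None
    fields = parts[1].split(":")
    if len(fields) >= 4 and fields[3]:
        return fields[3]
    return None


# Made-up header in the assumed layout.
example_header = "@M00001:1:000000000-A1B2C:1:1101:15589:1331 1:N:0:TGACCA"
example_index = index_from_header(example_header)
```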


            • #7
              Great, thank you. These sequences make up about 0.1% of my data, and they all seem to have the adapter at the beginning. I'll verify that they are dropped from my dataset when I trim it.

              Edit: GenoMax: I am actually referring to the Illumina barcodes. The affected reads in my R1 file (forward reads) have the adapter+index at the beginning of the sequence. FastQC, in the overrepresented sequences section, correctly identifies the segment as "TruSeq Adapter, Index 4".
              The R2 file overrepresented sequence (still at the beginning of the reads) is identified as "Illumina Single End PCR Primer 1"; the index does not seem to be part of the segment.
