  • SEQnovice
    Junior Member
    • Nov 2012
    • 6

    Confusion regarding Illumina Adapter Trimming!

    Dear Experts,
    Please accept my apologies if this has been posted elsewhere. I am new to the analysis of RNA-seq data, and I am confused regarding trimming of my adapters from the FASTQ files using cutadapt. I have read through some of the posts but they have gotten me more confused!
    The details of my RNA-seq data are as follows:

    - The platform is Illumina, TruSeq
    - The FASTQ files are paired-end (so I have an R1.fastq and an R2.fastq for each of my samples). I do not know which of R1 and R2 holds the 'forward' reads and which the 'reverse' reads.
    - The files have been demultiplexed, so I have a barcode per sample which matches a specific barcode in a corresponding indexed adapter.
    - I have been provided with a Universal adapter and 5'-3' indexed adapters. I have checked the indexed adapters and they are all exactly identical except at the 6bp barcode in the middle of the sequence.

    Please kindly help me with the following:

    1. I am still trying to understand how Illumina TruSeq works but, in principle, should the trimming be done at the 3' end only, or also at the 5' end of the read? Or is it that only the Universal Adapter should be trimmed at the 5' end, and the indexed adapters at the 3' end?

    NB1: Read length is 101bp as observed in FastQC. This was expected from the experimental setup, but it makes me wonder if I have any adapters to begin with.
    NB2: I have used FastQC to look at a sample of my data (around 198,000 seqs). I didn't find any overrepresented sequences, but I did find increased 5-mer representation in the first 10 base pairs of my reads (which I am assuming to be the 5' end?). There are also more GC fluctuations in those first 10bp.

    2. What is the minimum overlap that effectively constitutes a 'match' between the adapter and the read? Cutadapt has a default value of 3... but wouldn't that necessarily promote 'false matching' as well and lead to culling of sequences that don't have the adapter? I am considering a higher cutoff for the overlap, say 5bp, given the k-mer overrepresentation observed in FastQC.

    3. When providing the adapter sequences, seeing that the indexed adapters only differ at the barcode, is it still prudent to provide the entire sequence of each indexed adapter, in addition to the entire sequence of the universal adapter? What is the bare minimum sequence people have provided for their adapters, both indexed and universal? Does it make a difference?

    4. I am assuming that the same indexed 5'-3' adapter is provided when trimming from both the R1 and R2 reads. I have not attempted to trim the reverse complement or the reversed sequence from either R1 or R2. If I am mistaken in this approach please correct me!

    My apologies for the multiple questions. Thank you in advance for your help with this!
    Much obliged!
    SEQNovice
    Last edited by SEQnovice; 11-29-2012, 11:02 AM.
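
A rough sketch of the arithmetic behind questions 1 and 2 above, not taken from cutadapt itself. It assumes that a 3' adapter can only appear in a read when the sequenced fragment is shorter than the read length, and that bases are uniformly random with no mismatches allowed when estimating chance matches. The function names and the insert sizes are hypothetical.

```python
def adapter_bases_in_read(read_length: int, insert_length: int) -> int:
    """Adapter bases expected at the 3' end of a read (read-through).

    Zero whenever the insert is at least as long as the read.
    """
    return max(0, read_length - insert_length)


def random_match_probability(overlap: int) -> float:
    """Chance that `overlap` random bases equal the adapter's first `overlap` bases.

    Assumes uniform, independent bases and exact matching (an idealisation).
    """
    return 0.25 ** overlap


# 101 bp reads, as in the post; the insert sizes are made up for illustration.
for insert in (80, 101, 300):
    print(insert, adapter_bases_in_read(101, insert))

# Compare the default minimum overlap (3) with the proposed cutoff (5).
for overlap in (3, 5):
    print(overlap, random_match_probability(overlap))
```

Under these assumptions a 3-base overlap matches by chance in roughly 1.6% of adapter-free reads and a 5-base overlap in roughly 0.1%. A chance match of that length removes only those few bases from the read end; the read itself is dropped only if a length or discard filter is also applied.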
  • SEQnovice
    Junior Member
    • Nov 2012
    • 6

    #2
    My questions have not been answered. Could someone kindly reply to some of them or at least direct me to the proper threads where this may have been discussed? I am new to this field and any feedback would be much appreciated!
    Thank you,
    SEQNovice


    • ECO
      --Site Admin--
      • Oct 2007
      • 1360

      #3
      Originally posted by SEQnovice View Post
      My questions have not been answered. Could someone kindly reply to some of them or at least direct me to the proper threads where this may have been discussed? I am new to this field and any feedback would be much appreciated!
      Thank you,
      SEQNovice
      Patience, and searching. Please give your question more than 20 hours before bumping it.


      • SEQnovice
        Junior Member
        • Nov 2012
        • 6

        #4
        My apologies, this is my first post here! Thanks for the tip, and if you do have any feedback I would appreciate it though!


        • blanco
          Member
          • Apr 2012
          • 28

          #5
          I am also interested to know the answer to some of these questions.

          Perhaps to put it more simply: When trimming paired end reads, should the cutadapt command be exactly the same for both forward and reverse reads?


          • fkrueger
            Senior Member
            • Sep 2009
            • 627

            #6
            Originally posted by blanco View Post
            I am also interested to know the answer to some of these questions.

            Perhaps to put it more simply: When trimming paired end reads, should the cutadapt command be exactly the same for both forward and reverse reads?
            Using the same command on both reads will most likely cause your paired-end files to go out of sync. We have written a small solution that calls Cutadapt with (what we think are) sensible parameters (Trim Galore, available here); in its default setting, e.g. trim_galore --paired file1.fq file2.fq, it will trim Illumina adapters from both reads, quality trim reads to a Phred score of 20 and handle paired-end files as you would expect.
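
To make the "out of sync" point concrete, here is a minimal sketch of the general idea, not Trim Galore's or Cutadapt's actual code. It assumes reads are dropped when they fall below a minimum length after trimming; the read names, sequences and the threshold of 5 are invented for illustration, and quality strings are left out for brevity.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (read name, trimmed sequence)


def filter_independently(reads: List[Read], min_len: int) -> List[Read]:
    """Drop short reads from one file without looking at the mate file."""
    return [r for r in reads if len(r[1]) >= min_len]


def filter_as_pairs(r1: List[Read], r2: List[Read],
                    min_len: int) -> Tuple[List[Read], List[Read]]:
    """Keep a pair only if BOTH mates pass, so the two outputs stay in sync."""
    if len(r1) != len(r2):
        raise ValueError("input files are already out of sync")
    kept1, kept2 = [], []
    for a, b in zip(r1, r2):
        if a[0] != b[0]:
            raise ValueError(f"mate names differ: {a[0]} vs {b[0]}")
        if len(a[1]) >= min_len and len(b[1]) >= min_len:
            kept1.append(a)
            kept2.append(b)
    return kept1, kept2


# Hypothetical trimmed reads: pair2 lost most of R1, pair3 lost most of R2.
r1 = [("pair1", "ACGTACGTAC"), ("pair2", "ACG"), ("pair3", "ACGTACGTAC")]
r2 = [("pair1", "TTGCATTGCA"), ("pair2", "TTGCATTGCA"), ("pair3", "TT")]

solo1 = filter_independently(r1, 5)
solo2 = filter_independently(r2, 5)
sync1, sync2 = filter_as_pairs(r1, r2, 5)

print([name for name, _ in solo1], [name for name, _ in solo2])
print([name for name, _ in sync1], [name for name, _ in sync2])
```

Filtering each file on its own leaves the second record of R1 paired with a different read in R2, which is what breaks downstream paired-end tools; deciding per pair avoids that.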


            • blanco
              Member
              • Apr 2012
              • 28

              #7
              Thanks for your quick reply fkrueger - this looks to be something really useful. I have already asked one question in the appropriate thread: http://seqanswers.com/forums/showthr...ht=trim+galore


              • Fernando Seixas
                Junior Member
                • Oct 2013
                • 8

                #8
                Hi all,

                Saw the 1st post of this thread and realized that I see exactly the same patterns described in point 1_NB2 - increased 5-mer representation in the first 10 base pairs, and GC fluctuations in those first 10bps as well (although very slight; and the same happens in the per base sequence content). Even after adapter trimming with cutadapt at both 5' and 3' ends and quality trimming (on Trimmomatic) these 'problems' persist. Any ideas of what might be causing this?

                Also, and I don't know if this relates to the previous question, the per sequence GC content isn't exactly normally distributed - there's a slight bump on the right side of the distribution.

                Thanks!
                Fernando
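
For readers who want to look at the first-10-bp pattern outside FastQC, the sketch below tallies base fractions per read position, which is roughly what a per base sequence content plot summarises. It is an illustration only, not FastQC's implementation; the function name and the toy reads are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def per_position_composition(reads: Iterable[str],
                             n_positions: int) -> List[Dict[str, float]]:
    """Fraction of A, C, G and T at each of the first `n_positions` read positions."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1
    profile = []
    for c in counts:
        total = sum(c.values())
        profile.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return profile


# Toy input; real use would stream sequences from a FASTQ file.
toy_reads = ["ACGT", "ACGA", "TCGA", "ACTA"]
profile = per_position_composition(toy_reads, 4)
for position, fractions in enumerate(profile, start=1):
    gc = fractions["G"] + fractions["C"]
    print(position, fractions, gc)
```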


                • GenoMax
                  Senior Member
                  • Feb 2008
                  • 7142

                  #9
                  There are several posts here that cover illumina sequencing and FastQC. Search for "fastqc duplication".

                  If one of the posts does not answer your question then can you post example plots?
                  Last edited by GenoMax; 01-30-2014, 04:36 PM.


                  • Fernando Seixas
                    Junior Member
                    • Oct 2013
                    • 8

                    #10
                    Thanks for the reply. But one thing I forgot to mention is the kind of data I have. It's whole genome sequencing data from a HiSeq 2000 machine using TruSeq library prep. And if I'm not wrong (my eyes are tired of so much reading xD), all the explanations I found for the behaviours I mentioned above refer to RNA-seq data, at least for the base content instability in the first 10 bp.

                    FastQC images of the problematic parameters are attached.

                    For the k-mer analysis I attached both the 7-mer and 10-mer analyses. I can see a repetitive pattern of 7bp if I align the k-mers (CCTGGCTCCTGGCT), so I looked for all possible 7bp sequences inside this pattern but still couldn't associate any of them with adapters/primers.

                    Thanks!
                    Attached Files
                    Fernando
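
The check described above (every 7-mer inside the repeated pattern compared against adapter sequence) can be sketched as follows. The adapter string is an assumption: it is the commonly used 13-base start of the Illumina TruSeq adapter, not a sequence given in this thread, and the helper names are hypothetical. Both strands of the adapter are searched.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


PATTERN = "CCTGGCTCCTGGCT"   # repeat reported in the post
ADAPTER = "AGATCGGAAGAGC"    # assumed TruSeq adapter prefix, not from the thread

candidates = kmers(PATTERN, 7)
adapter_strands = (ADAPTER, reverse_complement(ADAPTER))

# A candidate is a hit if it occurs anywhere in either adapter strand.
hits = sorted(k for k in candidates if any(k in strand for strand in adapter_strands))

print(sorted(candidates))
print(hits)
```

With this assumed adapter prefix none of the seven distinct 7-mers is found, which agrees with the observation in the post; a longer adapter or primer list would be needed to rule out other sources.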


                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      The first two plots look ok. Is this a "GC" rich organism? Looks like there is some kind of duplication of sequences. Are the qualities acceptable across the entire read?
                      Last edited by GenoMax; 01-30-2014, 05:29 PM.


                      • Fernando Seixas
                        Junior Member
                        • Oct 2013
                        • 8

                        #12
                        But is it really normal to have those slight fluctuations in the first 10 bp? Regarding the GC content, this data is from a mammalian genome. But even when I removed PCR duplicates these problems persisted.
                        And yes, the quality scores are good across the entire reads.

                        Another thing I forgot to mention is that this is PE data.
                        Fernando


                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          Originally posted by Fernando Seixas View Post
                          But is it really normal to have those slight fluctuations in the first 10 bp?
                          Yes. Here is a "good" sample example report posted on the FastQC site. http://www.bioinformatics.babraham.a...qc_report.html

                          But even when I removed PCR duplicates these problems persisted.
                          Another thing I forgot to mention is that this is PE data.
                          What is the aim of your experiment? Are you trying to do de novo assemblies or is there a closely related genome you can use as a reference?

                          As Simon (author of FastQC) has mentioned in past posts here, it is difficult for him to set "limits" for the various tests in FastQC that are universally applicable. So a dataset getting a "fail" in one or more FastQC categories does not automatically mean that there is a problem with the sample.

                          Have you tried doing analysis with the QC'ed data? How do those results look?
                          Last edited by GenoMax; 01-31-2014, 04:28 AM.


                          • Fernando Seixas
                            Junior Member
                            • Oct 2013
                            • 8

                            #14
                            It is for de novo assembly. I understand what you said about the FastQC limits not being universally applicable, but even so, I should worry about the GC content and k-mer plots, no?

                            And no, I'm still stuck at this step because I don't feel confident enough to move on to the next ones.

                            Thanks!
                            Fernando


                            • GenoMax
                              Senior Member
                              • Feb 2008
                              • 7142

                              #15
                              Look at it this way. If there is a problem with the sample/library itself (at this point, if the qualities are good, there is likely no technical issue with the sequencing), you would not be able to do much short of redoing the experiment.

                              Why not press ahead and give the de novo assembly a try? It may fail, and you would be out some compute cycles/time. Since it is a mammalian genome it is probably large(ish), so you are going to have to deal with a number of other computational challenges. Do you have enough sequence (theoretically) with adequate depth (10-15x or more) for the assembly tests?
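
The theoretical depth asked about above is total sequenced bases divided by genome size. A minimal sketch follows; the 200 million read pairs and the 3 Gb genome size are made-up example values (only the 101 bp read length appears earlier in the thread, and for a different dataset), and the function name is hypothetical.

```python
def theoretical_depth(n_read_pairs: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage for paired-end data: both mates contribute bases."""
    total_bases = n_read_pairs * 2 * read_length
    return total_bases / genome_size


# Hypothetical example: 200 million pairs of 101 bp reads on a 3 Gb genome.
depth = theoretical_depth(200_000_000, 101, 3_000_000_000)
print(round(depth, 2))

# Compare against the lower end of the 10-15x range suggested above.
print(depth >= 10)
```

This ignores bases lost to trimming, duplicates and unmappable regions, so the usable depth will be somewhat lower than the figure it returns.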
