  • Confusion regarding Illumina Adapter Trimming!

    Dear Experts,
    Please accept my apologies if this has been posted elsewhere. I am new to the analysis of RNA-seq data, and I am confused about trimming adapters from my FASTQ files using cutadapt. I have read through some of the posts, but they have left me more confused!
    The details of my RNA-seq data are as follows:

    - The platform is Illumina, TruSeq
    - The FASTQ files are paired-end (so I have an R1.fastq and R2.fastq for each of my samples). It is unknown which of R1 and R2 represents the 'forward' or 'reverse' reads.
    - The files have been demultiplexed, so I have a barcode per sample which matches a specific barcode in a corresponding indexed adapter.
    - I have been provided with a Universal adapter and 5'-3' indexed adapters. I have checked the indexed adapters and they are all exactly identical except at the 6bp barcode in the middle of the sequence.

    Please kindly help me with the following:

    1. I am still trying to understand how Illumina TruSeq works, but in principle, should the trimming be done at the 3' end only, or also at the 5' end of the read? Or is it that only the Universal Adapter should be trimmed at the 5' end, and the indexed adapters at the 3' end?

    NB1: Read length is 101bp as observed in FastQC. This was expected in the experimental setup but makes me wonder if I have any adapters to begin with.
    NB2: I have used FastQC to look at a sample of my data (around 198,000 seqs). I didn't find any overrepresented sequences, but I did find increased 5-mer representation in the first 10 base pairs of my reads (which I am assuming to be the 5' end?). There are also more GC fluctuations in those first 10bp.

    2. What is the minimum overlap that is effective to constitute a 'match' between the adapter and the read? Cutadapt has a default value of 3... but wouldn't that necessarily promote 'false matching' as well and lead to culling of sequences that don't have the adapter? I am considering a higher cutoff for the overlap, say 5bp, given the k-mer overrepresentations observed in FastQC.

    3. When providing the adapter sequences, seeing that the indexed adapters only differ at the barcode, is it still prudent to provide the entire sequence of the indexed adapters, in addition to the entire sequence of the universal adapter? What is the bare minimum sequence people have provided for their adapters, both indexed and universal? Does it make a difference?

    4. I am assuming that the same indexed 5'-3' adapter is provided when trimming from both the R1 and R2 reads. I have not attempted to trim the reverse complement or the reversed sequence from either R1 or R2. If I am mistaken in this approach please correct me!

    My apologies for the multiple questions. Thank you in advance for your help with this!
    Much obliged!
    SEQNovice
    Last edited by SEQnovice; 11-29-2012, 11:02 AM.
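On question 2, the trade-off behind the minimum overlap can be put in rough numbers. The sketch below is only a back-of-the-envelope model, assuming every base is equally likely and independent (real reads are not), and the function names are made up for illustration. It estimates how often the last k bases of a read would match the first k bases of an adapter by accident. Note that such an accidental match normally costs only those k bases at the read end; the read itself is not discarded unless a length filter removes it afterwards.

```python
# Chance that the last k bases of a read match the first k bases of an
# adapter purely by accident, ASSUMING uniformly random, independent bases.
# This is an idealisation for illustration, not cutadapt's own calculation.


def chance_match_probability(overlap: int) -> float:
    """Probability of an exact accidental match of `overlap` bases."""
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    return 0.25 ** overlap


def expected_false_trims(n_reads: int, overlap: int) -> float:
    """Expected number of reads with an accidental match of this length."""
    return n_reads * chance_match_probability(overlap)


if __name__ == "__main__":
    # 198,000 is the size of the FastQC sample mentioned in the post.
    for k in (3, 5, 8):
        p = chance_match_probability(k)
        n = expected_false_trims(198_000, k)
        print(f"overlap {k}: about 1 in {round(1 / p)} reads, "
              f"roughly {n:.0f} of 198,000")
```

Under this model, raising the minimum overlap from 3 to 5 cuts accidental matches from about 1 in 64 reads to about 1 in 1024, at the price of leaving very short adapter remnants (3 or 4 bases) untrimmed.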

  • #2
    My questions have not been answered. Could someone kindly reply to some of them or at least direct me to the proper threads where this may have been discussed? I am new to this field and any feedback would be much appreciated!
    Thank you,
    SEQNovice


    • #3
      Originally posted by SEQnovice
      My questions have not been answered. Could someone kindly reply to some of them or at least direct me to the proper threads where this may have been discussed? I am new to this field and any feedback would be much appreciated!
      Thank you,
      SEQNovice
      Patience, and searching. Please give your question more than 20 hours before bumping it.


      • #4
        My apologies, this is my first post here! Thanks for the tip, and if you do have any feedback I would appreciate it though!


        • #5
          I am also interested to know the answer to some of these questions.

          Perhaps to put it more simply: When trimming paired end reads, should the cutadapt command be exactly the same for both forward and reverse reads?


          • #6
            Originally posted by blanco
            I am also interested to know the answer to some of these questions.

            Perhaps to put it more simply: When trimming paired end reads, should the cutadapt command be exactly the same for both forward and reverse reads?
            Using the same command on both reads will most likely cause your paired-end files to go out of sync. We have written a small tool that calls Cutadapt with (what we think are) sensible parameters (Trim Galore, available here). In its default setting, e.g. trim_galore --paired file1.fq file2.fq, it will trim Illumina adapters from both reads, quality trim reads to a Phred score of 20, and handle paired-end files as you would expect.
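A toy illustration of the out-of-sync problem follows. The post does not spell out the mechanism, so this assumes the usual one: a read that becomes too short after trimming is dropped from one file while its mate survives in the other. The data, names, and helper functions are invented for the example and are not how Trim Galore or Cutadapt are implemented.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (read name, sequence); qualities omitted for brevity


def filter_independently(reads: List[Read], min_len: int) -> List[Read]:
    """Drop short reads from ONE file without looking at the mate file."""
    return [r for r in reads if len(r[1]) >= min_len]


def filter_as_pairs(r1: List[Read], r2: List[Read],
                    min_len: int) -> Tuple[List[Read], List[Read]]:
    """Keep a pair only if BOTH mates pass, so the outputs stay in sync."""
    kept = [(a, b) for a, b in zip(r1, r2)
            if len(a[1]) >= min_len and len(b[1]) >= min_len]
    return [a for a, _ in kept], [b for _, b in kept]


# Hypothetical reads after trimming: the R1 mate of pair2 became very short.
r1 = [("pair1", "ACGTACGTAC"), ("pair2", "ACG"), ("pair3", "GGGTTTAAAC")]
r2 = [("pair1", "TTGCATGCAA"), ("pair2", "CCATGGCATG"), ("pair3", "TTTAAACCCG")]

# Each file filtered on its own: R1 loses pair2, R2 keeps it.
solo1 = filter_independently(r1, min_len=5)
solo2 = filter_independently(r2, min_len=5)

# Filtered together: both files lose pair2 and stay aligned record by record.
paired1, paired2 = filter_as_pairs(r1, r2, min_len=5)

if __name__ == "__main__":
    print("independent:", [n for n, _ in solo1], [n for n, _ in solo2])
    print("paired:     ", [n for n, _ in paired1], [n for n, _ in paired2])
```

With independent filtering, the second record in the R1 output is pair3 while the second record in the R2 output is still pair2, which is exactly the mismatch that aligners expecting paired input cannot tolerate.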


            • #7
              Thanks for your quick reply fkrueger - this looks to be something really useful. I have already asked one question in the appropriate thread: http://seqanswers.com/forums/showthr...ht=trim+galore


              • #8
                Hi all,

                I saw the 1st post of this thread and realized that I see exactly the same patterns described in point 1, NB2: increased 5-mer representation in the first 10 base pairs, and GC fluctuations in those first 10bp as well (although very slight; the same happens in the per base sequence content). Even after adapter trimming with cutadapt at both the 5' and 3' ends and quality trimming (with Trimmomatic), these 'problems' persist. Any ideas of what might be causing this?

                Also, and I don't know if this relates to the previous question, the per sequence GC content doesn't have an exactly normal distribution - there's a slight bump on the right side of the distribution.

                Thanks!
                Fernando


                • #9
                  There are several posts here that cover Illumina sequencing and FastQC. Search for "fastqc duplication".

                  If one of the posts does not answer your question then can you post example plots?
                  Last edited by GenoMax; 01-30-2014, 04:36 PM.


                  • #10
                    Thanks for the reply. But one thing I forgot to mention is the kind of data I have. It's whole genome sequencing data from a HiSeq 2000 machine using TruSeq library prep. And if I'm not wrong (my eyes are tired of so much reading xD), all the explanations I found for the behaviours I mentioned above refer to RNA-seq data, at least for the base content instability in the first 10 bp.

                    FastQC images of the problematic parameters are attached.

                    For the k-mer analysis I attached both the 7-mer and 10-mer analyses. I can see a repetitive pattern of 7bp if I align the k-mers (CCTGGCTCCTGGCT), so I looked for all possible 7bp sequences inside this pattern but still couldn't associate any of these with adapters/primers.

                    Thanks!
                    Attached Files
                    Fernando
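The search described in this post can be sketched in a few lines. The example enumerates every distinct 7-mer in the repeat CCTGGCTCCTGGCT and checks each one, on both strands, against a candidate adapter. The 13-base sequence AGATCGGAAGAGC is used here as an assumed stand-in for the common start of the TruSeq adapters; substitute the full sequences supplied with your own kit. Function names are illustrative only.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def distinct_kmers(seq: str, k: int) -> list:
    """All distinct k-mers of seq, in order of first appearance."""
    seen = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer not in seen:
            seen.append(kmer)
    return seen


def kmers_found_in(kmers: list, target: str) -> list:
    """k-mers that occur in target, either as given or reverse-complemented."""
    return [k for k in kmers if k in target or revcomp(k) in target]


# Repeat pattern reported in the post.
pattern = "CCTGGCTCCTGGCT"

# ASSUMPTION: common TruSeq adapter prefix, used only as an example target.
adapter_prefix = "AGATCGGAAGAGC"

candidates = distinct_kmers(pattern, 7)
hits = kmers_found_in(candidates, adapter_prefix)

if __name__ == "__main__":
    print("7-mers in the repeat:", candidates)
    print("7-mers also found in the adapter:", hits)
```

For this assumed adapter prefix the list of hits is empty, which agrees with the observation that the repeat does not look like adapter or primer sequence.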


                    • #11
                      The first two plots look ok. Is this a "GC" rich organism? Looks like there is some kind of duplication of sequences. Are the qualities acceptable across the entire read?
                      Last edited by GenoMax; 01-30-2014, 05:29 PM.


                      • #12
                        But is it really normal to have those slight fluctuations in the first 10 bp? Regarding the GC content, this data is from a mammalian genome. But even when I removed PCR duplicates these problems persisted.
                        And yes, the quality scores are good across the entire reads.

                        Another thing I forgot to mention is that this is PE data.
                        Fernando


                        • #13
                          Originally posted by Fernando Seixas
                          But is it really normal to have those slight fluctuations in the first 10 bp?
                          Yes. Here is a "good" sample example report posted on the FastQC site. http://www.bioinformatics.babraham.a...qc_report.html

                          But even when I removed PCR duplicates these problems persisted.
                          Another thing I forgot to mention is that this is PE data.
                          What is the aim of your experiment? Are you trying to do de novo assemblies or is there a closely related genome you can use as a reference?

                          As Simon (author of FastQC) has mentioned in past posts here, it is difficult for him to set "limits" for the various tests in FastQC that are universally applicable. So a dataset getting a "fail" in one or more categories in FastQC does not automatically mean that there is a problem with the sample.

                          Have you tried doing analysis with the QC'ed data? How do those results look?
                          Last edited by GenoMax; 01-31-2014, 04:28 AM.


                          • #14
                            It is for de novo assembly. I understand what you said about the FastQC limits not being universally applicable, but even so, I should worry about the GC content and k-mer plot, no?

                            And no, I'm still stuck at this stage because I don't feel confident enough to move on to the next steps.

                            Thanks!
                            Fernando


                            • #15
                              Look at it this way. If there is a problem with the sample/library itself (at this point, if the qualities are good, then there is likely no technical issue with the sequencing), you would not be able to do much short of redoing the experiment.

                              Why not press ahead and give the de novo assembly a try? It may fail, and you would be out some compute cycles/time. Since it is a mammalian genome it is probably large(ish), so you are going to have to deal with a number of other computational challenges. Do you have enough sequence (theoretically) with adequate depth (10-15x or more) for the assembly tests?
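The theoretical depth check suggested here is simple arithmetic: total sequenced bases divided by genome size. A minimal sketch follows. The 2.5 Gb genome size is a hypothetical placeholder (the thread only says "mammalian"), the 101 bp read length is borrowed from the first post, and the function names are illustrative.

```python
import math


def expected_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach target_depth on average."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 2_500_000_000  # HYPOTHETICAL mammalian genome size in bases
    read_length = 101            # read length quoted in the first post
    for depth in (10, 15):
        n = reads_needed(depth, read_length, genome_size)
        print(f"{depth}x needs about {n:,} reads "
              f"({n // 2:,} pairs for paired-end data)")
```

For paired-end data, count both mates as reads. This is an average only; trimming, duplicates, and uneven coverage all reduce the usable depth in practice.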
