  • Removing primers, adaptors, how to know if it's good?

    Hi all

    When I received my reads they had a significant overrepresentation of 5-base sequences (listed by FASTQC but not described as adaptors). I then used NGS Toolkit and IlluQC to filter the reads (IlluQC is supposed to remove adaptors even when a library isn't available). My FASTQC reports improved a lot, but I still have some kmer overrepresentation and there's a somewhat "wavy behaviour" in the first few bases. Anyway, I trimmed the reads (10 bases from the 3' end) and assembled them.

    So, my questions are: should I also have trimmed the reads from the 5' end? Looking at the images, how can I tell if I still have contamination? My assembly wasn't fantastic, but the coverage is relatively low, so I don't know if it's the best I can get with these reads. And a truly silly question: aren't the adaptors supposed to be at the ends of the reads? I'm now starting to think that they might be in the middle as well, but in that case they can't be removed by simply trimming/clipping the ends.

    Thanks a lot
    Sandra
    Last edited by SS Santos; 04-09-2013, 04:33 AM. Reason: thumbnails not working
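The fixed-length crop described in this post (10 bases off the 3' end) can be sketched in awk. This is only an illustration, assuming 4-line FASTQ records with unwrapped sequences; `crop_3prime` is a hypothetical helper name, not part of any of the tools mentioned in the thread.

```shell
# Crop a fixed number of bases from the 3' end of every read.
# Assumes each record is exactly 4 lines (header, sequence, "+", quality).
crop_3prime() {
    # $1 = FASTQ file, $2 = number of bases to remove (default 10)
    awk -v n="${2:-10}" '
        NR % 4 == 2 || NR % 4 == 0 {      # sequence and quality lines
            keep = length($0) - n
            if (keep < 0) keep = 0
            print substr($0, 1, keep)
            next
        }
        { print }                         # header and "+" lines are unchanged
    ' "$1"
}

# Usage (file names are placeholders):
#   crop_3prime reads.fastq 10 > reads_cropped.fastq
```

The sequence and quality lines are shortened by the same amount so that the record stays valid. A blind crop like this removes bases whether or not they contain adapter, which is why it cannot replace adapter-aware clipping.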

  • #2
    adaptors, how to know if it's good

    Hi Sandra,

    Yes, adapters are supposed to be at the 3' end of the reads, but if your insert is very short, you will end up reading into the adapter sequence sooner, so you can get adapter sequences somewhere in the middle of the read.

    Whether or not you should do more trimming depends on what you are doing with your data. If you are doing de novo assembly then it helps to remove as much of the adapters as possible.

    If you know how to use Linux, then you can use 'grep' to check if certain sequences are present in your reads before and after trimming.

    Trimmomatic can trim low-quality bases from the 5' and 3' ends of Illumina reads based on base quality scores. It will also remove adapters, but you do need to have a file with the adapter sequences. I think the latest version of Trimmomatic includes a file with adapter sequences.


    Best wishes,
    Maria



    • #3
      Hi Maria, that was fast, I was still adding the images!

      Yes, I've used Trimmomatic, in fact I feel like all the options (Fastx, seqtk, etc) generally give the same results. For example, one of the de novo assemblers I tested (Edena) also includes an option to truncate sequence length. I just wanted to be sure, looking at my reports (these are before and after filtering but not trimming of the last 10 bases), if everything is ok. How can I know, from looking at the reports? Should I have completely straight lines for the per base content, etc, including for the first bases?

      Thanks



      • #4
        Library Type?

        Hi Santos,

        What kind of prep was done on these libraries? If the initial sequences are not diverse, you can see a wavy pattern in the first few bases. This happens with RNA-seq libraries and can also occur in ChIP-seq, etc.

        ~FWOS



        • #5
          Hi

          This is the method I received from the sequencing company.

          We used a whole-genome shotgun sequencing strategy and Illumina Genome Analyser sequencing technology. A 100 bp paired-end run was performed with the strains described here in one lane. Genomic DNA was sheared by a nebulizer to generate DNA fragments for the Illumina Paired-End Sequencing method. DNA libraries (20 ng/μl) were constructed by ligating the specific oligonucleotides (Illumina adapters) designed for PE sequencing to both ends of DNA fragments with the TA cloning method. The ligated DNA was then size selected on a 2% agarose gel. DNA fragments of ~ 500 bp were excised from the preparative portion of the gel. DNA was then recovered using a Qiagen gel extraction kit and was PCR amplified to produce the final DNA library. Five picomoles of DNA from each strain were loaded onto two lanes of the sequencing chip, and the clusters were generated on the cluster generation station of the GAIIx using the Illumina cluster generation kit. Bacteriophage φX174 DNA was used as a control. In the case of paired-end reads, distinct adaptors from Illumina were ligated to each end with PCR primers that allowed reading of each end as separate runs. The sequencing reaction was run for 100 cycles (tagging, imaging, and cleavage of one terminal base at a time), and four images of each tile on the chip were taken in different wavelengths for exciting each base-specific fluorophore. For paired-end reads, data were collected as two sets of matched 100-bp reads. Reads for each of the indexed samples were then separated using a custom Perl script. Image analysis and base calling were done using the Illumina GA Pipeline software.



          • #6
            Hi Sandra,

            The method looks like a pretty standard Illumina protocol.

            Get the company that did the sequencing to tell you what version of the Illumina kit was used for the sample prep and/or tell you what adapter sequences they used.

            Your QC images show that you have a very high %GC, is that what you expect for the species that you are sequencing?
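One way to check the overall %GC directly from the reads, rather than reading it off the FastQC plot, is sketched below. It assumes 4-line FASTQ records; `gc_percent` is a hypothetical helper name. N bases are counted in the total, so the figure is only approximate for reads with many Ns.

```shell
# Report the overall GC percentage across all reads in a FASTQ file.
gc_percent() {
    # $1 = FASTQ file
    awk 'NR % 4 == 2 {                       # sequence lines only
             seq = toupper($0)
             total += length(seq)
             gc += gsub(/[GC]/, "", seq)     # gsub returns the number of G/C bases removed
         }
         END { if (total > 0) printf "%.1f\n", 100 * gc / total }' "$1"
}

# Usage (file name is a placeholder):
#   gc_percent reads.fastq
```

Comparing this number with the GC content expected for the species gives a quick sanity check for gross contamination.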

            The before and after images of per-base quality show an improvement in quality after filtering, but I think you could still have adapter sequences present, because they wouldn't necessarily affect the quality, or be present at the same place in the reads, although you do expect them more towards the 3' end. What filtering steps did you do?



            • #7
              Hi Mastal

              GC content should be 67%. I used the IlluQC tool for paired-end Illumina with standard parameters (Phred cut-off 20, cut-off for % of read length with that quality 70%). I had previously used Quake to correct technical errors, but the developer of the assembler I was testing at the time advised me not to, because it can modify some reads. The input of IlluQC includes a primer/adaptor library, but I didn't have one and it runs without it. The "after" report is after filtering only. I removed the last 10 bases before assembling.

              I'm going to ask the company for the adaptor sequences. Is there any way or tool that can be used to check if the adaptors are still present? In the first report, do those peaks in the kmer profiles correspond to adaptors?

              Thanks



              • #8
                Removing primers, adaptors, how to know if it's good?

                Originally posted by SS Santos

                Is there any way or tool that can be used to check if the adaptors are still present?
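                From a Linux command line:
                grep -c 'adapter_sequence' reads.fastq

                -c reports the number of lines (in practice, reads) in which 'adapter_sequence' is found.

                grep -B1 -A2 --no-group-separator 'adapter_sequence' reads.fastq > reads_with_adapters.fastq

                will give you the 4 lines of FASTQ for each read matching the adapter: -B1 and -A2 pull in the header line before the matching sequence line and the '+' and quality lines after it, and --no-group-separator (a GNU grep option) suppresses the '--' lines that grep otherwise prints between matches.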
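Context-based grep works on raw lines, so it can also match a header or quality line by accident. A record-aware alternative is sketched below; it assumes 4-line FASTQ records and exact matches only, and `extract_reads_with` is a hypothetical helper name.

```shell
# Print every complete FASTQ record whose sequence line contains a given string.
extract_reads_with() {
    # $1 = sequence to search for, $2 = FASTQ file
    awk -v pat="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0; hit = (index(seq, pat) > 0) }   # test the sequence line only
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (hit) print header "\n" seq "\n" plus "\n" $0 }
    ' "$2"
}

# Usage (names are placeholders):
#   extract_reads_with 'adapter_sequence' reads.fastq > reads_with_adapters.fastq
```

Because the whole record is buffered before printing, the output is itself a valid FASTQ file that can be inspected or counted separately.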



                • #9
                  I got this reply from the company, when I asked for the adapter sequences. Not exactly what I was expecting! Does it mean that the adaptors are standard or something??



                  We used Illumina sequencing method to determine the genome sequences of your bacterial strains.
                  The Solexa/Illumina sequencing method is similar to Sanger sequencing, but it uses modified dNTPs containing a terminator which blocks further polymerization- so only a single base can be added by a polymerase enzyme to each growing DNA copy strand. The sequencing reaction is conducted simultaneously on a very large number (many millions in fact) of different template molecules spread out on a solid surface. The terminator also contains a fluorescent label, which can be detected by a camera. Only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called “reversible terminators”. Finally, another four cycles of dNTP additions are initiated. Since single bases are added to all templates in a uniform fashion, the sequencing process produces a set of DNA sequence reads of uniform length.
                  Chemistry for Next-Generation Sequencing
                  Illumina’s sequencing by synthesis (SBS) technology is the most successful and widely-adopted next-generation sequencing platform worldwide. TruSeq technology supports massively parallel sequencing using a proprietary reversible terminator-based method that enables detection of single bases as they are incorporated into growing DNA strands. A fluorescently-labeled terminator is imaged as each dNTP is added and then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias. The end result is true base-by-base sequencing that enables the industry’s most accurate data for a broad range of applications.



                  • #10
                    The adapters are standard, but Illumina does change them from time to time, so it would be useful for them to tell you the name of the kit they used and the version number, and also the sequences or Illumina codes of the barcodes they used with your samples.

                    As to your previous question about the kmer over-representation, I'm afraid I don't really understand the significance of the kmer plots in FastQC.



                    • #11
                      Originally posted by SS Santos


                      Only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called “reversible terminators”. Finally, another four cycles of dNTP additions are initiated.

                      By the way, that bit is wrong: with the Illumina technology, all 4 bases are added in each cycle, but each base is labelled with a different fluorescent dye.



                      • #12
                        So they sent me this:

                        Adapters sequence:
                        5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
                        5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

                        sample barcode sequence
                        IST4113 TAGCTT
                        IST4129 AGTTCC
                        IST4134 CTTGTA
                        IST439 AGTCAA

                        Do I create a text file with this? How can I use it as an input for trimming/filtering tools?

                        Thanks



                        • #13
                          OK, those look like the Illumina TruSeq adapters.

                          The latest version of trimmomatic comes with a file containing those adapter sequences, so it should work fine with your files in the ILLUMINACLIP step.
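If a custom adapter file is wanted instead of the bundled one, it is just a FASTA file. The sketch below writes the two sequences quoted earlier in the thread and wraps a paired-end Trimmomatic call. The function names, file names, record names, and the thresholds `2:30:10`, `SLIDINGWINDOW:4:15` and `MINLEN:36` are illustrative assumptions (the thresholds follow the example in the Trimmomatic documentation), not settings recommended in this thread.

```shell
# Write the two adapter sequences to a FASTA file.
# Record names are arbitrary here; names without a /1 or /2 suffix should be
# checked against both reads in Trimmomatic's "simple" mode.
write_adapter_fasta() {
    # $1 = output file (default adapters.fa)
    cat > "${1:-adapters.fa}" <<'EOF'
>adapter_1
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
>adapter_2
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
EOF
}

# Paired-end Trimmomatic run using that file (defined here, not executed).
run_trimmomatic_pe() {
    # $1 = path to the Trimmomatic jar, $2 = R1 FASTQ, $3 = R2 FASTQ, $4 = adapter FASTA
    java -jar "$1" PE "$2" "$3" \
        R1_paired.fq R1_unpaired.fq R2_paired.fq R2_unpaired.fq \
        ILLUMINACLIP:"$4":2:30:10 SLIDINGWINDOW:4:15 MINLEN:36
}

# Usage (paths are placeholders):
#   write_adapter_fasta adapters.fa
#   run_trimmomatic_pe /path/to/trimmomatic.jar reads_R1.fastq reads_R2.fastq adapters.fa
```

The 5' P- prefix is deliberately left out of the FASTA file, since the phosphate group is not part of the base sequence.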

                          To know whether things are improved before and after trimming, you should try and find how many times the adapters are present in your reads. Normally one adapter sequence is present in one of the read files, and the reverse complement of the other adapter is present in the file with the other reads of the pair.

                          Have a look at this web page from the U. of Texas at Austin, to have more of an idea how the Illumina adapters appear at the ends of the reads:



                          To count how many times the adapters are present in your file:

                          $ grep -c 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT' reads.fastq

                          You may also want to try this with a substring of the adapter sequence, as not all the reads will end up reading into the full adapter sequence.
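The substring idea can be turned into a small sweep over prefix lengths, sketched below. It assumes 4-line FASTQ records and exact matches; `count_adapter_prefixes` is a hypothetical helper name. A read that runs into an adapter sees the beginning of the adapter as it appears in the read, so the sequence passed in should be in that orientation (which may be the reverse complement of the sequence as supplied).

```shell
# For each requested prefix length, count the reads containing that adapter prefix.
count_adapter_prefixes() {
    # $1 = adapter sequence, $2 = FASTQ file, remaining arguments = prefix lengths
    adapter=$1; fastq=$2; shift 2
    for len in "$@"; do
        prefix=$(printf '%s' "$adapter" | cut -c1-"$len")
        # Count sequence lines only (line 2 of each 4-line record)
        n=$(awk -v pat="$prefix" 'NR % 4 == 2 && index($0, pat) { c++ } END { print c + 0 }' "$fastq")
        printf '%s\t%s\t%s\n' "$len" "$prefix" "$n"
    done
}

# Usage (file name is a placeholder):
#   count_adapter_prefixes GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG reads.fastq 8 12 20 33
```

Very short prefixes will also match genomic sequence by chance, so counts for short lengths are upper bounds rather than true adapter counts.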

                          Hope this helps,
                          Maria



                          • #14
                            Hi Maria

                            I finally got back to this. I used the grep -c command on my reads and it worked fine. Just a couple of really basic questions, if you can help me...

                            What's the difference between these 2 (the webpage you recommended is down)? Can I use just one of them to look for adapters? What does the P- mean?
                            5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
                            5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

                            I also used the command to look for barcode sequences, and there were 9179369 in the raw data, and 5687733 in the filtered and cropped (10 bases from 3' ends) reads! This is still a lot, right? Where are the barcodes in the reads? Near the ends? Can they be removed during trimming if I use their sequences?

                            Thanks again

                            Sandra



                            • #15
                              Removing primers, adaptors, how to know if it's good?

                              Hi Sandra,

                              The P stands for phosphate, it means there is a phosphate group at the 5' end of the adapter, but this will not appear in any of the sequence files.

                              The difference between the two sequences is that, if you have paired-end reads, one of the sequences (or its reverse complement) will appear towards the ends of R1 when your DNA insert is too short and you read into the adapters, and the other sequence (or its reverse complement) will appear in R2.

                              You can use grep as before to check which sequence appears in R1 or R2.
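A sketch of that check, covering both adapters in both orientations across both read files, is given below. It assumes 4-line FASTQ records and exact full-length matches; `revcomp`, `count_seq` and `adapter_table` are hypothetical helper names.

```shell
# Reverse-complement a DNA sequence.
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

# Count the reads whose sequence line contains the given string.
count_seq() {
    # $1 = sequence, $2 = FASTQ file
    awk -v pat="$1" 'NR % 4 == 2 && index($0, pat) { c++ } END { print c + 0 }' "$2"
}

# Tab-separated table: adapter, orientation, file, number of reads containing it.
adapter_table() {
    # $1 = R1 FASTQ, $2 = R2 FASTQ
    for adapter in GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG ACACTCTTTCCCTACACGACGCTCTTCCGATCT; do
        rc=$(revcomp "$adapter")
        for file in "$1" "$2"; do
            printf '%s\tforward\t%s\t%s\n' "$adapter" "$file" "$(count_seq "$adapter" "$file")"
            printf '%s\trevcomp\t%s\t%s\n' "$adapter" "$file" "$(count_seq "$rc" "$file")"
        done
    done
}

# Usage (file names are placeholders):
#   adapter_table reads_R1.fastq reads_R2.fastq
```

The rows with the highest counts show which orientation of which adapter is being read into in each file. Full-length matches undercount, because many reads only reach part of the adapter.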

                              Trimmomatic should remove the barcode sequences because usually you have something like this:

                              5' read_sequence/adapter/barcode/adapter/flowcell_sequences 3'

                              and the shorter your DNA insert, the more of the various adapter sequences you get at the 3' end of your read.

                              Trimmomatic usually looks for a good match with the adapter sequence that would be immediately adjacent to your DNA insert, and clips the read there,

                              5' read_sequence/

                              so that all the downstream stuff should be removed.
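The clipping behaviour described above can be illustrated with a simplified sketch that cuts each read at the first exact occurrence of the adapter. This is not how Trimmomatic is implemented (it scores alignments and tolerates mismatches); it only shows the effect of removing the adapter and everything downstream. It assumes 4-line FASTQ records, and `clip_fastq_at_adapter` is a hypothetical helper name.

```shell
# Cut each read (and its quality string) at the first exact adapter match.
clip_fastq_at_adapter() {
    # $1 = adapter sequence as it appears in the read, $2 = FASTQ file
    awk -v pat="$1" '
        NR % 4 == 2 {                                   # sequence line
            pos = index($0, pat)
            keep = (pos > 0) ? pos - 1 : length($0)     # keep everything before the adapter
            print substr($0, 1, keep)
            next
        }
        NR % 4 == 0 { print substr($0, 1, keep); next } # quality line, same length
        { print }                                       # header and "+" lines
    ' "$2"
}

# Usage (file names are placeholders):
#   clip_fastq_at_adapter AGATCGGAAGAGC reads_R1.fastq > reads_R1_clipped.fastq
```

Reads that start with the adapter end up empty in this sketch; a real trimming tool would normally discard reads that fall below a minimum length.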
