**mastal** · 04-09-2013, 04:26 AM

Removing primers, adaptors, how to know if it's good?

Hi Sandra,

Yes, adapters are supposed to be at the 3' end of the reads, but if your insert is very short you will read into the adapter sequence sooner, so the adapter sequence can start somewhere in the middle of the read.

Whether or not you should do more trimming depends on what you are doing with your data. If you are doing de novo assembly then it helps to remove as much of the adapters as possible.

If you know how to use Linux, then you can use 'grep' to check if certain sequences are present in your reads before and after trimming.

Trimmomatic will trim Illumina reads from the 3' end (and, if you ask it to, from the 5' end) based on base quality scores. It will also remove adapters, but you do need to have a file with the adapter sequences. I think the latest version of Trimmomatic includes a file with adapter sequences.

Best wishes,
Maria

**SS Santos** · 04-09-2013, 04:38 AM

Hi Maria, that was fast, I was still adding the images!

Yes, I've used Trimmomatic, in fact I feel like all the options (Fastx, seqtk, etc) generally give the same results. For example, one of the de novo assemblers I tested (Edena) also includes an option to truncate sequence length. I just wanted to be sure, looking at my reports (these are before and after filtering but not trimming of the last 10 bases), if everything is ok. How can I know, from looking at the reports? Should I have completely straight lines for the per base content, etc, including for the first bases?

Thanks

**FWOS** · 04-09-2013, 04:53 AM

Library Type?

Hi Santos,

What kind of prep was done on these libraries? If the initial sequences are not diverse, you can see a wavy pattern in the first few bases. This happens with RNA seq libraries and can also occur in ChIP Seq, etc...

~FWOS

**SS Santos** · 04-09-2013, 05:02 AM

Hi

This is the method I received from the sequencing company.

We used a whole-genome shotgun sequencing strategy and Illumina Genome Analyser sequencing technology. A 100 bp paired-end run was performed with the strains described here in one lane. Genomic DNA was sheared by a nebulizer to generate DNA fragments for the Illumina Paired-End Sequencing method. DNA libraries (20 ng/μl) were constructed by ligating the specific oligonucleotides (Illumina adapters) designed for PE sequencing to both ends of DNA fragments with the TA cloning method. The ligated DNA was then size selected on a 2% agarose gel. DNA fragments of ~500 bp were excised from the preparative portion of the gel. DNA was then recovered using a Qiagen gel extraction kit and was PCR amplified to produce the final DNA library. Five picomoles of DNA from each strain were loaded onto two lanes of the sequencing chip, and the clusters were generated on the cluster generation station of the GAIIx using the Illumina cluster generation kit. Bacteriophage ΦX174 DNA was used as a control. In the case of paired-end reads, distinct adaptors from Illumina were ligated to each end with PCR primers that allowed reading of each end as separate runs. The sequencing reaction was run for 100 cycles (tagging, imaging, and cleavage of one terminal base at a time), and four images of each tile on the chip were taken in different wavelengths for exciting each base-specific fluorophore. For paired-end reads, data were collected as two sets of matched 100-bp reads. Reads for each of the indexed samples were then separated using a custom Perl script. Image analysis and base calling were done using the Illumina GA Pipeline software.

**mastal** · 04-09-2013, 05:23 AM

Hi Sandra,

The method looks like a pretty standard Illumina protocol.

Get the company that did the sequencing to tell you what version of the Illumina kit was used for the sample prep and/or tell you what adapter sequences they used.

Your QC images show that you have a very high %GC, is that what you expect for the species that you are sequencing?

The before and after images of per-base quality show an improvement in quality after filtering, but I think you could still have adapter sequences present, because they wouldn't necessarily affect the quality, or be present at the same place in the reads, although you do expect them more towards the 3' end. What filtering steps did you do?

**SS Santos** · 04-09-2013, 05:43 AM

Hi Mastal

GC content should be 67%. I used the IlluQC tool for paired-end Illumina with standard parameters (Phred cut-off 20, cut-off for % of read length with that quality 70%). I had previously used Quake to correct technical errors, but the developer of the assembler I was testing at the time advised me not to, because it can modify some reads. The input of IlluQC includes a primer/adaptor library, but I didn't have one and it runs without it. The "after" report is after filtering only. I removed the last 10 bases before assembling.

I'm going to ask the company for the adaptor sequences. Is there any way or tool that can be used to check if the adaptors are still present? In the first report, do those peaks in the kmer profiles correspond to that?

Thanks

**mastal** · 04-09-2013, 06:18 AM

Removing primers, adaptors, how to know if it's good?

Originally posted by SS Santos

Is there any way or tool that can be used to check if the adaptors are still present?

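From a Linux command line:
grep -c 'adapter_sequence' reads.fastq

-c tells you how many lines in the reads file contain 'adapter_sequence'.

grep --no-group-separator -B1 -A2 'adapter_sequence' reads.fastq > reads_with_adapters.fastq

will give you the 4 lines of fastq for each read matching the adapter: the header line before the sequence, and the '+' and quality lines after it. With GNU grep, --no-group-separator stops it printing a '--' line between non-adjacent matches. Don't add -n here, because the line numbers would be written into the output file.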
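A plain grep looks at every line of the file, including header and quality lines. As a hedged alternative, the sketch below only tests the sequence line of each record. It assumes unwrapped 4-line FASTQ records, and the function names `count_reads_with` and `extract_reads_with` are made up for this example.

```shell
# Count reads whose sequence line (line 2 of each 4-line record) contains a pattern.
count_reads_with() {
    # $1 = sequence to look for, $2 = FASTQ file
    awk -v pat="$1" 'NR % 4 == 2 && index($0, pat) > 0 { n++ } END { print n + 0 }' "$2"
}

# Print the complete 4-line record of every read containing the pattern.
extract_reads_with() {
    # $1 = sequence to look for, $2 = FASTQ file
    awk -v pat="$1" '
        { rec[NR % 4] = $0 }                      # buffer header, sequence, "+", quality
        NR % 4 == 0 && index(rec[2], pat) > 0 {   # record complete: test its sequence line
            print rec[1]; print rec[2]; print rec[3]; print rec[0]
        }' "$2"
}
```

Usage would be along the lines of `count_reads_with 'adapter_sequence' reads.fastq`, run once on the raw reads and once on the trimmed reads so the two counts can be compared.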

**SS Santos** · 04-09-2013, 07:24 AM

I got this reply from the company, when I asked for the adapter sequences. Not exactly what I was expecting! Does it mean that the adaptors are standard or something??

We used Illumina sequencing method to determine the genome sequences of your bacterial strains.
The Solexa/Illumina sequencing method is similar to Sanger sequencing, but it uses modified dNTPs containing a terminator which blocks further polymerization- so only a single base can be added by a polymerase enzyme to each growing DNA copy strand. The sequencing reaction is conducted simultaneously on a very large number (many millions in fact) of different template molecules spread out on a solid surface. The terminator also contains a fluorescent label, which can be detected by a camera. Only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called “reversible terminators”. Finally, another four cycles of dNTP additions are initiated. Since single bases are added to all templates in a uniform fashion, the sequencing process produces a set of DNA sequence reads of uniform length.
Chemistry for Next-Generation Sequencing
Illumina’s sequencing by synthesis (SBS) technology is the most successful and widely-adopted next-generation sequencing platform worldwide. TruSeq technology supports massively parallel sequencing using a proprietary reversible terminator-based method that enables detection of single bases as they are incorporated into growing DNA strands. A fluorescently-labeled terminator is imaged as each dNTP is added and then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias. The end result is true base-by-base sequencing that enables the industry’s most accurate data for a broad range of applications.

**mastal** · 04-09-2013, 08:06 AM

The adapters are standard, but Illumina does change them from time to time, so it would be useful for them to tell you the name of the kit they used and the version number, and also the sequences or Illumina codes of the barcodes they used with your samples.

As to your previous question about the kmer over-representation, I'm afraid I don't really understand the significance of the kmer plots in FastQC.

**mastal** · 04-09-2013, 08:21 AM

Originally posted by SS Santos

Only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called “reversible terminators”. Finally, another four cycles of dNTP additions are initiated.

By the way, that bit is wrong. With the Illumina technology all 4 bases are added in each cycle, and each base is labelled with a different fluorescent dye.

**SS Santos** · 04-10-2013, 07:02 AM

So they sent me this:

Adapters sequence:
5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

sample barcode sequence
IST4113 TAGCTT
IST4129 AGTTCC
IST4134 CTTGTA
IST439 AGTCAA

Do I create a text file with this? How can I use it as an input for trimming/filtering tools?

Thanks

**mastal** · 04-10-2013, 07:27 AM

OK, those look like the Illumina TruSeq adapters.

The latest version of trimmomatic comes with a file containing those adapter sequences, so it should work fine with your files in the ILLUMINACLIP step.
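If a custom adapter file is wanted instead of the bundled one, ILLUMINACLIP expects it in FASTA format. The sketch below writes one from name/sequence pairs. The function name `write_adapter_fasta`, the entry names, the jar path and every file name are placeholders, and the `2:30:10` values are simply the ones used in the example in the Trimmomatic documentation, not settings tuned for this data.

```shell
# Write name/sequence pairs to a FASTA file.
# Usage: write_adapter_fasta out.fa name1 seq1 [name2 seq2 ...]
write_adapter_fasta() {
    fasta_out=$1
    shift
    : > "$fasta_out"                         # create or empty the output file
    while [ "$#" -ge 2 ]; do
        printf '>%s\n%s\n' "$1" "$2" >> "$fasta_out"
        shift 2
    done
}

# Hypothetical invocation (all names below are placeholders):
# write_adapter_fasta my_adapters.fa \
#     adapter_1 GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG \
#     adapter_2 ACACTCTTTCCCTACACGACGCTCTTCCGATCT
# java -jar trimmomatic.jar PE reads_R1.fastq reads_R2.fastq \
#     R1_paired.fastq R1_unpaired.fastq R2_paired.fastq R2_unpaired.fastq \
#     ILLUMINACLIP:my_adapters.fa:2:30:10 MINLEN:36
```

According to the Trimmomatic manual, entries named like this are matched in "simple" mode, and a reverse-complemented form is only searched for if it is listed as its own entry, so the reverse complements may need to be added as extra name/sequence pairs.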

To know whether things are improved before and after trimming, you should try and find how many times the adapters are present in your reads. Normally one adapter sequence is present in one of the read files, and the reverse complement of the other adapter is present in the file with the other reads of the pair.

Have a look at this web page from the U. of Texas at Austin, to have more of an idea how the Illumina adapters appear at the ends of the reads:

Illumina - all flavors - Genomic Sequencing and Analysis Facility User Support Wiki - UT Austin Wikis

https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavors

To count how many times the adapters are present in your file:

$ grep -c 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT' reads.fastq

You may also want to try this with a substring of the adapter sequence, as not all the reads will end up reading into the full adapter sequence.
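The substring idea can be sketched as a small loop over prefix lengths. This assumes 4-line FASTQ records and assumes the query is given in the orientation in which it appears in the read, so that its first bases are the ones next to the insert. The function name `count_by_prefix_length` is hypothetical.

```shell
# For several prefix lengths, count reads whose sequence line contains
# the first k bases of a query sequence.
count_by_prefix_length() {
    # $1 = query sequence, $2 = FASTQ file, remaining args = prefix lengths
    query=$1
    fastq=$2
    shift 2
    for k in "$@"; do
        prefix=$(printf '%s' "$query" | cut -c "1-$k")
        n=$(awk -v pat="$prefix" \
            'NR % 4 == 2 && index($0, pat) > 0 { n++ } END { print n + 0 }' "$fastq")
        printf '%s\t%s\t%s\n' "$k" "$prefix" "$n"   # length, prefix, matching reads
    done
}
```

For example, `count_by_prefix_length "$adapter" reads.fastq 12 16 20` prints one tab-separated line per length. Very short prefixes will also match genomic sequence by chance, so the counts are only informative above some minimum length.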

Hope this helps,
Maria

**SS Santos** · 05-08-2013, 04:26 AM

Hi Maria

I finally got back to this. I used the grep -c command on my reads and it worked fine. Just a couple of really basic questions, if you can help me...

What's the difference between these 2 (the webpage you recommended is down)? Can I use just one of them to look for adapters? What does the P- mean?
5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

I also used the command to look for barcode sequences, and there were 9179369 in the raw data, and 5687733 in the filtered and cropped (10 bases from 3' ends) reads! This is still a lot right? Where are the barcodes in the reads? Near the ends? Can they be removed during trimming if I use their sequences?

Thanks again

Sandra

**mastal** · 05-08-2013, 05:13 AM

Removing primers, adaptors, how to know if it's good?

Hi Sandra,

The P stands for phosphate, it means there is a phosphate group at the 5' end of the adapter, but this will not appear in any of the sequence files.

The difference between the two sequences is that, if you have paired-end reads, one of the sequences, or its reverse complement, will appear towards the ends of R1 when your DNA insert is too short and you read into the adapters, and the other sequence, or its reverse complement, will appear in R2.

You can use grep as before to check which sequence appears in R1 or R2.
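One way to run that check for both orientations at once is sketched below. It assumes 4-line FASTQ records and uppercase A/C/G/T sequences, and the helper names `revcomp` and `adapter_orientation_table` are invented for this example.

```shell
# Reverse-complement a DNA sequence (uppercase A/C/G/T only).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGT' 'TGCA'
}

# For each FASTQ file, count the reads containing each adapter
# as given and as its reverse complement.
adapter_orientation_table() {
    # $1, $2 = adapter sequences; remaining args = FASTQ files (e.g. R1 and R2)
    a1=$1
    a2=$2
    shift 2
    for fq in "$@"; do
        for seq in "$a1" "$(revcomp "$a1")" "$a2" "$(revcomp "$a2")"; do
            n=$(awk -v pat="$seq" \
                'NR % 4 == 2 && index($0, pat) > 0 { n++ } END { print n + 0 }' "$fq")
            printf '%s\t%s\t%s\n' "$fq" "$seq" "$n"   # file, sequence searched, matching reads
        done
    done
}
```

Called with the two adapter sequences followed by the R1 and R2 files, it prints four lines per file, which makes it easy to see which orientation shows up in which file.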

Trimmomatic should remove the barcode sequences because usually you have something like this:

5' read_sequence/adapter/barcode/adapter/flowcell_sequences 3'

and the shorter your DNA insert, the more of the various adapter sequences you get at the 3' end of your read.

Trimmomatic usually looks for a good match with the adapter sequence that would be immediately adjacent to your DNA insert, and clips the read there,

5' read_sequence/

so that all the downstream stuff should be removed.
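To make the clipping idea concrete, here is a deliberately simplified sketch that cuts each read at the first exact occurrence of an adapter and trims the quality line to the same length. It is not Trimmomatic's algorithm, which scores approximate matches. It assumes 4-line FASTQ records, and the function name `clip_fastq_at_adapter` is made up.

```shell
# Cut every read at the first exact match of the adapter, dropping the
# adapter and everything 3' of it; reads without a match are left unchanged.
clip_fastq_at_adapter() {
    # $1 = adapter as it appears in the read, $2 = FASTQ file
    awk -v ad="$1" '
        NR % 4 == 2 {                               # sequence line
            p = index($0, ad)                       # 1-based position of adapter, 0 if absent
            keep = (p > 0) ? p - 1 : length($0)     # number of bases to keep
            print substr($0, 1, keep)
            next
        }
        NR % 4 == 0 { print substr($0, 1, keep); next }   # quality line, same length
        { print }                                   # header and "+" lines
    ' "$2"
}
```

A read that starts with the adapter comes out empty here, which is why real trimmers also apply a minimum length filter after clipping.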
