SEQanswers

Old 04-09-2013, 05:08 AM   #1
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default Removing primers, adaptors, how to know if it's good?

Hi all

When I received my reads they had a significant overrepresentation of 5-base sequences (listed by FASTQC but not described as adaptors). I then used NGS Toolkit and IlluQC to filter the reads (IlluQC is supposed to remove adaptors even when a library isn't available). My FASTQC reports improved a lot, but I still have some kmer overrepresentation and there's a somewhat "wavy behaviour" in the first few bases. Anyway, I trimmed the reads (10 bases from the 3' end) and assembled them.

So, my questions are: should I have trimmed the reads from the 5' end also? Looking at the images, how can I tell if I still have a contamination? My assembly wasn't fantastic, but the coverage is relatively low, so I don't know if it's the best I can get with these reads. And a truly silly question: aren't the adaptors supposed to be in the ends of the reads? I'm now starting to think that they might be in the middle also, but in that case they can't be removed by simply trimming/clipping the ends.

Thanks a lot
Sandra
Attached Images
File Type: png per_base_quality.png (146.2 KB, 58 views)
File Type: png per_base_sequence_content.png (124.3 KB, 47 views)
File Type: png per_base_gc_content.png (115.8 KB, 34 views)
File Type: png kmer_profiles.png (137.8 KB, 57 views)

Last edited by SS Santos; 04-09-2013 at 05:33 AM. Reason: thumbnails not working
Old 04-09-2013, 05:26 AM   #2
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default adaptors, how to know if it's good

Hi Sandra,

Yes, adapters are supposed to be at the 3' end of the reads, but if your insert is very short you will read into the adapter sequence sooner, so the adapter sequence can start somewhere in the middle of the read.

Whether or not you should do more trimming depends on what you are doing with your data. If you are doing de novo assembly then it helps to remove as much of the adapters as possible.

If you know how to use Linux, then you can use 'grep' to check if certain sequences are present in your reads before and after trimming.

Trimmomatic will trim reads from the 5' ends of Illumina reads based on base quality scores. It will also remove adapters, but you do need to have a file with the adapter sequences. I think the latest version of Trimmomatic includes a file with adapter sequences.


Best wishes,
Maria
Old 04-09-2013, 05:38 AM   #3
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

Hi Maria, that was fast, I was still adding the images!

Yes, I've used Trimmomatic, in fact I feel like all the options (Fastx, seqtk, etc) generally give the same results. For example, one of the de novo assemblers I tested (Edena) also includes an option to truncate sequence length. I just wanted to be sure, looking at my reports (these are before and after filtering but not trimming of the last 10 bases), if everything is ok. How can I know, from looking at the reports? Should I have completely straight lines for the per base content, etc, including for the first bases?

Thanks
Old 04-09-2013, 05:53 AM   #4
FWOS
Epigenomics NGS Beast
 
Location: New Jersey

Join Date: Oct 2010
Posts: 17
Default Library Type?

Hi Santos,

What kind of prep was done on these libraries? If the initial sequences are not diverse, you can see a wavy pattern in the first few bases. This happens with RNA seq libraries and can also occur in ChIP Seq, etc...

~FWOS
Old 04-09-2013, 06:02 AM   #5
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

Hi

This is the method I received from the sequencing company.

We used a whole-genome shotgun sequencing strategy and Illumina Genome Analyser sequencing technology. A 100 bp paired-end run was performed with the strains described here in one lane. Genomic DNA was sheared by a nebulizer to generate DNA fragments for the Illumina Paired-End Sequencing method. DNA libraries (20 ng/μl) were constructed by ligating the specific oligonucleotides (Illumina adapters) designed for PE sequencing to both ends of DNA fragments with the TA cloning method. The ligated DNA was then size selected on a 2% agarose gel. DNA fragments of ~500 bp were excised from the preparative portion of the gel. DNA was then recovered using a Qiagen gel extraction kit and was PCR amplified to produce the final DNA library. Five picomoles of DNA from each strain were loaded onto two lanes of the sequencing chip, and the clusters were generated on the cluster generation station of the GAIIx using the Illumina cluster generation kit. Bacteriophage ΦX174 DNA was used as a control. In the case of paired-end reads, distinct adaptors from Illumina were ligated to each end with PCR primers that allowed reading of each end as separate runs. The sequencing reaction was run for 100 cycles (tagging, imaging, and cleavage of one terminal base at a time), and four images of each tile on the chip were taken in different wavelengths for exciting each base-specific fluorophore. For paired-end reads, data were collected as two sets of matched 100-bp reads. Reads for each of the indexed samples were then separated using a custom Perl script. Image analysis and base calling were done using the Illumina GA Pipeline software.
Old 04-09-2013, 06:23 AM   #6
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

Hi Sandra,

The method looks like a pretty standard Illumina protocol.

Get the company that did the sequencing to tell you what version of the Illumina kit was used for the sample prep and/or tell you what adapter sequences they used.

Your QC images show that you have a very high %GC, is that what you expect for the species that you are sequencing?

The before and after images of per-base quality show an improvement in quality after filtering, but I think you could still have adapter sequences present, because they wouldn't necessarily affect the quality, or be present at the same place in the reads, although you do expect them more towards the 3' end. What filtering steps did you do?
Old 04-09-2013, 06:43 AM   #7
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

Hi Mastal

GC content should be 67%. I used the IlluQC tool for paired-end Illumina with standard parameters (Phred cut-off 20, cut-off for % of read length with that quality 70%). I had previously used Quake to correct technical errors, but the developer of the assembler I was testing at the time advised me not to, because it can modify some reads. The input of IlluQC includes a primer/adaptor library, but I didn't have one and it runs without it. The "after" report is after filtering only. I removed the last 10 bases before assembling.

I'm going to ask the company for the adaptor sequences. Is there any way or tool that can be used to check if the adaptors are still present? In the first report, do those peaks in the kmer profiles correspond to adaptors?
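
For reference, the filtering criterion described above (at least 70% of the bases in a read with Phred quality 20 or more) can be sketched roughly as below. This is not IlluQC's actual code: the function name is made up for illustration, and the quality offset of 64 is an assumption (older Illumina pipelines wrote Phred+64, newer ones Phred+33).

```python
def passes_quality_filter(quality: str, cutoff: int = 20,
                          min_fraction: float = 0.70, offset: int = 64) -> bool:
    """Return True if at least `min_fraction` of the bases have Phred >= `cutoff`.

    `quality` is the quality string of one FASTQ record; `offset` is the
    ASCII offset of the encoding (assumed 64 here, use 33 for Sanger/newer data).
    """
    if not quality:
        return False
    good = sum(1 for ch in quality if ord(ch) - offset >= cutoff)
    return good / len(quality) >= min_fraction


# 'h' encodes Phred 40 and 'B' encodes Phred 2 when the offset is 64.
print(passes_quality_filter("hhhhhhhhBB"))  # 8 of 10 bases pass the cut-off
print(passes_quality_filter("hhhhhhBBBB"))  # only 6 of 10 bases pass
```

Note that a read can pass this kind of filter and still carry adapter sequence, because adapter bases are often of perfectly good quality.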

Thanks
Old 04-09-2013, 07:18 AM   #8
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default Removing primers, adaptors, how to know if it's good?

Quote:
Originally Posted by SS Santos

Is there any way or tool that can be used to check if the adaptors are still present?
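From a Linux command line:
grep -c 'adapter_sequence' reads.fastq

-c prints the number of lines (here, reads) that contain 'adapter_sequence', rather than the lines themselves.

grep --no-group-separator -B1 -A2 'adapter_sequence' reads.fastq > reads_with_adapters.fastq

will give you the 4 lines of FASTQ (header, sequence, '+' line, qualities) for each read matching the adapter. -B1 and -A2 add one line before and two lines after each match, and --no-group-separator (GNU grep) stops grep printing '--' between groups. Don't add -n if you want valid FASTQ output, because it prefixes every line with its line number.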
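
If the grep route gets unwieldy, the same check can be sketched in Python. This is only an illustration: it assumes plain 4-line FASTQ records (no wrapped sequences), the function names are made up for this example, and it looks only at the sequence line of each record so that header or quality lines can never produce a hit.

```python
from typing import Iterable, Iterator


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the 2nd line of every record holds the bases
            yield line.strip()


def count_reads_with(lines: Iterable[str], query: str) -> int:
    """Count reads whose sequence contains `query` at least once."""
    return sum(1 for seq in fastq_sequences(lines) if query in seq)


# Tiny made-up example. For a real file, pass an open file handle instead:
#   with open("reads.fastq") as fh:
#       print(count_reads_with(fh, "AGATCGGAAGAGC"))
example = [
    "@read1", "ACGTAGATCGGAAGAGC", "+", "IIIIIIIIIIIIIIIII",
    "@read2", "ACGTACGTACGTACGTA", "+", "IIIIIIIIIIIIIIIII",
]
print(count_reads_with(example, "AGATCGGAAGAGC"))  # prints 1
```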
Old 04-09-2013, 08:24 AM   #9
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

I got this reply from the company, when I asked for the adapter sequences. Not exactly what I was expecting! Does it mean that the adaptors are standard or something??



We used the Illumina sequencing method to determine the genome sequences of your bacterial strains.
The Solexa/Illumina sequencing method is similar to Sanger sequencing, but it uses modified dNTPs containing a terminator which blocks further polymerization- so only a single base can be added by a polymerase enzyme to each growing DNA copy strand. The sequencing reaction is conducted simultaneously on a very large number (many millions in fact) of different template molecules spread out on a solid surface. The terminator also contains a fluorescent label, which can be detected by a camera. Only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called “reversible terminators”. Finally, another four cycles of dNTP additions are initiated. Since single bases are added to all templates in a uniform fashion, the sequencing process produces a set of DNA sequence reads of uniform length.
Chemistry for Next-Generation Sequencing
Illumina’s sequencing by synthesis (SBS) technology is the most successful and widely-adopted next-generation sequencing platform worldwide. TruSeq technology supports massively parallel sequencing using a proprietary reversible terminator-based method that enables detection of single bases as they are incorporated into growing DNA strands. A fluorescently-labeled terminator is imaged as each dNTP is added and then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias. The end result is true base-by-base sequencing that enables the industry’s most accurate data for a broad range of applications.
Old 04-09-2013, 09:06 AM   #10
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

The adapters are standard, but Illumina does change them from time to time, so it would be useful for them to tell you the name and version number of the kit they used, and also the sequences or Illumina codes of the barcodes they used with your samples.

As to your previous question about the kmer over-representation, I'm afraid I don't really understand the significance of the kmer plots in FastQC.
Old 04-09-2013, 09:21 AM   #11
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

Quote:
Originally Posted by SS Santos


Only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called “reversible terminators”. Finally, another four cycles of dNTP additions are initiated.

By the way, that bit is wrong, with the Illumina technology all 4 bases are added in each cycle, but each base is labelled with a different fluorescent dye.
Old 04-10-2013, 08:02 AM   #12
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

So they sent me this:

Adapters sequence:
5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

sample barcode sequence
IST4113 TAGCTT
IST4129 AGTTCC
IST4134 CTTGTA
IST439 AGTCAA

Do I create a text file with this? How can I use it as input for trimming/filtering tools?

Thanks
Old 04-10-2013, 08:27 AM   #13
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

OK, those look like the Illumina TruSeq adapters.

The latest version of trimmomatic comes with a file containing those adapter sequences, so it should work fine with your files in the ILLUMINACLIP step.

To know whether things are improved before and after trimming, you should try and find how many times the adapters are present in your reads. Normally one adapter sequence is present in one of the read files, and the reverse complement of the other adapter is present in the file with the other reads of the pair.

Have a look at this web page from the U. of Texas at Austin, to have more of an idea how the Illumina adapters appear at the ends of the reads:

https://wikis.utexas.edu/display/GSA...+-+all+flavors

To count how many times the adapters are present in your file:

$ grep -c 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT' reads.fastq

You may also want to try this with a substring of the adapter sequence, as not all the reads will end up reading into the full adapter sequence.
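
A rough Python sketch of the substring idea: count how many reads contain the first 33, 20 or 12 bases of the adapter. The function name, the prefix lengths and the three example reads are made up for illustration, and the sketch assumes the adapter is given in the orientation in which it actually appears in the reads.

```python
def prefix_hits(sequences, adapter, lengths=(33, 20, 12)):
    """Return {prefix_length: number of reads containing that adapter prefix}.

    `sequences` must be a list (it is scanned once per prefix length).
    """
    hits = {}
    for n in lengths:
        prefix = adapter[:n]
        hits[n] = sum(1 for seq in sequences if prefix in seq)
    return hits


adapter = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
reads = [
    "GGGG" + adapter,        # full adapter present
    "TTTT" + adapter[:15],   # read stops 15 bases into the adapter
    "CCCCCCCCCCCC",          # no adapter at all
]
print(prefix_hits(reads, adapter))  # prints {33: 1, 20: 1, 12: 2}
```

Shorter prefixes catch more partial adapters, but they also start to match genuine genomic sequence by chance, so the counts for very short substrings should be read with caution.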

Hope this helps,
Maria
Old 05-08-2013, 05:26 AM   #14
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

Hi Maria

I finally got back to this. I used the grep -c command on my reads and it worked fine. Just a couple of really basic questions, if you can help me...

What's the difference between these 2 (the webpage you recommended is down)? Can I use just one of them to look for adapters? What does the P- mean?
5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

I also used the command to look for barcode sequences, and there were 9179369 matches in the raw data, and 5687733 in the filtered and cropped (10 bases from 3' ends) reads! This is still a lot, right? Where are the barcodes in the reads? Near the ends? Can they be removed during trimming if I use their sequences?

Thanks again

Sandra
Old 05-08-2013, 06:13 AM   #15
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default Removing primers, adaptors, how to know if it's good?

Hi Sandra,

The P stands for phosphate, it means there is a phosphate group at the 5' end of the adapter, but this will not appear in any of the sequence files.

The difference between the two sequences is that, if you have paired-end reads, one of the sequences (or its reverse complement) will appear towards the ends of R1 when your DNA insert is too short and you read into the adapters, and the other sequence (or its reverse complement) will appear in R2.

You can use grep as before to check which sequence appears in R1 or R2.
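
To search for the reverse complement as well, you first need to compute it. A minimal Python sketch (the function name is just for illustration):

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


adapter = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
print(reverse_complement(adapter))
```

The printed sequence can then be pasted into the same grep -c command, once against the R1 file and once against the R2 file, to see which orientation turns up where.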

Trimmomatic should remove the barcode sequences because usually you have something like this:

5' read_sequence/adapter/barcode/adapter/flowcell_sequences 3'

and the shorter your DNA insert, the more of the various adapter sequences you get at the 3' end of your read.

Trimmomatic usually looks for a good match with the adapter sequence that would be immediately adjacent to your DNA insert, and clips the read there:

5' read_sequence/

so that all the downstream stuff should be removed.
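
To make the clipping idea concrete, here is a deliberately simplified Python sketch. It is not Trimmomatic's algorithm (which tolerates mismatches and has a dedicated paired-end mode); it only does exact matching, and the function name and the min_overlap default are assumptions for illustration.

```python
def clip_at_adapter(read: str, adapter: str, min_overlap: int = 8) -> str:
    """Cut `read` where `adapter` starts, dropping everything downstream.

    Also handles a read that ends part-way through the adapter, provided at
    least `min_overlap` adapter bases are present at the 3' end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]  # full adapter found: keep only the insert
    # Partial adapter: the read ends with the first k bases of the adapter.
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read  # no adapter detected, read is returned unchanged


adapter = "AGATCGGAAGAGC"
print(clip_at_adapter("ACGTACGT" + adapter + "TTTT", adapter))  # prints ACGTACGT
print(clip_at_adapter("ACGTACGT" + adapter[:9], adapter))       # prints ACGTACGT
```

Because everything from the start of the adapter onwards is discarded, any barcode or flowcell sequence further downstream goes with it.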
Old 05-08-2013, 09:48 AM   #16
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

Hi, you've been so helpful, thanks.

So, the adapter thing was what I suspected, and I think it's sorted. However, the barcodes are an unexpected problem. When I test the adapter sequences, even when using substrings, there are none in the trimmed reads. But there are thousands of barcodes (5687733).

The command line for trimmomatic is a mess, but here's what I used:

java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE readsR1.fastq readsR2.fastq forward_paired.fastq forward_unpaired.fastq reverse_paired.fastq reverse_unpaired.fastq CROP:90 MINLEN:90

I didn't use ILLUMINACLIP because I first used IlluQC from NGSToolkit, which was supposed to get rid of that. From what I understand, I need to run trimmomatic again with other settings, but I'm worried about something. Apparently, cutting the last 10 bases is not enough for 5687733 reads. So, if I want to keep the read length at 90, I'll lose all of these reads after it cuts off the barcodes. Maybe I should use a lower minimum length? It's a bit of a dilemma, because my coverage isn't that great...
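
The dilemma can be illustrated with a small Python sketch that mimics a crop step followed by a minimum-length filter. It is not Trimmomatic code; the function name is made up and the reads are dummy strings.

```python
def crop_then_minlen(read: str, crop: int, minlen: int):
    """Keep the first `crop` bases, then drop the read if shorter than `minlen`.

    Returns the cropped read, or None when the read is discarded.
    """
    cropped = read[:crop]
    return cropped if len(cropped) >= minlen else None


full_read = "A" * 100     # untouched 100-base read
clipped_read = "A" * 85   # read already shortened to 85 bases by adapter clipping

print(crop_then_minlen(full_read, 90, 90) is not None)     # kept at 90 bases
print(crop_then_minlen(clipped_read, 90, 90) is not None)  # discarded by the length filter
print(crop_then_minlen(clipped_read, 90, 50) is not None)  # kept with a lower minimum
```

With the minimum length equal to the crop length, any read that loses even one base to clipping is thrown away, so a lower minimum length keeps more of the coverage.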

(Here's a really basic question: how do I create a fasta file with the adapter and barcode sequences to use with ILLUMINACLIP? Or is there a file I can download with TruSeq adapters and barcodes?)
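
A FASTA file is just a text file in which each sequence is preceded by a line starting with ">" and a name. A minimal Python sketch that builds one from the two adapter sequences posted earlier in the thread; the record names and the output file name are placeholders, and any naming convention a particular trimming tool expects should be checked in that tool's manual.

```python
def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


adapters = [
    ("adapter_1", "GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"),
    ("adapter_2", "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"),
]
fasta_text = to_fasta(adapters)
print(fasta_text, end="")

# To save it to disk (hypothetical file name):
#   with open("adapters.fa", "w") as out:
#       out.write(fasta_text)
```

The "5'" and "P-" annotations are left out because they are not part of the sequence itself.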
Thanks

Sandra
Old 05-08-2013, 10:00 AM   #17
SS Santos
Member
 
Location: Lisbon

Join Date: Dec 2012
Posts: 12
Default

I'm sorry, I'd forgotten your reply (it was a long time ago!):

"The latest version of trimmomatic comes with a file containing those adapter sequences, so it should work fine with your files in the ILLUMINACLIP step."

So, I just need to call the TruSeq2-PE or TruSeq3-PE file in the command line? I understand they depend on the machine used, so I'll try to find out which one is better.

Thanks
Old 05-08-2013, 10:11 AM   #18
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default Removing primers, adaptors, how to know if it's good?

The TruSeq2 and 3 are different versions of Illumina sample prep kits.
The adapter sequences that appear in your reads will be the reverse complement of the sequences in the fasta file.

For de novo assembly it is better to clean the reads. Is there a particular reason why you need all your reads to remain the same length?
Old 05-08-2013, 12:39 PM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080
Default

Quote:
Originally Posted by SS Santos
Hi, you've been so helpful, thanks.

So, the adapter thing was what I suspected, and I think it's sorted. However, the barcodes are an unexpected problem. When I test the adapter sequences, even when using substrings, there are none in the trimmed reads. But there are thousands of barcodes (5687733).

Thanks

Sandra
Be careful about using the terms "adapter" and "barcode" interchangeably.

Barcodes (in a multiplexed sample) are used to identify individual samples in a mixture. Generally a sequence provider will de-multiplex your samples (if they were indeed multiplexed). As a part of the de-multiplexing process the "barcodes" are identified/sorted and inserted into the sequence ID by the Illumina pipeline software. They are also "read" as a separate read in Illumina sequencing.

An example barcode (after de-multiplexing has been done) is the "ATCACG" after the "#" in the header below (taken from the Wikipedia article on the FASTQ format).

Quote:
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB
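
To check which barcode the pipeline assigned to each read, the index can be pulled out of the header line. A small Python sketch, assuming the older Illumina header layout shown in the example above (name, "#", index, "/", mate number); newer pipelines format the header differently, and the function name is made up for illustration.

```python
import re

# Old-style Illumina header: @<name>#<index>/<mate>
_HEADER = re.compile(r"^@(?P<name>[^#\s]+)#(?P<index>[ACGTN0]+)/(?P<mate>[12])$")


def index_from_header(header: str):
    """Return (index_sequence, mate_number), or None if the header does not match."""
    m = _HEADER.match(header.strip())
    if m is None:
        return None
    return m.group("index"), int(m.group("mate"))


print(index_from_header("@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1"))
# prints ('ATCACG', 1)
```

If the barcode shows up here, it was read separately as an index read and is not expected inside the read sequence itself; short 6-base matches found by grep within the reads may simply be genomic sequence.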
Old 05-08-2013, 12:44 PM   #20
snorberg
Junior Member
 
Location: San Diego, CA

Join Date: Jan 2013
Posts: 5
Default

Hey,

Just reading through this thread. A few have suggested that you might be reading into the index/adapter sequence at the 3' end of the library. Because you have performed size selection (500 bp) and a 2x101 bp run, it would be unlikely to have read into the adapter sequence if your size selection was accurate. Your library insert size should be approx. 370 bp (500 bp - 130 bp of adapter sequence).
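
The arithmetic behind that estimate, as a small Python sketch. The 130 bp total adapter length is the figure assumed in this post, and the function name is made up for illustration.

```python
def adapter_bases_in_read(insert_size: int, read_length: int) -> int:
    """Number of bases at the 3' end of a read that run past the insert (0 if none)."""
    return max(0, read_length - insert_size)


fragment = 500        # gel-excised fragment size, adapters included
adapter_total = 130   # combined adapter length assumed above
insert = fragment - adapter_total

print(insert)                              # prints 370
print(adapter_bases_in_read(insert, 101))  # prints 0, a 101-base read stays inside the insert
print(adapter_bases_in_read(80, 101))      # prints 21, an 80 bp insert is read through
```

A read only reaches the adapter when the insert is shorter than the read length, so widespread adapter read-through would point to many fragments well below the intended size.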

Have you determined the mean insert size of your library? That's my 2 cents.