#1
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

Hi all

When I received my reads they had a significant overrepresentation of 5-base sequences (listed by FastQC but not described as adaptors). I then used NGS Toolkit and IlluQC to filter the reads (IlluQC is supposed to remove adaptors even when a library isn't available). My FastQC reports improved a lot, but I still have some kmer overrepresentation and there's a somewhat "wavy behaviour" in the first few bases. Anyway, I trimmed the reads (10 bases from the 3' end) and assembled them.

So, my questions are: should I also have trimmed the reads from the 5' end? Looking at the images, how can I tell if I still have contamination? My assembly wasn't fantastic, but the coverage is relatively low, so I don't know if it's the best I can get with these reads.

And a truly silly question: aren't the adaptors supposed to be at the ends of the reads? I'm now starting to think that they might be in the middle also, but in that case they can't be removed by simply trimming/clipping the ends.

Thanks a lot
Sandra

Last edited by SS Santos; 04-09-2013 at 05:33 AM. Reason: thumbnails not working

#2
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

Hi Sandra,

Yes, adapters are supposed to be at the 3' end of the reads, but if your insert is very short you will read into the adapter sequence sooner, so you can get adapter sequences somewhere in the middle of the read.

Whether or not you should do more trimming depends on what you are doing with your data. If you are doing de novo assembly then it helps to remove as much of the adapters as possible.

If you know how to use Linux, then you can use 'grep' to check if certain sequences are present in your reads before and after trimming.

Trimmomatic will trim reads from the 5' ends of Illumina reads based on base quality scores. It will also remove adapters, but you do need to have a file with the adapter sequences. I think the latest version of Trimmomatic includes a file with adapter sequences.

Best wishes,
Maria

#3
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

Hi Maria, that was fast, I was still adding the images!

Yes, I've used Trimmomatic, in fact I feel like all the options (Fastx, seqtk, etc) generally give the same results. For example, one of the de novo assemblers I tested (Edena) also includes an option to truncate sequence length.

I just wanted to be sure, looking at my reports (these are before and after filtering but not trimming of the last 10 bases), if everything is ok. How can I know, from looking at the reports? Should I have completely straight lines for the per base content, etc, including for the first bases?

Thanks

#4
Epigenomics NGS Beast
Location: New Jersey Join Date: Oct 2010
Posts: 17

Hi Santos,

What kind of prep was done on these libraries? If the initial sequences are not diverse, you can see a wavy pattern in the first few bases. This happens with RNA-seq libraries and can also occur in ChIP-seq, etc.

~FWOS
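
The "wavy" pattern is just the base composition differing from cycle to cycle, which can be tabulated directly from the FASTQ file. The following is a minimal sketch, assuming plain 4-line FASTQ records; the demo reads and the choice of 3 positions are made up for illustration and are not from Sandra's data.

```shell
#!/usr/bin/env bash
# Sketch: per-position base composition for the first few cycles of a FASTQ file.
# The demo reads and n=3 are illustrative assumptions only.
set -euo pipefail

workdir=$(mktemp -d)
cat > "$workdir/reads.fastq" <<'EOF'
@r1
ACGTACGT
+
IIIIIIII
@r2
ACGTTTTT
+
IIIIIIII
@r3
AGGTCCCC
+
IIIIIIII
@r4
TCGAGGGG
+
IIIIIIII
EOF

composition=$(awk -v n=3 '
    BEGIN { split("A C G T", bases, " ") }
    NR % 4 == 2 {                       # sequence line of each 4-line record
        for (i = 1; i <= n && i <= length($0); i++) {
            counts[i, substr($0, i, 1)]++
            total[i]++
        }
    }
    END {
        for (i = 1; i <= n; i++) {
            if (total[i] == 0) continue # no read was long enough for this position
            printf "%d", i
            for (b = 1; b <= 4; b++)
                printf " %s=%.0f%%", bases[b], 100 * counts[i, bases[b]] / total[i]
            printf "\n"
        }
    }
' "$workdir/reads.fastq")

echo "$composition"
```

On real data, positions where one base dominates point to low sequence diversity at that cycle. Whether that reflects the library prep or leftover adapter is a separate question that the composition alone does not answer.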

#5
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

Hi

This is the method I received from the sequencing company:

We used a whole-genome shotgun sequencing strategy and Illumina Genome Analyser sequencing technology. A 100 bp paired-end run was performed with the strains described here in one lane. Genomic DNA was sheared by a nebulizer to generate DNA fragments for the Illumina Paired-End Sequencing method. DNA libraries (20 ng/μl) were constructed by ligating the specific oligonucleotides (Illumina adapters) designed for PE sequencing to both ends of DNA fragments with the TA cloning method. The ligated DNA was then size selected on a 2% agarose gel. DNA fragments of ~500 bp were excised from the preparative portion of the gel. DNA was then recovered using a Qiagen gel extraction kit and was PCR amplified to produce the final DNA library.

Five picomoles of DNA from each strain were loaded onto two lanes of the sequencing chip, and the clusters were generated on the cluster generation station of the GAIIx using the Illumina cluster generation kit. Bacteriophage φX174 DNA was used as a control. In the case of paired-end reads, distinct adaptors from Illumina were ligated to each end with PCR primers that allowed reading of each end as separate runs.

The sequencing reaction was run for 100 cycles (tagging, imaging, and cleavage of one terminal base at a time), and four images of each tile on the chip were taken in different wavelengths for exciting each base-specific fluorophore. For paired-end reads, data were collected as two sets of matched 100-bp reads. Reads for each of the indexed samples were then separated using a custom Perl script. Image analysis and base calling were done using the Illumina GA Pipeline software.

#6
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

Hi Sandra,

The method looks like a pretty standard Illumina protocol. Get the company that did the sequencing to tell you what version of Illumina kit was used for the sample prep and/or tell you what adapter sequences they used.

Your QC images show that you have a very high %GC, is that what you expect for the species that you are sequencing?

The before and after images of per-base quality show an improvement in quality after filtering, but I think you could still have adapter sequences present, because they wouldn't necessarily affect the quality, or be present at the same place in the reads, although you do expect them more towards the 3' end.

What filtering steps did you do?

#7
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

Hi Mastal

GC content should be 67%.

I used the IlluQC tool for paired-end Illumina with standard parameters (Phred cut-off 20, cut-off for % of read length with that quality 70%). I had previously used Quake to correct technical errors, but the developer of the assembler I was testing at the time advised me not to, because it can modify some reads. The input of IlluQC includes a primer/adaptor library, but I didn't have one and it runs without it. The "after" report is after filtering only. I removed the last 10 bases before assembling.

I'm going to ask the company for the adaptor sequences. Is there any way or tool that can be used to check if the adaptors are still present? In the first report, do those peaks in the kmer profiles correspond to adaptors?

Thanks

#8
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

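To check whether the adapters are still present, you can use grep:

grep -c 'adapter_sequence' reads.fastq

-c tells you how many lines of the reads file contain 'adapter_sequence'.

grep -B1 -A2 'adapter_sequence' reads.fastq > reads_with_adapters.fastq

will give you the 4 lines of FASTQ for each read matching the adapter: the header line before the sequence (-B1) and the '+' and quality lines after it (-A2). With GNU grep, non-adjacent matches are separated by a '--' line, which you need to remove (or suppress with --no-group-separator) if you want a valid FASTQ file.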
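
Because grep works on lines rather than on FASTQ records, a record-aware alternative may be easier to reason about. The following is a minimal sketch, assuming plain 4-line records with no wrapped sequence lines; the demo reads, file names and the 12-base `ADAPTER` value (the start of the first adapter quoted later in this thread) are illustrative choices only.

```shell
#!/usr/bin/env bash
# Sketch: pull whole FASTQ records whose *sequence* line contains an adapter.
# Demo data, file names and ADAPTER are illustrative assumptions.
set -euo pipefail

workdir=$(mktemp -d)
cat > "$workdir/reads.fastq" <<'EOF'
@read1
ACGTACGTAGATCGGAAGAGC
+
IIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTA
+
IIIIIIIIIIIIIIIIIIIII
@read3
TTTTAGATCGGAAGAGCTCGT
+
IIIIIIIIIIIIIIIIIIIII
EOF

ADAPTER='GATCGGAAGAGC'

# Walk the file four lines at a time; only line 2 of each record is searched,
# so header and quality lines can never produce a false match.
awk -v adapter="$ADAPTER" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { bases  = $0 }
    NR % 4 == 3 { plus   = $0 }
    NR % 4 == 0 { if (index(bases, adapter) > 0) print header "\n" bases "\n" plus "\n" $0 }
' "$workdir/reads.fastq" > "$workdir/reads_with_adapters.fastq"

# Number of matching reads = output lines / 4.
matching_reads=$(( $(wc -l < "$workdir/reads_with_adapters.fastq") / 4 ))
echo "reads containing adapter: $matching_reads"
```

The output file is itself valid FASTQ, so it can be passed straight to other tools or inspected to see where in the read the adapter sits.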

#9
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

I got this reply from the company when I asked for the adapter sequences. Not exactly what I was expecting! Does it mean that the adaptors are standard or something??

We used Illumina sequencing method to determine the genome sequences of your bacterial strains. The Solexa/Illumina sequencing method is similar to Sanger sequencing, but it uses modified dNTPs containing a terminator which blocks further polymerization, so only a single base can be added by a polymerase enzyme to each growing DNA copy strand. The sequencing reaction is conducted simultaneously on a very large number (many millions in fact) of different template molecules spread out on a solid surface. The terminator also contains a fluorescent label, which can be detected by a camera. Only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called “reversible terminators”. Finally, another four cycles of dNTP additions are initiated. Since single bases are added to all templates in a uniform fashion, the sequencing process produces a set of DNA sequence reads of uniform length.

Chemistry for Next-Generation Sequencing

Illumina’s sequencing by synthesis (SBS) technology is the most successful and widely-adopted next-generation sequencing platform worldwide. TruSeq technology supports massively parallel sequencing using a proprietary reversible terminator-based method that enables detection of single bases as they are incorporated into growing DNA strands. A fluorescently-labeled terminator is imaged as each dNTP is added and then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias. The end result is true base-by-base sequencing that enables the industry’s most accurate data for a broad range of applications.

#10
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

The adapters are standard, but Illumina does change them from time to time, so it would be useful for them to tell you the name of the kit they used and the version number, and also the sequences or Illumina codes of the barcodes they used with your samples.

As to your previous question about the kmer over-representation, I'm afraid I don't really understand the significance of the kmer plots in FastQC.

#11
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

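By the way, the bit of the company's reply saying that only a single fluorescent colour is used, and that each of the four bases is added in a separate cycle, is wrong. With the Illumina technology all 4 bases are added in each cycle, but each base is labelled with a different fluorescent dye.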

#12
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

So they sent me this:

Adapter sequences:
5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Sample | Barcode sequence |
IST4113 | TAGCTT |
IST4129 | AGTTCC |
IST4134 | CTTGTA |
IST439 | AGTCAA |

Do I create a text file with this? How can I use it as an input for trimming/filtering tools?

Thanks

#13
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

OK, those look like the Illumina TruSeq adapters.

The latest version of Trimmomatic comes with a file containing those adapter sequences, so it should work fine with your files in the ILLUMINACLIP step.

To know whether things have improved after trimming, you should try to find how many times the adapters are present in your reads before and after. Normally one adapter sequence is present in one of the read files, and the reverse complement of the other adapter is present in the file with the other reads of the pair.

Have a look at this web page from the U. of Texas at Austin, to have more of an idea how the Illumina adapters appear at the ends of the reads:
https://wikis.utexas.edu/display/GSA...+-+all+flavors

To count how many times the adapters are present in your file:

$ grep -c 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT' reads.fastq

You may also want to try this with a substring of the adapter sequence, as not all the reads will end up reading into the full adapter sequence.

Hope this helps,
Maria
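
The counting advice above (full adapter, a substring, and the reverse complement) can be put together in one small script. This is a minimal sketch, assuming bash plus the common `rev` and `tr` utilities; the helper functions, the 12-base seed length and the three demo reads are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count reads containing an adapter, its reverse complement, and a
# short seed of the reverse complement. Helpers and demo reads are hypothetical.
set -euo pipefail

workdir=$(mktemp -d)
ADAPTER='ACACTCTTTCCCTACACGACGCTCTTCCGATCT'   # second adapter quoted in the thread

# Reverse complement: reverse the string, then swap A<->T and C<->G.
revcomp() { printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'; }

# Emit one FASTQ record with a dummy quality string of matching length.
make_record() { printf '@%s\n%s\n+\n%s\n' "$1" "$2" "$(printf '%*s' "${#2}" '' | tr ' ' 'I')"; }

# grep -c exits non-zero when the count is 0, hence the "|| true".
count_reads() { grep -c -F -- "$1" "$2" || true; }

ADAPTER_RC=$(revcomp "$ADAPTER")
SEED=${ADAPTER_RC:0:12}                        # first 12 bases only

{
    make_record full  "GGCCTTAA${ADAPTER_RC}"
    make_record short "GGCCTTAAGGCCTTAAGGCCTTAAGGCCT${SEED}"
    make_record clean "GGCCTTAAGGCCTTAAGGCCTTAAGGCCTTAAGGCCTTAAG"
} > "$workdir/reads.fastq"

n_forward=$(count_reads "$ADAPTER"    "$workdir/reads.fastq")
n_revcomp=$(count_reads "$ADAPTER_RC" "$workdir/reads.fastq")
n_seed=$(count_reads "$SEED"          "$workdir/reads.fastq")

echo "adapter as given:       $n_forward"
echo "reverse complement:     $n_revcomp"
echo "12-base seed of revcomp: $n_seed"
```

In this toy file the seed finds a read that the full-length search misses, which is the reason for trying a substring. Running the same counts on the raw and the trimmed files of each read in the pair gives the before/after comparison.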

#14
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

Hi Maria

I finally got back to this. I used the grep -c command on my reads and it worked fine. Just a couple of really basic questions, if you can help me...

What's the difference between these 2 (the webpage you recommended is down)? Can I use just one of them to look for adapters? What does the P- mean?

5' P-GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

I also used the command to look for barcode sequences, and there were 9179369 matches in the raw data, and 5687733 in the filtered and cropped (10 bases from 3' ends) reads! This is still a lot, right? Where are the barcodes in the reads? Near the ends? Can they be removed during trimming if I use their sequences?

Thanks again
Sandra

#15
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

Hi Sandra,

The P stands for phosphate: it means there is a phosphate group at the 5' end of the adapter, but this will not appear in any of the sequence files.

The difference between the two sequences is that, if you have paired-end reads, one of the sequences, or its reverse complement, will appear towards the ends of R1 when your DNA insert is too short and you read into the adapters, and the other sequence, or its reverse complement, will appear in R2. You can use grep as before to check which sequence appears in R1 or R2.

Trimmomatic should remove the barcode sequences because usually you have something like this:

5' read_sequence/adapter/barcode/adapter/flowcell_sequences 3'

and the shorter your DNA insert, the more of the various adapter sequences you get at the 3' end of your read. Trimmomatic usually looks for a good match with the adapter sequence that would be immediately adjacent to your DNA insert, and clips the read there,

5' read_sequence/

so that all the downstream stuff should be removed.
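
To see how far into each read the insert extends before the adapter begins, the start position of the adapter can be tabulated. This is a minimal sketch, assuming 4-line FASTQ records; the 12-base `ADAPTER_SEED` is simply the start of the first adapter quoted in this thread, and whether it (or a reverse complement) is the right string for a given read file is something to confirm with grep first. The demo reads and helper functions are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: histogram of how many insert bases precede the adapter in each read.
# ADAPTER_SEED, the helpers and the demo reads are illustrative assumptions.
set -euo pipefail

workdir=$(mktemp -d)
ADAPTER_SEED='GATCGGAAGAGC'

repeat() { printf '%*s' "$2" '' | tr ' ' "$1"; }          # repeat CHAR N
make_record() { printf '@%s\n%s\n+\n%s\n' "$1" "$2" "$(repeat I "${#2}")"; }

{
    make_record r1 "$(repeat C 20)${ADAPTER_SEED}"       # 20-base insert
    make_record r2 "$(repeat C 20)${ADAPTER_SEED}"       # 20-base insert
    make_record r3 "$(repeat C 25)${ADAPTER_SEED}"       # 25-base insert
    make_record r4 "$(repeat C 32)"                      # no adapter at all
} > "$workdir/reads.fastq"

histogram=$(awk -v adapter="$ADAPTER_SEED" '
    NR % 4 == 2 {
        pos = index($0, adapter)          # 1-based start of the adapter, 0 if absent
        if (pos > 0) starts[pos - 1]++    # pos - 1 = insert bases before the adapter
    }
    END { for (p in starts) print p, starts[p] }
' "$workdir/reads.fastq" | sort -n)

echo "insert_bases reads"
echo "$histogram"
```

Reads without a match are not listed. A histogram concentrated at small values would suggest many short inserts, whereas matches scattered evenly along the read are more likely to be chance hits of a short search string.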

#16
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

Hi, you've been so helpful, thanks.

So, the adapter thing was what I suspected, and I think it's sorted. However, the barcodes are an unexpected problem. When I test the adapter sequences, even when using substrings, there are none in the trimmed reads. But there are millions of barcode matches (5687733).

The command line for Trimmomatic is a mess, but here's what I used:

java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE readsR1.fastq readsR2.fastq forward_paired.fastq forward_unpaired.fastq reverse_paired.fastq reverse_unpaired.fastq CROP:90 MINLEN:90

I didn't use ILLUMINACLIP because I first used IlluQC from NGSToolkit, which was supposed to get rid of that.

From what I understand, I need to run Trimmomatic again with other settings, but I'm worried about something. Apparently, cutting the last 10 bases is not enough for 5687733 reads. So, if I want to keep the read length of 90, I'll lose all of these reads after it cuts off the barcodes. Maybe I should use a lower minimum length? It's a bit of a dilemma, because my coverage isn't that great...

(Here's a really basic question: how do I create a fasta file with the adapter and barcode sequences to use with ILLUMINACLIP? Or is there a file I can download with TruSeq adapters and barcodes?)

Thanks
Sandra
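
On the FASTA question: an adapter file is just a text file with a `>name` line followed by the sequence for each entry. The following is a minimal sketch that writes one from the two adapters quoted in this thread and prints, without running, a Trimmomatic call that uses it. The entry names, the output path, the `2:30:10` ILLUMINACLIP settings (seed mismatches, palindrome clip threshold, simple clip threshold) and `MINLEN:50` are illustrative assumptions, not recommendations. The reverse complements are listed explicitly on the assumption that sequences are matched as written; check the Trimmomatic manual for the version in use.

```shell
#!/usr/bin/env bash
# Sketch: write an adapter FASTA file and print a Trimmomatic command using it.
# Names, thresholds and MINLEN are illustrative assumptions.
set -euo pipefail

workdir=$(mktemp -d)
adapter_fasta="$workdir/adapters.fa"

revcomp() { printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'; }

# The two adapter sequences quoted in the thread (without the 5' P- prefix).
adapters=(
    'GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG'
    'ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
)

: > "$adapter_fasta"
i=0
for adapter in "${adapters[@]}"; do
    i=$(( i + 1 ))
    printf '>adapter_%d\n%s\n'    "$i" "$adapter"             >> "$adapter_fasta"
    printf '>adapter_%d_rc\n%s\n' "$i" "$(revcomp "$adapter")" >> "$adapter_fasta"
done

# Same invocation style as the command in the post, with ILLUMINACLIP added
# and a lower MINLEN in place of CROP:90 MINLEN:90. Printed only, not executed.
trimmomatic_cmd=(
    java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE
    readsR1.fastq readsR2.fastq
    forward_paired.fastq forward_unpaired.fastq
    reverse_paired.fastq reverse_unpaired.fastq
    "ILLUMINACLIP:${adapter_fasta}:2:30:10"
    MINLEN:50
)

cat "$adapter_fasta"
printf '%s ' "${trimmomatic_cmd[@]}"; echo
```

Dropping the fixed CROP and lowering MINLEN lets reads that lose a few bases to adapter clipping survive, which is the trade-off raised in the post; the value of 50 here is arbitrary.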

#17
Member
Location: Lisbon Join Date: Dec 2012
Posts: 12

I'm sorry, I'd forgotten your reply (it was a long time ago!):

"The latest version of trimmomatic comes with a file containing those adapter sequences, so it should work fine with your files in the ILLUMINACLIP step."

So, I just need to call the TruSeq2-PE or TruSeq3-PE file in the command line? I understand they depend on the machine used, so I'll try to find out which one is better.

Thanks

#18
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667

TruSeq2 and TruSeq3 are different versions of the Illumina sample prep kits.

The adapter sequences that appear in your reads will be the reverse complement of the sequences in the fasta file.

For de novo assembly it is better to clean the reads. Is there a particular reason why you need all your reads to remain the same length?

#19
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080

Barcodes (in a multiplexed sample) are used to identify individual samples in a mixture. Generally a sequence provider will de-multiplex your samples (if they were indeed multiplexed). As a part of the de-multiplexing process the "barcodes" are identified/sorted and inserted into the sequence ID by the Illumina pipeline software. They are also "read" as a separate read in Illumina sequencing.

An example of a barcode after de-multiplexing can be seen in the Wikipedia article on the FASTQ format, where the index sequence is the last field of the sequence identifier line.
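
If the barcode was written into the sequence ID, it can be tallied from the header lines rather than searched for in the reads. This is a minimal sketch, assuming headers in the newer Illumina style where the index sequence is the last colon-separated field; the three demo records are made up, and files from a GAIIx-era pipeline split with a custom script may well use a different header layout.

```shell
#!/usr/bin/env bash
# Sketch: tally barcodes stored as the last ':'-separated field of FASTQ headers.
# The header layout and demo records are illustrative assumptions.
set -euo pipefail

workdir=$(mktemp -d)
cat > "$workdir/demultiplexed.fastq" <<'EOF'
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
ACGTACGT
+
IIIIIIII
@EAS139:136:FC706VJ:2:2104:15344:197394 1:N:18:ATCACG
TTGCATGC
+
IIIIIIII
@EAS139:136:FC706VJ:2:2104:15345:197395 1:N:18:TAGCTT
GGCATCGA
+
IIIIIIII
EOF

barcode_counts=$(awk '
    NR % 4 == 1 {                         # header line of each record
        n = split($0, fields, ":")
        counts[fields[n]]++               # last field = index sequence
    }
    END { for (b in counts) print b, counts[b] }
' "$workdir/demultiplexed.fastq" | LC_ALL=C sort)

echo "$barcode_counts"
```

A file that has been de-multiplexed correctly would be expected to show a single dominant barcode here. If the headers carry no barcode field at all, this approach does not apply.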

#20
Junior Member
Location: San Diego, CA Join Date: Jan 2013
Posts: 5

Hey,

Just reading through this thread. A few have suggested that you might be reading into the index/adapter sequence at the 3' end of the library. Because you performed size selection (500 bp) and a 2x101 bp run, it would be unlikely to have read into the adapter sequence if your size selection was accurate. Your library insert size should be approx. 370 bp (500 bp minus 130 bp of adapter sequence).

Have you determined the mean insert size of your library?

That's my 2 cents.
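
The arithmetic behind that estimate is short enough to write out. This is a minimal sketch using the figures from the post (500 bp fragments, 130 bp of adapter, 101-base reads); the 80 bp insert in the second call is a hypothetical value chosen only to show a case where read-through does occur.

```shell
#!/usr/bin/env bash
# Sketch: expected insert size and adapter read-through for a given read length.
# The 80 bp insert is a hypothetical example value.
set -euo pipefail

fragment_bp=500     # size-selected fragment, adapters included
adapter_bp=130      # total adapter length quoted in the post
read_len=101        # cycles per read

insert_bp=$(( fragment_bp - adapter_bp ))

# A read runs into adapter only when it is longer than the insert.
adapter_bases_read() {
    local insert=$1 len=$2
    local overhang=$(( len - insert ))
    echo $(( overhang > 0 ? overhang : 0 ))
}

expected=$(adapter_bases_read "$insert_bp" "$read_len")
short_case=$(adapter_bases_read 80 "$read_len")

echo "insert size: ${insert_bp} bp"
echo "adapter bases per read at ${insert_bp} bp insert: ${expected}"
echo "adapter bases per read at 80 bp insert: ${short_case}"
```

With a 370 bp insert no adapter bases are expected in a 101-base read, so any real adapter hits would point to fragments much shorter than the size selection intended.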