SEQanswers

Old 02-23-2012, 09:00 PM   #1
vinay052003
Member
 
Location: Atlanta, US

Join Date: Jan 2010
Posts: 59
Default paired-end adapter trimming

Hi all,
Sorry for posting twice in quick succession.
I have paired-end RNA-Seq data generated on Illumina GA and HiSeq instruments, which I obtained from SRA (NCBI).
I am not sure whether adapter trimming is standard practice (a must-do) before aligning these reads to the reference genome.
How would I know whether I need adapter trimming?

Thanks
Old 02-24-2012, 02:09 AM   #2
tir_al
Member
 
Location: Croatia

Join Date: Sep 2010
Posts: 22
Default

If you run FastQC, it will tell you whether any of the overrepresented sequences in the sample correspond to known Illumina adapters.
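As a rough cross-check alongside FastQC, you can count how many reads contain the start of the adapter read-through sequence. This is only a minimal sketch, assuming plain FASTQ with four lines per record. The 13-base seed AGATCGGAAGAGC is the shared start of the TruSeq read-through sequences quoted later in this thread; the function name and the demo file are made up for illustration.

```shell
# Count reads whose sequence line contains a given adapter seed.
# Assumes plain (uncompressed) FASTQ with exactly four lines per record.
count_adapter_reads() {
    # $1 = FASTQ file, $2 = adapter seed to look for
    awk -v seed="$2" '
        NR % 4 == 2 { total++; if (index($0, seed) > 0) hits++ }
        END { printf "%d of %d reads contain %s\n", hits, total, seed }
    ' "$1"
}

# Tiny made-up FASTQ so the sketch runs on its own.
demo_fq=$(mktemp)
cat > "$demo_fq" <<'EOF'
@read1
ACGTACGTACGTAGATCGGAAGAGCACACG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTACGTACGTAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
EOF

count_adapter_reads "$demo_fq" AGATCGGAAGAGC
```

An exact substring match like this misses adapters with sequencing errors and very short adapter tails, so treat the count as a lower bound rather than a replacement for FastQC.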
Old 02-24-2012, 04:12 AM   #3
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

If possible, you should compare results with and without adapter trimming. I have found little benefit from trimming adapters, even in samples with a high percentage of adapters present (>10%). Trimmomatic will do the trimming for you.
Old 02-27-2012, 11:11 AM   #4
vinay052003
Member
 
Location: Atlanta, US

Join Date: Jan 2010
Posts: 59
Default

Hi all,
Thanks for your quick suggestions. I think I figured out the problem. Adapter trimming is needed mainly if the insert size is very small (smaller than the number of sequencing cycles, i.e. the read length). At the 5' end, sequencing starts with the first base of the actual RNA/cDNA sequence, and the 5' adapter+primer acts as the binding site for the sequencing primer, so adapter sequence only shows up when the read runs past the 3' end of a short insert.
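The arithmetic behind this can be sketched in a few lines. The function name and the example numbers are illustrative assumptions; "insert size" here means the length of the fragment between the two adapters.

```shell
# Number of adapter bases expected at the 3' end of a read:
# read length minus insert size, or zero when the insert is long enough.
adapter_bases() {
    # $1 = read length (sequencing cycles), $2 = insert size, both in bases
    if [ "$1" -gt "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo 0
    fi
}

adapter_bases 100 80    # insert shorter than the read: 20 adapter bases
adapter_bases 100 250   # insert longer than the read: 0
```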
Old 02-29-2012, 12:07 AM   #5
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by kga1978 View Post
If possible, you should compare results with and without adapter trimming. I have found little benefit from trimming adapters, even in samples with a high percentage of adapters present (>10%). Trimmomatic will do the trimming for you.
Agreed - adapter trimming isn't so critical for any kind of reference-based alignment, including RNA-seq. Usually the adapter-containing read will fail to align, and even if it contains partially useful data, the useful part may well be too short to align reliably. I tend to do it anyway, since clearing out the obvious junk speeds up the alignment step, and I usually get a small benefit, but it is pretty marginal.

For de-novo work though, adapter trimming is essential.

And glad you're finding trimmomatic useful.
Old 03-26-2012, 11:56 PM   #6
Palgrave
Member
 
Location: norway

Join Date: Aug 2011
Posts: 73
Default

Do you reverse-complement the adapters before trimming? For example, do I have to reverse-complement this adapter:

TruSeq Adapter, Index 1
5' GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
Old 03-27-2012, 12:12 AM   #7
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by Palgrave View Post
Do you reverse-complement the adapters before trimming? For example, do I have to reverse-complement this adapter:

TruSeq Adapter, Index 1
5' GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
No - Trimmomatic looks only for the sequence provided and, if the sequence is appropriately named, only in the forward or reverse read. This is both for performance reasons and to allow user control of exactly what is removed.

Whether you should look for the reverse-complement is a more complex question - depending on how the sequence is used during library prep, it may be extremely unlikely (or quite likely) to end up in a reverse-complement state. In this case, there is a trade-off between removing a small number of genuine occurrences and removing good data which is merely 'adapter-like'.

Generally I would advise focussing on getting the most common adapter / occurrence combinations out of the data, but not trying too hard to get every last one, at the cost of real data.
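If you do decide that the reverse complement is worth searching for, you have to add it to the adapter file yourself. A minimal sketch using standard Unix tools; the function name is an assumption, and the sequence is the TruSeq Index 1 adapter quoted above.

```shell
# Reverse-complement a DNA sequence: reverse the string, then swap bases.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# The TruSeq Index 1 adapter quoted above.
adapter=GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
revcomp "$adapter"
```

Ambiguity codes such as N pass through unchanged, which is the right behaviour for N but not for other IUPAC letters.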
Old 03-27-2012, 04:38 AM   #8
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Thanks for posting here, Tony! I was wondering about the "palindromic" mode of Trimmomatic. I have assumed that in my particular experiment, we will have read into the TruSeq indexed adapter in some cases (for the forward read). Accordingly, I tried to tag the corresponding sequences in my ILLUMINACLIP file with "Prefix" at the start, and "/1" at the end of the name:

>Prefix75_TruSeq Adapter, Index 1/1
(some sequence ...)

This is how I interpreted the instructions at http://www.usadellab.org/cms/index.php?page=trimmomatic

However, when I run Trimmomatic using this file, I get the following output: (amongst other things)

ILLUMINACLIP: Using 0 prefix pairs, 148 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

I had been expecting that the sequences I had tagged with "Prefix" and "/1" would have been tagged as "forward-only" sequences. Now all the removal is done in 'simple' mode (which is no disaster, of course.)

Also, the manual seems to talk about pairs of sequences:

Quote:
For 'Palindrome' clipping, the sequence names should both start with 'Prefix', and end in '/1' for the forward adapter and '/2' for the reverse adapter
Is the idea that you feed Trimmomatic pairs of adapters where /1 is the one you expect to read into in the forward read, and /2 the one you expect to read into in the reverse read? In that case, do these sequences have to have identical names (modulo the /1 or /2), or do they get paired just based on /2 appearing after /1 in the ILLUMINACLIP file?

In my case, should I have included a /2 sequence expected to appear in the reverse read for each of my /1 sequences?

I hope the questions weren't too unclear?

Last edited by kopi-o; 03-27-2012 at 04:39 AM. Reason: grammar
Old 03-27-2012, 04:55 AM   #9
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by kopi-o View Post
Is the idea that you feed Trimmomatic pairs of adapters where /1 is the one you expect to read into in the forward read, and /2 the one you expect to read into in the reverse read? In that case, do these sequences have to have identical names (modulo the /1 or /2), or do they get paired just based on /2 appearing after /1 in the ILLUMINACLIP file?
The reads are paired by name - PrefixX/1 goes with PrefixX/2, PrefixY/1 goes with PrefixY/2. Order within the adapter file is ignored.

And most important: the prefix sequences are the sequences effectively ligated 'before' that read, not the sequence which is found within that read (which is always the 'opposite' adapter in a read-through scenario). In palindrome mode, Trimmomatic does an 'in silico' ligation of the prefixes, and attempts to semi-globally align the resulting forward and reverse 'prefix+read' sequences.

Quote:
In my case, should I have included a /2 sequence expected to appear in the reverse read for each of my /1 sequences?
Indeed - palindrome mode requires 'matched' pairs of prefix sequences. Since Illumina pairs are almost always of equal length, both adapters should be present in such pairs, and thus the read-through scenario is recognised with greater confidence.

Quote:
I hope the questions weren't too unclear?
Not at all, but apparently my manual page needs work
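The naming rule can be checked before running Trimmomatic. The sketch below is an illustrative check written for this thread, not Trimmomatic's own code: it lists every 'Prefix' name in an adapter FASTA and reports whether both the /1 and /2 entries exist, regardless of their order in the file. The function name, the demo file, and the placeholder N sequences are assumptions.

```shell
# List 'Prefix' adapter names and report whether each has both /1 and /2.
check_prefix_pairs() {
    # $1 = adapter FASTA file
    awk '
        /^>Prefix/ {
            name = substr($0, 2)            # drop the leading ">"
            stem = name
            sub(/\/[12]$/, "", stem)        # strip the /1 or /2 suffix
            if (name ~ /\/1$/) fwd[stem] = 1
            if (name ~ /\/2$/) rev[stem] = 1
            stems[stem] = 1
        }
        END {
            for (s in stems)
                print s, (((s in fwd) && (s in rev)) ? "paired" : "unpaired")
        }
    ' "$1" | sort
}

# Demo file: PrefixX has both mates (in "wrong" order), PrefixY has only /1.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>PrefixX/2
NNNNNNNNNNNNNNNN
>PrefixY/1
NNNNNNNNNNNNNNNN
>PrefixX/1
NNNNNNNNNNNNNNNN
EOF

check_prefix_pairs "$demo_fa"
```

An "unpaired" entry would correspond to the situation described above, where a lone /1 sequence is not used as a prefix pair.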

Last edited by tonybolger; 03-27-2012 at 05:01 AM.
Old 03-27-2012, 05:31 AM   #10
Palgrave
Member
 
Location: norway

Join Date: Aug 2011
Posts: 73
Default

I am using cutadapt to remove adapters. Should I expect adapters at both ends or just at the 3' end of my paired-end reads? I got 2.5% adapters when trimming the 3' end.
Old 03-28-2012, 12:48 PM   #11
vinay052003
Member
 
Location: Atlanta, US

Join Date: Jan 2010
Posts: 59
Default

I am not sure about other technologies, but on Illumina the sequencing cycles start right at the 5' end of the actual insert sequence. So adapter contamination in the final read should appear only at the 3' end.
Old 10-04-2012, 02:09 PM   #12
safay
Junior Member
 
Location: Berkeley, CA

Join Date: Jun 2011
Posts: 2
Default

Quote:
Originally Posted by tonybolger View Post
The reads are paired by name - PrefixX/1 goes with PrefixX/2, PrefixY/1 goes with PrefixY/2. Order within the adapter file is ignored.

And most important: the prefix sequences are the sequences effectively ligated 'before' that read, not the sequence which is found within that read (which is always the 'opposite' adapter in a read-through scenario). In palindrome mode, Trimmomatic does an 'in silico' ligation of the prefixes, and attempts to semi-globally align the resulting forward and reverse 'prefix+read' sequences.


Indeed - palindrome mode requires 'matched' pairs of prefix sequences. Since Illumina pairs are almost always of equal length, both adapters should be present in such pairs, and thus the read-through scenario is recognised with greater confidence.


Not at all, but apparently my manual page needs work
Tony, I truly appreciate your help on this forum. I am still confused about which adapter should be assigned to /1 and which to /2 in palindrome mode. Could you please provide a sample file, even if it is just a mock without the real adapters, showing how to construct the contaminants.fa file?

Am I correct in thinking that, with Illumina TruSeq libraries, the first round of sequencing could potentially yield a read through to the reverse complement of the indexed adapter? The second round of sequencing could potentially yield a read-through to the reverse complement of the universal adapter? Please correct me if my understanding of the technology is incorrect.

Does this mean that we should make a unique universal adapter sequence for each index, i.e., copy and rename it for each indexed adaptor?

example:

>PrefixIndex1/1
NNNNNNNNNNNNNNNNNNNN <------ where this is the reverse complement of the index1 adaptor
>PrefixIndex1/2
NNNNNNNNNNNNNNNN <------- where this is the reverse complement of the universal adapter
>PrefixIndex2/1
NNNNNNNNNNNNNNNNNNNN <------ where this is the reverse complement of the index2 adaptor
>PrefixIndex2/2
NNNNNNNNNNNNNNNN <------- where this is the reverse complement of the universal adapter

...

and so on, for each of the indexed adapters?

And then, for each of the adaptors, do we also need to include a separate set if we want to search for them in "simple" mode?
Old 10-07-2012, 11:06 PM   #13
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by safay View Post
Tony, I truly appreciate your help on this forum. I am still confused about which adapter should be assigned to /1 and which to /2 in palindrome mode. Could you please provide a sample file, even if it is just a mock without the real adapters, showing how to construct the contaminants.fa file?
The suggested pair for TruSeq3 is:
>PrefixPE/1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
but I would also suggest lowering the palindrome threshold from 40 to 30 (the adapters from the previous protocol were longer, so a higher threshold could be reached).
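Put together, the suggestion above could look like the sketch below. The ILLUMINACLIP step takes its settings as colon-separated fields (adapter file, seed mismatches, palindrome clip threshold, simple clip threshold). Only the palindrome threshold of 30 comes from this thread; the seed mismatch value of 2, the simple threshold of 10, the jar name, and all FASTQ file names are assumptions, so the command is printed rather than run.

```shell
# Write the TruSeq3 prefix pair suggested above to an adapter file.
adapters=$(mktemp)
cat > "$adapters" <<'EOF'
>PrefixPE/1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
EOF

# ILLUMINACLIP fields: file:seed mismatches:palindrome threshold:simple threshold.
# 30 is the palindrome threshold suggested above; 2 and 10 are assumed values.
clip_step="ILLUMINACLIP:${adapters}:2:30:10"

# Print the command rather than run it; jar and FASTQ names are placeholders.
echo "java -jar trimmomatic.jar PE in_1.fq in_2.fq" \
     "out_1P.fq out_1U.fq out_2P.fq out_2U.fq $clip_step"
```

In paired-end mode the four output files are, in order, forward paired, forward unpaired, reverse paired, and reverse unpaired.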

Quote:
Originally Posted by safay View Post
Am I correct in thinking that, with Illumina TruSeq libraries, the first round of sequencing could potentially yield a read through to the reverse complement of the indexed adapter? The second round of sequencing could potentially yield a read-through to the reverse complement of the universal adapter? Please correct me if my understanding of the technology is incorrect.
This is correct, though I haven't had a coffee yet.

Quote:
Originally Posted by safay View Post
Does this mean that we should make a unique universal adapter sequence for each index, i.e., copy and rename it for each indexed adaptor?
It's not actually necessary - using just the 'common' part of all the indexed adapters (between the 'useful' DNA and the index) seems to be sufficient.

Quote:
Originally Posted by safay View Post
And then, for each of the adaptors, do we also need to include a separate set if we want to search for them in "simple" mode?
Yes, but there's still the 'if you want to search for them' part.

It probably makes sense to search for the PCR primer sequences, but I'm not sure what other technical sequences occur regularly. I would suggest looking for over-represented sequences using, say, FastQC, and trimming relatively selectively, rather than adopting a brute-force strategy.

It's a balance - removing valid data which looks a bit like a technical sequence (whether caused by having far too many technical sequences or thresholds too low) is at least as bad as leaving true technical sequences in there, since you potentially lose all coverage of a specific region.
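For a quick look at candidate over-represented sequences without FastQC, a plain tally of identical reads can help decide what is worth adding to the adapter file. This is a rough sketch, assuming plain four-line FASTQ; the function name and the demo reads are made up, and unlike FastQC it only counts exact full-length duplicates.

```shell
# Print the most frequent read sequences in a FASTQ file with their counts.
# Assumes plain FASTQ with four lines per record.
top_sequences() {
    # $1 = FASTQ file, $2 = how many sequences to show
    awk 'NR % 4 == 2' "$1" | sort | uniq -c | sort -k1,1nr | head -n "$2" |
        awk '{ print $1, $2 }'
}

# Tiny made-up FASTQ: one sequence occurs twice, another once.
demo_reads=$(mktemp)
cat > "$demo_reads" <<'EOF'
@r1
AGATCGGAAGAGCACACGTC
+
IIIIIIIIIIIIIIIIIIII
@r2
AGATCGGAAGAGCACACGTC
+
IIIIIIIIIIIIIIIIIIII
@r3
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
EOF

top_sequences "$demo_reads" 1
```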
Old 03-22-2013, 11:36 PM   #14
azleen
Junior Member
 
Location: Malaysia

Join Date: Mar 2013
Posts: 2
Default Trim adapter index using CLCBio

Has anyone here seen overrepresented TruSeq adapter sequences in their sample data? What should I do with this data? Should I trim the adapter or not? If yes, how? I'm using CLCBio.
Old 03-24-2013, 12:05 PM   #15
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by azleen View Post
Has anyone here seen overrepresented TruSeq adapter sequences in their sample data?
This is a pretty normal situation, especially for libraries with short insert sizes, and/or less than perfect size selection.

Quote:
Originally Posted by azleen View Post
What should I do with this data? Should I trim the adapter or not?
Pretty much every possible use of NGS benefits from trimming adapters.

Quote:
Originally Posted by azleen View Post
If yes, how? I'm using CLCBio.
Presumably it supports it, but you also have the choice of many free tools.
Old 04-09-2013, 11:21 AM   #16
aprice67
Member
 
Location: New York

Join Date: Nov 2012
Posts: 49
Default

Hi, I'm working with some similar data. Something I found is that a lot of trimming tools aren't really set up for paired-end data. I have a pipeline for trimming and aligning reads. It goes basically like this:


# Start with two files, paired-end Illumina. Keep only the reads that passed the basic quality filter (flag N in the header). Output goes to the filtered files.
# --no-group-separator (GNU grep) stops grep from writing "--" lines between non-adjacent records, which would corrupt the FASTQ.
grep -A 3 --no-group-separator '^@.* [^:]*:N:[^:]*:' "$INPUT1" > "$FILTERED1"
grep -A 3 --no-group-separator '^@.* [^:]*:N:[^:]*:' "$INPUT2" > "$FILTERED2"

# fastq-mcf handles paired-end reads well; it was the best tool I could find for paired-end trimming. I don't remember all the parameters, but there's a great resource out there describing this tool.
fastq-mcf -o "$OUTPUT1" -o "$OUTPUT2" -l 16 -q 15 -w 4 -x 10 -u -P 33 "$ADAPTERS" "$FILTERED1" "$FILTERED2"

# Align with bowtie and write a SAM file.
bowtie -t -p 8 --sam "$REF_GENOME" -1 "$OUTPUT1" -2 "$OUTPUT2" "$ALIGNED_OUTPUT"

# Make a sorted, indexed BAM file from the bowtie alignment, which can be used for all sorts of things.
# The sort call uses the old samtools 0.1.x syntax, where $SORTED_BAM is an output prefix; samtools 1.x expects -o instead.
samtools view -bS "$ALIGNED_OUTPUT" | samtools sort - "$SORTED_BAM"
samtools index "$SORTED_BAM.bam" "$SORTED_BAM.bam.bai"
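Because the grep filter runs on each file separately, it may be worth confirming that the two filtered files still hold the same number of reads before trimming and aligning. A minimal sketch, assuming plain four-line FASTQ; the function names and the demo files are made up for illustration.

```shell
# Count reads in a FASTQ file (four lines per record).
fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Succeed only if both mates have the same number of reads.
same_read_count() {
    [ "$(fastq_reads "$1")" -eq "$(fastq_reads "$2")" ]
}

# Demo: mate1 has two reads, mate2 has only one.
mate1=$(mktemp)
mate2=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$mate1"
printf '@r1\nCCGT\n+\nIIII\n' > "$mate2"

if same_read_count "$mate1" "$mate2"; then
    echo "mates are in sync"
else
    echo "read counts differ: $(fastq_reads "$mate1") vs $(fastq_reads "$mate2")"
fi
```

Equal counts do not prove that the reads are in the same order, but a mismatch is a cheap early warning.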



That's pretty much how I'm doing it for my data, and it works pretty well. As for those nasty overrepresented sequences: I'm guessing you're doing quality assessment with FastQC, which is a great tool. In my case, I did RNA-seq on bacterial genomes, so my read depth is really high because the genome is small. Add some highly expressed genes and FastQC will flag overrepresented sequences. I'm basically ignoring them in my data, but think about how overrepresented sequences apply to your data and how much they really matter.

Hope this helps.
Old 05-02-2017, 07:58 PM   #17
candida
Junior Member
 
Location: Singapore

Join Date: Oct 2010
Posts: 2
Default

Hello everyone,


I am working with TruSeq paired-end data (150 bp). I have a question about the adapter file provided with Trimmomatic for trimming adapters.

According to the adapter file provided with Trimmomatic, "TruSeq3-PE-2.fa", the reverse complement of the index adapter sequence is used for trimming reads from the R2 file, and the universal adapter is used for trimming reads from the R1 file.
>PrefixPE/1 TACACTCTTTCCCTACACGACGCTCTTCCGATCT

>PrefixPE/2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

>PE1 TACACTCTTTCCCTACACGACGCTCTTCCGATCT

>PE1_rc AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA

>PE2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

>PE2_rc AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

However, for my data it looks like the actual sequence of the index adapter is in the R1 file and the reverse complement of the universal adapter is in the R2 file.

This information was also provided to me by Illumina support team.
https://support.illumina.com/bulleti...-trimming.html

Therefore I prepared my adapter file as follows (I'm using the full sequence):
>PrefixPE/1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG (index adapter)

>PrefixPE/2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT ( reverse complement of universal adapter)

>PE1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG

>PE1_rc CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (revcomp of PE1)

>PE2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

>PE2_rc AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT (revcomp of PE2)

Please let me know whether the adapter file I prepared is fine, or whether the Trimmomatic adapter file is better and should always be used.
I tried both my custom file and the Trimmomatic-recommended file and found that both removed the adapters when checked with FastQC.

Please correct me or let me know if I'm missing something!
I appreciate your help and guidance!
Thanks,
Candida
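One possible reason both files cleaned up the FastQC report is that their simple-mode entries overlap: the PE1_rc and PE2_rc sequences in the Trimmomatic file are the first 34 bases of the full-length sequences in the custom file. The sketch below only checks that prefix relationship between the sequences quoted above; it says nothing about how Trimmomatic scored the matches, and the function and variable names are made up.

```shell
# Succeed if $2 starts with $1.
starts_with() {
    case "$2" in
        "$1"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Simple-mode entries from the Trimmomatic file quoted above.
stock_pe1_rc=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
stock_pe2_rc=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

# Full-length entries from the custom file quoted above.
custom_pe1=AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
custom_pe2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

starts_with "$stock_pe2_rc" "$custom_pe1" && echo "stock PE2_rc is a prefix of custom PE1"
starts_with "$stock_pe1_rc" "$custom_pe2" && echo "stock PE1_rc is a prefix of custom PE2"
```

Note that the two files use the 'Prefix' entries differently, so this overlap concerns the simple-mode entries only.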

Last edited by candida; 05-03-2017 at 12:32 AM. Reason: Delete Post

Tags
adapter rna-seq sra
