SEQanswers

Old 10-03-2012, 03:48 PM   #21
Niloofar
Junior Member
 
Location: Australia

Join Date: May 2012
Posts: 7
Smile

OK, Thanks Westerman
Old 12-10-2012, 02:13 PM   #22
alisrpp
Member
 
Location: New York

Join Date: Dec 2010
Posts: 40
Default

I have some questions about the folder with the adapter sequences:

1. Where do I have to create the folder?
2. Can anyone give me an example of the format? I'm using Illumina TruSeq adapters.

Thanks.
Old 12-20-2012, 09:01 AM   #23
alisrpp
Member
 
Location: New York

Join Date: Dec 2010
Posts: 40
Default

Hi all,

I started using Trimmomatic 0.22 three weeks ago and I'm having some difficulties.

Until now I was doing the adapter trimming and quality trimming with CLC. To check which method gives better-quality trimmed reads, I'm trimming my libraries again with Trimmomatic and comparing the FastQC reports of both (CLC and Trimmomatic).

For some reason Trimmomatic is not doing a very good job with my forward sequences, and I still have some overrepresented sequences that FastQC recognizes as one of the Illumina indexes.

I thought that maybe it was because my adapters FASTA file was wrong, so I played around with it a little.

Now I'm completely lost and exhausted, so I need help. I have attached my 2 adapter FASTA files, because I built my libraries with 2 different kits from Illumina (the Multiplexing Sample Prep Oligo Only kit, with homemade adapters and with Illumina adapters; and the TruSeq kit).

And here is my script:

for FRW in $(ls *_R1_005.fastq.gz); do
for RVS in $(ls *_R2_005.fastq.gz); do
bsub -q $queue java -Xincgc -Xms4g -Xmx4g -classpath ../../trimmomatic-0.22/Trimmomatic-0.22/trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 $FRW $RVS ../trimmomatic_data_trial/$FRW.FRWPE.fq.gz ../trimmomatic_data_trial/$FRW.FRWUnP.fq.gz ../trimmomatic_data_trial/$RVS.RVSPE.fq.gz ../trimmomatic_data_trial/$RVS.RVSUnP.fq.gz ILLUMINACLIP:../../trimmomatic-0.22/TruSeq_adapters.fa:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:20


THANKS!!!
Attached Files
File Type: txt IlluminaPE.txt (1,022 Bytes, 65 views)
File Type: txt TruSeq_adapters.txt (1.3 KB, 64 views)
Old 12-20-2012, 09:19 AM   #24
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Well, your script is confusing. It looks like you'll be comparing multiple forward files to multiple reverse files without any correspondence between the forward and reverse sequences. You then regenerate the same output files over and over again with the last files being generated good for the reverse sequences but not the forward ones.

Now if you only have one matching file for your 'ls' the above observation doesn't really matter. But if you have multiple files then ... let's see.

Assume the following files:
A_R1_005.fastq.gz
B_R1_005.fastq.gz
C_R2_005.fastq.gz
D_R2_005.fastq.gz

Which I'll abbreviate as 'A', 'B', 'C', and 'D'.

A vs C produces files:
A.FRWPE.fq.gz A.FRWUnP.fq.gz C.RVSPE.fq.gz C.RVSUnP.fq.gz

A vs D produces files: (bad match between R1 and R2)
A.FRWPE.fq.gz A.FRWUnP.fq.gz D.RVSPE.fq.gz D.RVSUnP.fq.gz

B vs C produces files: (bad match between R1 and R2)
B.FRWPE.fq.gz B.FRWUnP.fq.gz C.RVSPE.fq.gz C.RVSUnP.fq.gz

B vs D produces files:
B.FRWPE.fq.gz B.FRWUnP.fq.gz D.RVSPE.fq.gz D.RVSUnP.fq.gz

Everything is confused!

Now it is quite possible that your script is working ok (either because of single files or because it was not presented completely) but you may wish to run your Trimmomatic without the loops involved.
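
One way to avoid the cross-pairing described above is to loop over the forward files only and derive each mate's name from the forward name. This is a minimal sketch, assuming both mates share the same prefix and differ only in `_R1_` versus `_R2_`; the `mate_of` helper is hypothetical, and the Trimmomatic call is replaced by an `echo` placeholder.

```shell
# mate_of R1_FILE: derive the R2 file name from an R1 file name
# (assumes the naming pattern *_R1_005.fastq.gz / *_R2_005.fastq.gz).
mate_of() {
  printf '%s\n' "${1%_R1_005.fastq.gz}_R2_005.fastq.gz"
}

for fwd in *_R1_005.fastq.gz; do
  [ -e "$fwd" ] || continue                 # the glob matched nothing
  rvs=$(mate_of "$fwd")
  if [ ! -e "$rvs" ]; then
    echo "no mate found for $fwd" >&2
    continue
  fi
  # Placeholder: put the bsub/java Trimmomatic command here.
  echo "would trim pair: $fwd $rvs"
done
```

Each forward file is then paired with exactly one reverse file, so no output file is regenerated from a mismatched pair.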
Old 12-20-2012, 10:36 AM   #25
alisrpp
Member
 
Location: New York

Join Date: Dec 2010
Posts: 40
Default

Hi westerman,

Thanks for the answer.

I only have one forward file (R1) and one reverse file (R2), so although it would be better not to use loops, I don't think the script is the problem. Before using the script with the loops I was using this:

java -Xincgc -Xms4g -Xmx4g -classpath ../../trimmomatic-0.22/Trimmomatic-0.22/trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 [name_forward_file] [name_reverse_file] ../trimmomatic_data_trial/[name_forward_file]_PE.fq.gz ../trimmomatic_data_trial/[name_forward_file]_UnP.fq.gz ../trimmomatic_data_trial/[name_reverse_file]_PE.fq.gz ../trimmomatic_data_trial/[name_reverse_file]_UnP.fq.gz ILLUMINACLIP:../../trimmomatic-0.22/TruSeq_adapters.fa:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:20

... and I was having the same problem with my reverse trimmed sequences. That's why I thought that the problem was in my adapters FASTA file.
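
For readers following the command above: per the Trimmomatic documentation, the fields after ILLUMINACLIP are the adapter FASTA path, the seed mismatches, the palindrome clip threshold and the simple clip threshold. The sketch below only labels those fields; the `describe_illuminaclip` helper is hypothetical and assumes the path itself contains no colon.

```shell
# describe_illuminaclip STEP: print the colon-separated fields of an
# ILLUMINACLIP step with their documented meanings.
describe_illuminaclip() {
  IFS=: read -r step fasta seed palindrome simple <<EOF
$1
EOF
  printf 'adapter_fasta=%s\n' "$fasta"
  printf 'seed_mismatches=%s\n' "$seed"
  printf 'palindrome_clip_threshold=%s\n' "$palindrome"
  printf 'simple_clip_threshold=%s\n' "$simple"
}

describe_illuminaclip "ILLUMINACLIP:../../trimmomatic-0.22/TruSeq_adapters.fa:2:40:15"
```

With the values used in this thread, the palindrome threshold is 40 and the simple threshold is 15.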
Old 12-20-2012, 11:18 AM   #26
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,167
Default

Your adapter FASTA files have /2 after the IDs of the index adapters, which tells Trimmomatic to only check these sequences against the second (reverse) read. These adapters will show up in the first (forward) read, not the second. Likewise, the universal adapter sequence will show up in the second read, not the first. You need to swap the /1 and /2 endings in your adapter FASTA files, and you should use the reverse complement of the Universal adapter. You could remove /1 and /2 entirely, forcing Trimmomatic to check the adapters against all reads, but that is inefficient.
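
Taking the reverse complement of an adapter, as suggested above, can be done on the command line. This is a small sketch using `rev` and `tr`; the `revcomp` function name is made up for illustration, and the input is just a short example string rather than a full adapter.

```shell
# revcomp SEQ: print the reverse complement of a DNA sequence.
# Characters other than A/C/G/T (either case) are passed through unchanged.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp "AGATCGGAAGAGC"   # prints GCTCTTCCGATCT
```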
Old 12-20-2012, 11:19 AM   #27
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Ok, having gotten the possible script problem out of the way, everything else looks syntactically correct. Time to dig into the adapter file or the actual reads that are not being properly trimmed. It is possible that the 2:40:15 parameter to ILLUMINACLIP is tripping you up. Are you using the simple trim or the (in my experience rarer) palindrome trim? You say:

Quote:
I still have some overrepresented sequences that FastQC recognizes as one of the Illumina indexes.
What are these sequences?

Can you look at a handful of reads reported to have Illumina indexes and manually check whether they contain the index? And, if so, how good is the match?
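
A quick way to do that kind of manual check is to count the reads whose sequence line contains a given string. This is only a sketch: it assumes standard four-line FASTQ records and exact matches, and the `count_reads_with` name and the file name in the usage comment are hypothetical.

```shell
# count_reads_with SEQ: read FASTQ on stdin and print how many reads
# contain SEQ exactly (only line 2 of each 4-line record is searched).
count_reads_with() {
  awk 'NR % 4 == 2' | grep -c -- "$1"
}

# Usage (file name is a placeholder):
#   gzip -dc sample_R1_005.fastq.gz | count_reads_with AGATCGGAAGAGC
```

Replacing `grep -c` with `grep -m 5` would print a few matching reads instead of a count, which helps to judge how good the matches are.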
Old 12-20-2012, 11:24 AM   #28
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by kmcarr View Post
Your adapter fasta files have /2 after the IDs of the index adapters which tells Trimmomatic to only check these sequences against the second (reverse) read. ...
Ah, good catch! All my adapters are non-strand specific with reverse-complements thus I just skipped over that part of his adapters. It is undoubtedly the correct solution.
Old 12-20-2012, 11:38 AM   #29
alisrpp
Member
 
Location: New York

Join Date: Dec 2010
Posts: 40
Default

Thanks a lot!!!

Do you recommend making an individual FASTA file for each index?
Old 12-20-2012, 11:45 AM   #30
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

All my adapters are in one file. Given the number of adapters I cannot see it working otherwise. I undoubtedly over-process my sequences by looking at both strands in both directions with all possible adapters -- as kmcarr says, "inefficient" -- but it makes me more comfortable that I am picking up everything.
Old 12-20-2012, 11:47 AM   #31
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,167
Default

Quote:
Originally Posted by alisrpp View Post
Thanks a lot!!!

Do you recommend making an individual FASTA file for each index?
No, I use a single file with all of the Illumina TruSeq sequences in it. I also don't bother designating any of them for "Palindrome" search. I've read the description of Palindrome search several times on the Trimmomatic site and frankly still don't understand it. I just do simple searches for them and it seems to work fine for me.

I have attached the file I use with Trimmomatic.
Attached Files
File Type: zip TruSeqForTimmomatic.fna.zip (1,002 Bytes, 89 views)
Old 12-20-2012, 11:59 AM   #32
alisrpp
Member
 
Location: New York

Join Date: Dec 2010
Posts: 40
Default

Thanks for the answers!

About the palindrome clipping: a few days ago I wrote to one of the creators of Trimmomatic asking for an alternative explanation to the one on the website (I couldn't understand it either).
Here is the answer, which I found useful:

Quote:
Simple clipping is just finding a contaminant sequence somewhere within a read. Conceptually, you get contaminant and read, and you slide them across each other, until you get a perfect or close enough match. So, with R being read bases, and C being contaminant, you check

1)
RRRRRRRRRRR
CCCC

2)
RRRRRRRRRRR
 CCCC ->

etc.

Palindrome clipping is a bit more complex - and related to actual palindromes only in a twisted mind like mine. In this case, you 'ligate' the presumed adapter sequence to the start of each read in a pair, and try sliding them over each other.

So with F being bases from the forward read, R being bases from the reverse read, and A being either adapter (technically the two adapters are different, but let's ignore that for now).

AAAAAAFFFFFFF ->
<- RRRRRRRAAAAAA

In this case, the aligning region is much longer, since it consists of the entire read length plus part of the adapter. This gives a very high confidence that an apparent 'read-through' is a true positive.
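
The "simple clipping" idea in the quote, sliding the contaminant along the read until the match is close enough, can be sketched as below. This is an illustration only, not Trimmomatic's implementation: it counts mismatches at each offset instead of scoring the alignment against a threshold, it only considers positions where the whole contaminant fits inside the read, and `simple_clip_pos` and the toy sequences are made up.

```shell
# simple_clip_pos READ CONTAMINANT [MAX_MISMATCHES]
# Print the first 0-based offset at which CONTAMINANT lines up with READ
# with at most MAX_MISMATCHES mismatches (default 0); print nothing if none.
simple_clip_pos() {
  awk -v rd="$1" -v contam="$2" -v maxmm="${3:-0}" '
    BEGIN {
      n = length(rd); m = length(contam)
      for (off = 0; off + m <= n; off++) {      # slide one base at a time
        mm = 0
        for (i = 1; i <= m; i++)
          if (substr(rd, off + i, 1) != substr(contam, i, 1)) mm++
        if (mm <= maxmm) { print off; exit }    # close enough: report it
      }
    }'
}

simple_clip_pos "ACGTACGTAGATCGGAAG" "AGATCGG" 0   # prints 8
```

Everything from the reported offset onwards would then be clipped from the read.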
Old 12-20-2012, 12:06 PM   #33
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,167
Default

Quote:
Originally Posted by alisrpp View Post
Here is the answer, which I found useful:

Yeah, still clear as mud.
Old 03-22-2013, 03:05 PM   #34
claire.anderson1
Junior Member
 
Location: Australia

Join Date: Mar 2013
Posts: 1
Default How does adapter trimming in Trimmomatic work?

I have two adapter sequences of 58 bp and 66 bp that I would like to remove from my Illumina data set (if present). Can Trimmomatic recognise partial matches to these adapter sequences? For example, if I am using 100 bp reads and a particular sequence contains 90 bp of DNA from the source organism, the remaining 10 bp at the end of the read might be from the adapter. Would Trimmomatic be able to pick this up? Or must it find a match to the whole adapter sequence?

I'm new at playing with NGS data, so any advice would be gratefully received!
Old 03-22-2013, 03:30 PM   #35
cllorens
Member
 
Location: Valencia

Join Date: Nov 2011
Posts: 44
Default

You could also check out cutadapt, which is also useful for Illumina data.

http://code.google.com/p/cutadapt/
Old 03-24-2013, 05:43 AM   #36
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by claire.anderson1 View Post
I have two adapter sequences of 58 bp and 66 bp that I would like to remove from my Illumina data set (if present). Can Trimmomatic recognise partial matches to these adapter sequences? For example, if I am using 100 bp reads and a particular sequence contains 90 bp of DNA from the source organism, the remaining 10 bp at the end of the read might be from the adapter. Would Trimmomatic be able to pick this up? Or must it find a match to the whole adapter sequence?
In the case of paired-end data with adapter 'read-through' (where the DNA fragment is shorter than the read length, and the ends of the reads come from the 'opposite' adapter), trimmomatic can remove even a single adapter base (if you use sufficiently aggressive settings). Older versions of trimmomatic required at least 8 bp of adapter in this case, but that was probably too conservative, so I reduced it. The latest versions also include the recommended adapter sequences, which have been a common stumbling point.

For other, less common, scenarios, where the adapter location/orientation isn't known in advance, or where you're using single-end data, you'd typically want to be a bit more cautious, but 10 bp or more can usually be removed at a reasonable false positive rate.

Hope this helps.
Old 03-24-2013, 06:22 AM   #37
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by kmcarr View Post
Yeah, still clear as mud.
Sorry that my explanation for this obviously sucks, and now that the adapter sequences are included directly in trimmomatic, there's probably not such a major need for everyone to understand it, but here goes anyway.

During adapter read-through, with paired-end data (and assuming the same length of forward and reverse reads) we get pairs with:
  • The forward read consisting of X useful bases, followed by Y bases from the end of the reverse read adapter.
  • The reverse read consisting of X useful bases, followed by Y bases from the end of the forward read adapter.
The beauty is that those X bases in both the forward and reverse reads, are the same bases, though in reverse complement, and those Y bases are always specific known sequences starting immediately afterwards. So rather than fish for those Y bases in isolation (which is risky / difficult if Y is small), we can check simultaneously for 3 things:
  • The first X bases of both reads being reverse complements of each other.
  • The additional bases from the forward read match the reverse adapter.
  • The additional bases from the reverse read match the forward adapter.
Since all three must be found to support the 'read-through' hypothesis in a given read pair/position, the false positive rate is very low. Naturally we don't know what X is, but we can check every possible X from zero to the read length.
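
The three simultaneous checks can be sketched as follows. This is a toy illustration with exact matching only, whereas Trimmomatic scores the alignment against the palindrome clip threshold; the `find_readthrough` name and the adapter strings in the usage line are invented for the example, and both reads are assumed to have the same length.

```shell
# find_readthrough FWD REV ADAPTER_SEEN_IN_FWD ADAPTER_SEEN_IN_REV
# Try every X below the read length, largest first, and print the first X
# for which all three checks hold; print nothing if no X qualifies.
find_readthrough() {
  awk -v f="$1" -v r="$2" -v af="$3" -v ar="$4" '
    function revcomp(s,    i, out) {
      out = ""
      for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
      return out
    }
    BEGIN {
      comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"
      n = length(f)
      for (x = n - 1; x >= 0; x--) {
        y = n - x                                   # bases left after X
        # 1) first X bases of both reads are reverse complements
        if (substr(f, 1, x) != revcomp(substr(r, 1, x))) continue
        # 2) remaining forward bases match the start of one adapter
        if (substr(f, x + 1) != substr(af, 1, y)) continue
        # 3) remaining reverse bases match the start of the other adapter
        if (substr(r, x + 1) != substr(ar, 1, y)) continue
        print x
        exit
      }
    }'
}

find_readthrough "ACGTTGCAAGGGGG" "CTTGCAACGTTTTT" "GGGGCC" "TTTTAA"   # prints 10
```

In the usage line the fragment is 10 bases long, so both 14-base reads run 4 bases into the toy adapters, and X = 10 is the only value at which all three conditions agree.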
Old 04-23-2013, 02:34 PM   #38
leda
Junior Member
 
Location: California

Join Date: Feb 2013
Posts: 7
Default

What do the four columns following the read identifier in the trimlog represent? I can't find this in the documentation.

thanks!
Old 04-23-2013, 03:18 PM   #39
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default Introducing the Trimmomatic

This is an extract from the trimmomatic web page:

specifying a trimlog file creates a log of all read trimmings, indicating the following details:

* the read name
* the surviving sequence length
* the location of the first surviving base, aka. the amount trimmed from the start
* the location of the last surviving base in the original read
* the amount trimmed from the end

http://www.usadellab.org/cms/index.php?page=trimmomatic
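
Given the column layout listed above, a trimlog can be summarised with awk. This is a sketch under two assumptions that are not stated in the extract: that columns are separated by spaces, and that a surviving length of 0 marks a dropped read. Because read names may themselves contain spaces, the fields are counted from the end of the line; `summarise_trimlog` is a made-up name.

```shell
# summarise_trimlog: read a trimlog on stdin and print read count,
# reads with surviving length 0, and mean bases trimmed from each end.
summarise_trimlog() {
  awk '{
    len  = $(NF - 3)      # surviving sequence length
    head = $(NF - 2)      # amount trimmed from the start
    tail = $NF            # amount trimmed from the end
    n++
    if (len == 0) dropped++
    sum_head += head
    sum_tail += tail
  }
  END {
    if (n)
      printf "reads=%d dropped=%d mean_trimmed_start=%.2f mean_trimmed_end=%.2f\n",
             n, dropped, sum_head / n, sum_tail / n
  }'
}

# Usage (file name is a placeholder):
#   summarise_trimlog < trimlog.txt
```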
Old 05-16-2013, 01:09 AM   #40
helios
Junior Member
 
Location: Milan

Join Date: Nov 2010
Posts: 1
Default Make trimmomatic a binary/executable

Hi Guys,

In case you prefer to run trimmomatic as an executable (./trimmomatic), you can follow these steps:

1) download and gunzip stub.sh.gz (attached) into the directory where trimmomatic-0.X.jar is located
2) cat stub.sh trimmomatic-0.30.jar > trimmomatic
3) chmod +x trimmomatic
4) add the directory containing trimmomatic to your PATH

ref: https://coderwall.com/p/ssuaxa

If you need to change Java's parameters, modify stub.sh accordingly.

Ciao.
Attached Files
File Type: gz stub.sh.gz (185 Bytes, 15 views)
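
The attached stub.sh is not reproduced in this thread. As a rough idea of what such a stub can look like, here is a hypothetical version, written out with a here-document; it is not the attachment itself. It relies on `java -jar` accepting a jar that has a shell script prepended, and the `JAVA_OPTS` variable and its `-Xmx4g` default are assumptions chosen for illustration.

```shell
# Write a hypothetical launcher stub to stub.sh in the current directory.
cat > stub.sh <<'EOF'
#!/bin/sh
# This script is meant to be concatenated in front of the jar, so the
# resulting file is both a shell script and the jar passed to java.
self=$(command -v "$0" 2>/dev/null) || self=$0
# JAVA_OPTS is an illustrative variable; left unquoted so it can hold
# several options.
exec java ${JAVA_OPTS:--Xmx4g} -jar "$self" "$@"
EOF

# Then, as in the steps above:
#   cat stub.sh trimmomatic-0.30.jar > trimmomatic && chmod +x trimmomatic
```

With a stub like this, Java's parameters could be changed per run (for example `JAVA_OPTS="-Xmx8g" ./trimmomatic ...`) without editing the file.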