SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
bwa quality trimming and samtools rmdup greigite Bioinformatics 7 02-22-2013 03:22 AM
newbler quality trimming Himalaya 454 Pyrosequencing 2 08-22-2012 04:44 AM
BWA errors after quality trimming flipwell Bioinformatics 3 05-21-2012 07:39 PM
Looking for an perl script to quality trimming data tarias Bioinformatics 4 09-21-2011 10:24 PM
Quality trimming fastq 454 data? ewilbanks Bioinformatics 2 04-14-2011 10:45 PM

Reply
 
Thread Tools
Old 12-06-2011, 06:04 PM   #1
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default Trimmomatic quality trimming

I have been using Trimmomatic to trim adapters and quality scores. In general, I have been pleased with the performance, but I just ran some low quality samples through and Trimmomatic doesn't appear to be trimming correctly based on quality? In certain cases I even see the per base quality being worse after trimming than before. I have set my cutoff to '10', so I would expect everything below that to be cut off. Furthermore, I have specified a sliding-window minimum quality of '18'. A couple of examples:

Before:


After:


Before:


After:


In each case I ran the following commands:
Code:
java -Xmx2g -classpath /usr/local/bin/trimmomatic/trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 sample.fastq.gz sample.trimmed.fastq ILLUMINACLIP:/Volumes/Storage_1/Sequencing_1/References/Contaminants/contaminants.fasta:2:40:12 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:18 MINLEN:18
kga1978 is offline   Reply With Quote
Old 12-08-2011, 12:51 AM   #2
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

I tried running these samples through PrinSeq and cutadapt as well with very similar results. This means that the problem isn't specific to Trimmomatic, but I'm still interested to hear if anybody knows what is causing this? I guess it only happens on really low-quality reads?
kga1978 is offline   Reply With Quote
Old 12-08-2011, 03:03 AM   #3
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by kga1978 View Post
I have been using Trimmomatic to trim adapters and quality scores. In general, I have been pleased with the performance, but I just ran some low quality samples through and Trimmomatic doesn't appear to be trimming correctly based on quality?
Strange indeed.

Is your data really phred33 as suggested in the command line? Illumina 1.5 is normally phred64.
tonybolger is offline   Reply With Quote
Old 12-08-2011, 10:43 AM   #4
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

To be perfectly honest, I'm not sure - the quality score thing is doing my head in (damn you, Illumina!). I assumed if it was phred64, my maximum score would be higher than 40, no? I'll try and rerun with phred64 and see what happens.
kga1978 is offline   Reply With Quote
Old 12-09-2011, 01:29 AM   #5
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by kga1978 View Post
To be perfectly honest, I'm not sure - the quality score thing is doing my head in (damn you, Illumina!). I assumed if it was phred64, my maximum score would be higher than 40, no? I'll try and rerun with phred64 and see what happens.
If the data really is phred-64 but trimmomatic is told that it is phred33, trimmomatic will interpret each score as 31 higher than it really is - thus not really trimming much since the quality appears 'excellent'. I really should add a warning if the quality scores are outside the expected range, as this is nearly always caused by wrong phred-33/phred-64 selection, and results in either no trimming, or almost everything trimmed, depending on the direction of the mistake.

In any case, you really shouldn't see a significant percentage of the reads with base calls much below the sliding window threshold - e.g. in fastQC, the yellow bars should mostly be above, but the whiskers will tend to be below. On really bad data, you might also see the yellow bars drop in the last few bases, an artefact of 'under-testing' as the sliding window runs off the end of the reads - this is to be expected.

Here's an example of some really low quality data pre/post trimming, using sliding window 4 wide, quality 15.

Untrimmed Forward:

Untrimmed Reverse:

Trimmed Forward Paired:

Trimmed Forward Unpaired:

Trimmed Reverse Paired:

Trimmed Reverse Unpaired:
tonybolger is offline   Reply With Quote
Old 12-09-2011, 02:42 AM   #6
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hi Tony,

Got it. I reran some of the reads and most of them got better with phred64 (I mostly use trimming for adapters though - my aligner takes into consideration quality). However, as you said, really bad reads still fall off dramatically in the end - probably due to the sliding window. So, just to be clear - am I correct in the following?

Casava 1.3 - 1.7: Use Phred64
Casava 1.8+: Use Phred33
454 data (although Trimmomatic can't do this right now): Use Phred33

Thanks for following up.
kga1978 is offline   Reply With Quote
Old 12-09-2011, 04:14 AM   #7
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by kga1978 View Post
Got it. I reran some of the reads and most of them got better with phred64 (I mostly use trimming for adapters though - my aligner takes into consideration quality). However, as you said, really bad reads still fall off dramatically in the end - probably due to the sliding window.
How far in do you see the low bases, i.e below the threshold cut-off? Just the last few? Do your new plots look anything like the ones i posted?

Quote:
Originally Posted by kga1978 View Post
So, just to be clear - am I correct in the following?

Casava 1.3 - 1.7: Use Phred64
Casava 1.8+: Use Phred33
454 data (although Trimmomatic can't do this right now): Use Phred33
I believe so - though generally i verify by looking at the scores by eye, and checking here. Occasionally i've seen data in the 'wrong' phred because someone decided to be 'helpful'
tonybolger is offline   Reply With Quote
Old 12-09-2011, 05:15 AM   #8
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Actually, it's all good - the one that had a dramatic drop-off in the end, I had forgotten to change to phred64!

This is what the data looks like now:
kga1978 is offline   Reply With Quote
Old 08-02-2013, 02:36 AM   #9
aforntacc
Member
 
Location: italy

Join Date: Jun 2011
Posts: 48
Default

Quote:
Originally Posted by tonybolger View Post
If the data really is phred-64 but trimmomatic is told that it is phred33, trimmomatic will interpret each score as 31 higher than it really is - thus not really trimming much since the quality appears 'excellent'. I really should add a warning if the quality scores are outside the expected range, as this is nearly always caused by wrong phred-33/phred-64 selection, and results in either no trimming, or almost everything trimmed, depending on the direction of the mistake.

In any case, you really shouldn't see a significant percentage of the reads with base calls much below the sliding window threshold - e.g. in fastQC, the yellow bars should mostly be above, but the whiskers will tend to be below. On really bad data, you might also see the yellow bars drop in the last few bases, an artefact of 'under-testing' as the sliding window runs off the end of the reads - this is to be expected.

Here's an example of some really low quality data pre/post trimming, using sliding window 4 wide, quality 15.

Untrimmed Forward:

Untrimmed Reverse:

Trimmed Forward Paired:

Trimmed Forward Unpaired:

Trimmed Reverse Paired:

Trimmed Reverse Unpaired:

ok, i get this part very well, but my question is please if i want to use tophat for mapping which of these files should i use? (forward paired and reverse paired) what about the unpaired. i am new to trimmomatic and tophat sorry if this seems a stupid question.
thanks in advance
aforntacc is offline   Reply With Quote
Old 08-02-2013, 03:37 AM   #10
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default Trimmomatic quality trimming

I don't think Tophat and Bowtie will let you use paired reads and unpaired reads in the same run, so you would have to do 2 runs, one with the R1_paired.fastq and R2_paired.fastq files, and another run with the files containing the R1_unpaired.fastq and R2_unpaired.fastq reads.
mastal is offline   Reply With Quote
Old 08-13-2013, 10:39 PM   #11
ebioman
Member
 
Location: Switzerland

Join Date: Aug 2013
Posts: 41
Default How is quality score evaluated ?

Hi
I wondered whether anybody can explain me how the quality scores of the program
are actually calculated.
E.g. for the Lead-Trimming using a often cited value of 3 - obviously that won't be phred score. So what is it ?
ebioman is offline   Reply With Quote
Old 08-14-2013, 12:17 AM   #12
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by ebioman View Post
Hi
I wondered whether anybody can explain me how the quality scores of the program
are actually calculated.
E.g. for the Lead-Trimming using a often cited value of 3 - obviously that won't be phred score. So what is it ?
It's a phred score

Historically, the illumina pipeline occasionally created reads with one (or more rarely two) N base-calls at the start, and more often, a set of trailing B phred quality scores at the end. N-base calls are treated as zero phred score, and B are quality 2, so by trimming both ends for all scores below 3, these artefacts are removed.
tonybolger is offline   Reply With Quote
Old 08-14-2013, 12:20 AM   #13
ebioman
Member
 
Location: Switzerland

Join Date: Aug 2013
Posts: 41
Default

Thanks that was as short as informative ! I always thought it might be some other internal scores and tried desperately to reveal its calculation
ebioman is offline   Reply With Quote
Old 03-04-2015, 09:22 AM   #14
Laine
Junior Member
 
Location: Brazil

Join Date: Mar 2015
Posts: 1
Default

Quote:
Originally Posted by ebioman View Post
Thanks that was as short as informative ! I always thought it might be some other internal scores and tried desperately to reveal its calculation
I had just the same doubt!! Very informative indeed...
Laine is offline   Reply With Quote
Old 09-07-2015, 05:36 PM   #15
trimmoMe
Junior Member
 
Location: boston

Join Date: Sep 2015
Posts: 2
Smile need help with trimmomatics

Hi everyone,

I am have been having some issues with my command line for trimmomatics,

this is what ive been using:
java -jar /Users/omriadini/Desktop/Trimmomatic-0.33/trimmomatic-0.33.jar SE -threads 4 -trimlog /Users/omriadini/Desktop/156\ L001/L002.trimLog /Volumes/omri\ hard\ drive/Ally\'s\ stuff/Liron\'s\ Project/Raw\ Data/NRF2-1_S17/NRF2-1_S17_L001_R1_001.fastq trimmed.NRF2-1_S17_L001_R1_001.fastq ILLUMINACLIP:/Users/omriadini/Desktop/156\ L001/Truseq_NEBnext_adapter_sequences\ \(1\).txt:2:30:10 HEADCROP:12 MAXINFO:0:40:0.5 MINLEN:36

however, this is the response i get everytime:
ILLUMINACLIP: Using 0 prefix pairs, 48 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 7877072 Surviving: 0 (0.00%) Dropped: 7877072 (100.00%)
TrimmomaticSE: Completed successfully

I am not sure why it keeps dropping all of my reads,

any ideas?

thanks in advance
trimmoMe is offline   Reply With Quote
Old 09-08-2015, 07:46 AM   #16
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

I don't recognize the 'MAXINFO' parameter. Perhaps that is the problem?

In any case was does your trimLog say? Posting the first 10 lines could be useful.
westerman is offline   Reply With Quote
Old 09-08-2015, 07:51 AM   #17
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

... And I answered my own question about MAXINFO. It is a (newer) parameter that I don't use. However the manual I am looking at says that there are only two numbers associated with MAXINFO: <targetLength>:<strictness> You seem to have three numbers 0 ... 40 ... 0.5. That could be a problem.
westerman is offline   Reply With Quote
Old 09-08-2015, 11:29 AM   #18
trimmoMe
Junior Member
 
Location: boston

Join Date: Sep 2015
Posts: 2
Default

that solved it...
Thanks so much weterman!!
trimmoMe is offline   Reply With Quote
Old 09-16-2015, 06:38 AM   #19
FSeq
Junior Member
 
Location: koln

Join Date: Sep 2015
Posts: 1
Default

Quote:
Originally Posted by tonybolger View Post
It's a phred score

Historically, the illumina pipeline occasionally created reads with one (or more rarely two) N base-calls at the start, and more often, a set of trailing B phred quality scores at the end. N-base calls are treated as zero phred score, and B are quality 2, so by trimming both ends for all scores below 3, these artefacts are removed.
Does that mean that if we do LEADING:3 TRAILING:3 we only remove those artifacts? So, if we want to remove the bases with a phred score below 34, for example, we should write LEADING:34 TRAILING:34? What is the accepted phred score to keep bases? I noticed that for SLIDINGWINDOW, the parameters are 4:15, but I wondered if 15 was not too low as quality score? Thanks a lot!
FSeq is offline   Reply With Quote
Old 09-16-2015, 07:16 AM   #20
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by FSeq View Post
Does that mean that if we do LEADING:3 TRAILING:3 we only remove those artifacts? So, if we want to remove the bases with a phred score below 34, for example, we should write LEADING:34 TRAILING:34? What is the accepted phred score to keep bases? I noticed that for SLIDINGWINDOW, the parameters are 4:15, but I wondered if 15 was not too low as quality score? Thanks a lot!
It would depend on your application. Trimming, in general, is only required for de novo work -- assembly -- and not for mapping since mapping programs will tend to drop reads or parts of reads that do not match because of poor quality. Even some de novo assembly programs -- Mira comes to mind -- want untrimmed reads.

With that said, I would lead/trail remove quality 2 or less and slide window at the default 15. That will be a good starting point for most applications -- gets rid of the really bad stuff without knocking out potentially useful bases.
westerman is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:38 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO