**felvis56** · 06-18-2015, 04:58 AM

I wanted to ask a question about the quality trimming of sequences. I have Illumina reads and use Trim Galore to remove adapters and primers with good success.

Is the quality trimming based on the average Phred score of the read? Or, if I use a cutoff of 25 and some bases fall below 25, will the whole read be removed?

Thanks,

Fiona

**fkrueger** · 06-18-2015, 06:05 AM

Hi Fiona,

Quality trimming removes the low-quality portion of the read, but it does not remove the entire read (pair) completely. This is taken from Cutadapt --help:

Code:

 --quality-base=QUALITY_BASE
                        Assume that quality values are encoded as
                        ascii(quality + QUALITY_BASE). The default (33) is
                        usually correct, except for reads produced by some
                        versions of the Illumina pipeline, where this should
                        be set to 64. (Default: 33)
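To make the encoding rule in that help text concrete, here is a minimal Python sketch of how a FASTQ quality string decodes under ascii(quality + QUALITY_BASE). The function name `decode_phred` and the example string are made up for illustration; they are not part of Cutadapt.

```python
def decode_phred(quality_string, quality_base=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - quality_base for ch in quality_string]


# Made-up quality string, decoded with the default base of 33.
example = "II5+#"
print(decode_phred(example))  # [40, 40, 20, 10, 2]
```

With a cutoff of 25, only the last three bases of this made-up read score below the threshold. For old Illumina-pipeline files, the same function would be called with `quality_base=64`.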

**felvis56** · 06-18-2015, 06:14 AM

Thank you for the fast reply.

Apologies if this is a basic question. If a single base has Q<25 but the following bases are OK, will the read be cut at that point, or does a set number of bases need to be below the cutoff before the read is cut?

When I look at my fastq files on FastQC the error bars do dip below 20 but I was thinking this was due to a small number of bases over multiple reads.

Thanks

**fkrueger** · 06-18-2015, 06:16 AM

I think that if a single base or a few bases dip but the quality then recovers, the read will actually survive. This is a cumulative-score model rather than a hard per-base cutoff, so it isn't super harsh on the data.

**bluepoison** · 11-28-2015, 09:46 AM

Hi all,

This is my first time analysing sequencing data. I am having difficulties trimming the adapters/contaminants from the reads. I have got 50bp single paired read. I checked in FastQC that there are overrepresented sequences which are part of 'Illumina Paired End Adapter 2'. But if I trim using the whole 'Illumina Paired End Adapter 2', there will still be plenty of overrepresented sequences left!
Q1) In that case, how much should I trim?

I have these overrepresented sequences:
GATCGGAAGAGCGGTTCAGCAGG
GATCGGAAGAGCGGTTCAGCAGGA
GATCGGAAGAGCGGTTCAGCAGGAA
GATCGGAAGAGCGGTTCAGCAGGAAT
GATCGGAAGAGCGGTTCAGCAGGAATG
GATCGGAAGAGCGGTTCAGCAGGAATGC
GATCGGAAGAGCGGTTCAGCAGGAATGCC
GATCGGAAGAGCGGTTCAGCAGGAATGCCG
GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Illumina Paired End Adapter 2)

Also, there is another sequence that all the 'no hit' entries contain: 'GTTATTTTTTTGTTTTAGTTTTT'. I looked at the contaminant file and there is no match for it.
Q2) Should I trim this sequence without actually knowing where it comes from?

I planned to trim all the sequences from longest to shortest using cutadapt, because there is no way to trim multiple adapters at a time in Trim Galore. But later I will also use Trim Galore for quality trimming.
Q3)Is there any way to minimize these steps?

All the scenarios described above are true for all seven samples I analysed. Also, there is no way to know from the dataset which adapters were actually used.

Thanks a lot!

**fkrueger** · 11-28-2015, 01:46 PM

Hi bluepoison,

The sequence you are seeing overrepresented is most likely some kind of adapter dimer, because the sequence is lacking the leading A which it would get as a result of A-tailing the fragments. It is not normally required to trim adapter dimers specifically because they won't align to a reference genome anyway. You need to keep in mind though that the mapping efficiency will look worse because adapter dimers won't align.

It would be sufficient for Cutadapt as well as Trim Galore to just specify the first few bp, here GATCGGAAGAGCG, in order to trim all lengths of the occurring sequence. As I mentioned above, I would not bother though, because these sequences won't align anywhere.
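As a rough illustration of why the short prefix is enough, the Python sketch below checks that a few of the overrepresented sequences listed earlier in the thread all start with GATCGGAAGAGCG, and cuts a read at the first exact occurrence of that prefix. The helper `trim_3prime_adapter` and the example read are hypothetical. Cutadapt itself uses error-tolerant alignment and also handles partial adapter matches at the read end, which this simplified sketch does not.

```python
ADAPTER_PREFIX = "GATCGGAAGAGCG"

# Three of the overrepresented sequences reported earlier in the thread.
overrepresented = [
    "GATCGGAAGAGCGGTTCAGCAGG",
    "GATCGGAAGAGCGGTTCAGCAGGAATG",
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
]


def trim_3prime_adapter(read, adapter=ADAPTER_PREFIX):
    """Cut the read at the first exact occurrence of the adapter (simplified)."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


# Every length variant shares the same 13 bp prefix.
print(all(seq.startswith(ADAPTER_PREFIX) for seq in overrepresented))  # True

# Hypothetical read: 10 bp of insert followed by adapter read-through.
read = "ACGTTGCAAC" + "GATCGGAAGAGCGGTTCAGCAGGAATG"
print(trim_3prime_adapter(read))  # ACGTTGCAAC
```

A read that begins with the adapter, as an adapter dimer would, comes back as an empty string in this sketch, which is consistent with such reads contributing nothing to the alignment.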

Just generally, the overrepresented sequences plot in FastQC is meant as a quick guide for you to spot sequences that make up more than 0.1% of all sequences, but it doesn't mean you should remove all of them from your sequenced library, especially not if you don't actually know what the sequence is. It might be a biological effect after all.

In short: running Trim Galore in default mode will almost certainly do the right thing. Cheers, Felix

**bluepoison** · 11-28-2015, 03:48 PM

Hi Felix,

Thanks a lot for the quick response. It was really helpful for me.

I just performed a short experiment and wanted to share it with you. I randomly selected 1M reads and made the following 3 versions:
version 1: without any trimming
version 2: trimmed with Trim Galore with default settings
version 3: trimmed with Trim Galore with default settings, plus 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' trimmed with cutadapt

Results in terms of mapping efficiency after aligning with bismark b2:
version 1: 39.7%
version 2: 58.9%
version 3: 58.2%

When I checked the qualities in FastQC, even version 3 gave some very short (less than 10bp) overrepresented sequences as 'no hit'. So I guess it will always give some overrepresented sequences anyway, but I have to understand very well what I am trimming.

One notable thing here is that the efficiency has not improved from version 2 to version 3. Most of the overrepresented sequences have 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' as the first part and the standard Illumina paired-end adapter as the second part. So those sequences were already excluded from the alignment after version 2, which is why version 3 hasn't changed much.

By the way, I saw several posts containing 'felix is a great guy!'. Now it's making a lot more sense. Thanks again!

**fkrueger** · 11-28-2015, 03:52 PM

Oh dear, you should never post such things on the internet... but I'm glad it helped!

**Alex852013** · 12-03-2015, 08:34 AM

Understand the quality trimming

Hello everybody,

This is the first time I have tried to use trim_galore for quality trimming of paired-end reads.
I checked the sequencing settings with testformat.sh from BBMap, which gives me:
sanger fastq raw single-ended 150bp
I'm not sure why it reports single-ended, since the data are paired-end.

Before I did the quality trimming, I checked the reads with FastQC.
The program didn't find any adapter sequences (I guess they were already cut by the sequencing service) and showed the following pictures.

Picture before quality trimming:

This is the line i used for trimming on unix command line.
trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

Picture after quality trimming:

I had expected that everything with a quality below 20 would be cut. So either I am misinterpreting something or I did something wrong.
Could someone please tell me which it is?
Thanks a lot, Alex

**GenoMax** · 12-03-2015, 08:43 AM

Originally posted by Alex852013:

Hello everybody,

This is the line i used for trimming on unix command line.
trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

So either I am misinterpreting something or I did something wrong.
Could someone please tell me which it is?
Thanks a lot, Alex

You appear to be running trim_galore incorrectly. Instead of trying to redirect the output (>) to a file you need to specify an output directory location by using a -o directory_path.

@felix will confirm. I don't use trim_galore.

Edit: Looking at trim_galore manual -o is not strictly needed. Program will use the current directory by default.

Edit2: @felix clarified the effect of output redirect in the post below.

**fkrueger** · 12-03-2015, 08:55 AM

Trim Galore should derive its output file names from the input file names, so the redirect will only send whatever would otherwise be printed to the screen into a file. That is not overly useful, but it won't do any harm.

The algorithm used to trim qualities is described under the Cutadapt option -q:

Code:

-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF
                        Trim low-quality bases from 5' and/or 3' ends of reads
                        before adapter removal. If one value is given, only
                        the 3' end is trimmed. If two comma-separated cutoffs
                        are given, the 5' end is trimmed with the first
                        cutoff, the 3' end with the second. The algorithm is
                        the same as the one used by BWA (see documentation).
                        (default: no trimming)

This means that the qualities are assessed as a running sum from the 3' end of the read, and the read is trimmed at the position where that sum is lowest. If I understand this correctly, a read may temporarily 'dip' below the threshold you have selected but still survive if the quality comes back up afterwards. So occasionally you might get a few scores that are lower than 20, but I personally wouldn't be too worried about it, as most downstream programs have their own means of dealing with low-quality basecalls.
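For anyone who wants to see the effect on actual numbers, here is a rough Python sketch of the BWA-style 3' trimming that the help text refers to. It follows the steps as I understand them from the Cutadapt documentation: subtract the cutoff from every quality, sum from the 3' end, and cut where that sum is lowest. The function name `quality_trim_index` and the example scores are illustrative assumptions, not output from Trim Galore or Cutadapt.

```python
def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the 3' end of a read.

    qualities: list of integer Phred scores, ordered 5' to 3'.
    The read is kept up to (not including) the returned index.
    """
    running_sum = 0
    best_sum = 0
    cut_index = len(qualities)
    # Walk from the 3' end, accumulating (cutoff - quality).
    for i in reversed(range(len(qualities))):
        running_sum += cutoff - qualities[i]
        if running_sum < 0:
            # Good bases now outweigh the bad ones seen so far: stop.
            break
        if running_sum > best_sum:
            best_sum = running_sum
            cut_index = i
    return cut_index


# A single low base in the middle of a good read: nothing is trimmed.
dip = [35, 35, 12, 35, 35, 35]
print(quality_trim_index(dip, 20))  # 6

# A poor 3' tail is removed, but the isolated 18 before it is kept.
tail = [35, 35, 35, 18, 25, 10, 8]
cut = quality_trim_index(tail, 20)
print(tail[:cut])  # [35, 35, 35, 18, 25]
```

The second example shows how a score below the cutoff (the 18) can remain in a trimmed read, which would explain FastQC still showing some values under 20 after trimming with -q 20.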

**Alex852013** · 12-04-2015, 06:50 AM

Thanks a lot

Thanks a lot, i guess i can go on on my own now!

**whargrea** · 12-21-2015, 01:56 PM

Hi,

I've been going through the documentation and searching forum threads etc. looking to see if trim_galore can be run in a multi-core multi-thread manner. So far the total lack of information in this regard seems to point towards it not having such a capability.

I'm not sure if this is the appropriate place to ask but I was wondering why this is the case? I have 48 files of ~120mil reads each that I need to perform trimming on and being able to parallelize would greatly boost the speed at which this could be done. It seems to me that since each read is trimmed independently trimming software should easily scale to any number of cores. Am I correct in this assumption or am I missing something?

Cheers.

**fkrueger** · 12-22-2015, 06:02 AM

Hi whargrea, the absence of documentation for parallelization does indeed mean that reads are trimmed by calling a single instance of Cutadapt at a time. Since trimming is a one-off process that doesn't really take that long (a matter of hours) compared to the data collection process (often a matter of several days) or other downstream operations (up to several weeks?), we don't tend to bother about it very much. The easiest solution would probably be to run all your 48 trims in parallel (even though this might be quite intense on the disk I/O side), or to try to find another trimmer that supports parallel trimming natively.
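One possible way to launch the jobs side by side is sketched below in Python. It simply starts one default `trim_galore` process per file and caps how many run at once, which may help with the disk I/O concern. The helper names `build_commands` and `run_all`, the worker count, and the file names are assumptions for illustration only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(fastq_files):
    """One trim_galore command (default settings) per input file."""
    return [["trim_galore", path] for path in fastq_files]


def run_all(fastq_files, max_workers=4):
    """Run the commands concurrently and return the exit code of each job."""
    commands = build_commands(fastq_files)
    # Threads are enough here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


if __name__ == "__main__":
    # Hypothetical file names; replace with the real 48 FASTQ files.
    files = [f"sample_{i:02d}.fastq.gz" for i in range(1, 49)]
    print(len(build_commands(files)))  # 48
    # run_all(files, max_workers=4)  # uncomment where trim_galore is installed
```

Lowering `max_workers` trades speed for less pressure on the disk; a non-zero value in the returned list would flag a job that needs a closer look.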

**Rob Weeks** · 01-19-2016, 08:09 PM

I have only just begun to look at RRBS data. I am trying to use trim_galore to quality trim and adaptor trim my sequences. I am doing this in OS X.
Now when I run 'trim_adaptor filename.fastq.gz' it returns the error 'zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory'.
This is apparently a problem only in OS X, but it is not clear to me how I can get around this problem.

Any ideas would be appreciated

Cheers
