SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
trim adapter from Illumina Genome Analyzer IIe miRNA reads NicoBxl Bioinformatics 5 01-02-2014 05:31 AM
Checking the Quality of RRBS libraries before actually running them twang11 Sample Prep / Library Generation 0 02-22-2012 04:18 PM
trim 3' adapter sequence for mRNA-Seq? slny Bioinformatics 14 06-14-2011 06:15 AM
csfasta quality hard trimming do i need to hard trim the qual file? KevinLam Bioinformatics 2 05-13-2010 02:27 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 12:53 PM

Reply
 
Thread Tools
Old 11-28-2015, 12:46 PM   #81
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

Hi bluepoison,

The sequence you are seeing overrepresented is most likely some kind of adapter dimer because the sequence is lacking the leading A which it would get as a result of A-tailing the fragments. It is not normally required to trim adapter dimers specifically because they won't align to a reference genome anyway. You need to keep in mind though that the mapping efficiency will look worse because adapter primers won't align.

It would be sufficient for Cutadapt as well as Trim Galore to just specify the first couple of bp, here GATCGGAAGAGCG, in order to trim all lengths of the occurring sequence. As I mentioned above I would not bother though because these sequences won't align anywhere anywhere.

Just generally, the overrepresented sequences plot in FastQC is meant as a quick guide for you to spot sequences that are present in more than 0.1% of case but doesn't mean you should remove all of them from your sequenced library - especially not if you don't actually know what the sequence is. It might be a biological effect after all.

In short: running Trim Galore in default mode will almost certainly do the right thing. Cheers, Felix
fkrueger is offline   Reply With Quote
Old 11-28-2015, 02:48 PM   #82
bluepoison
Junior Member
 
Location: Detroit

Join Date: Nov 2015
Posts: 4
Default

Hi Felix,

Thanks a lot for quick response. It was really helpful for me.

I just performed a short experiment. Just wanted to share with you. I randomly pooled 1M reads, and made 3 following versions:
version 1: without any trimming
version 2: trim with Trim Galore with default settings
version 3: trim with Trim Galore with default settings and trim 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' with cutadapt.

Results in terms of efficiency after aligning with bismark b2:
version1: 39.7%
version2: 58.9%
version3: 58.2%

When I checked the qualities in FASTQC, even in version 3, it gave some very short (less than 10bp)overrepresented sequences as 'no hit'. So I guess it will always give some overrepresented sequences anyway but I have to understand very well what am I trimming.

One notable thing here is that the efficiency has not improved from version 2 to version 3. Most of the overrepresented sequences has the first part as 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' and second part as the basic standard Illumina paired-end adapter. So those sequences are already rejected from the alignment just after the doing the version 2. That's why version 3 hasn't change that much.

btw I saw several posts containing 'felix is a great guy!'. Now its making a lot more sense. thanks again!
Quote:
Originally Posted by fkrueger View Post
Hi bluepoison,

The sequence you are seeing overrepresented is most likely some kind of adapter dimer because the sequence is lacking the leading A which it would get as a result of A-tailing the fragments. It is not normally required to trim adapter dimers specifically because they won't align to a reference genome anyway. You need to keep in mind though that the mapping efficiency will look worse because adapter primers won't align.

It would be sufficient for Cutadapt as well as Trim Galore to just specify the first couple of bp, here GATCGGAAGAGCG, in order to trim all lengths of the occurring sequence. As I mentioned above I would not bother though because these sequences won't align anywhere anywhere.

Just generally, the overrepresented sequences plot in FastQC is meant as a quick guide for you to spot sequences that are present in more than 0.1% of case but doesn't mean you should remove all of them from your sequenced library - especially not if you don't actually know what the sequence is. It might be a biological effect after all.

In short: running Trim Galore in default mode will almost certainly do the right thing. Cheers, Felix
bluepoison is offline   Reply With Quote
Old 11-28-2015, 02:52 PM   #83
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

Oh dear, you should never post such things on the internet... but I'm glad it helped!
fkrueger is offline   Reply With Quote
Old 12-03-2015, 07:34 AM   #84
Alex852013
Member
 
Location: Germany

Join Date: Jan 2013
Posts: 17
Default Understand the quality trimming

Hello everybody,

it is the first time i try to use trim_galore for quality trimming of paired end reads.
I checked for the sequencing settings with testformat.sh from BBMap which gives me:
sanger fastq raw single-ended 150bp
I'm not sure why there single-ended comes as an output, since it was paired-end.

Before i did the quality trimming, i checked with FastQC.
The programm didn't find adapter sequences any more (i guess they were already cut by the sequencing service) and showed the following pictures

Picture before quality trimming:



This is the line i used for trimming on unix command line.
trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

Picture after quality trimming:



I had expected, that everything with a quality below 20 would be cut. Therefore i either missinterpret something or i did something wrong.
May please someone tell me what it is?
Thanks a lot, Alex

Last edited by Alex852013; 12-03-2015 at 07:43 AM.
Alex852013 is offline   Reply With Quote
Old 12-03-2015, 07:43 AM   #85
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,888
Default

Quote:
Originally Posted by Alex852013 View Post
Hello everybody,

This is the line i used for trimming on unix command line.
trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

Therefore i either missinterpret something or i did something wrong.
May please someone tell me what it is?
Thanks a lot, Alex
You appear to be running trim_galore incorrectly. Instead of trying to redirect the output (>) to a file you need to specify an output directory location by using a -o directory_path.

@felix will confirm. I don't use trim_galore.

Edit: Looking at trim_galore manual -o is not strictly needed. Program will use the current directory by default.

Edit2: @felix clarified the effect of output redirect in the post below.

Last edited by GenoMax; 12-03-2015 at 07:59 AM.
GenoMax is offline   Reply With Quote
Old 12-03-2015, 07:55 AM   #86
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

Trim Galore should derive its output files from the filenames, so this will only redirect any other output to the screen to a file, so not overly useful but it won't harm.

The trimming algorithm to trim qualities is described in the Cudatapt option -q:

Code:
-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF
                        Trim low-quality bases from 5' and/or 3' ends of reads
                        before adapter removal. If one value is given, only
                        the 3' end is trimmed. If two comma-separated cutoffs
                        are given, the 5' end is trimmed with the first
                        cutoff, the 3' end with the second. The algorithm is
                        the same as the one used by BWA (see documentation).
                        (default: no trimming)
This means that the qualities are assessed in windows over the read, and trimmed at a position where the score is lowest. If I understand this correctly then a read may temporarily 'dip' below the threshold you have selected, but allow the sequence to survive it the quality comes back up afterwards. So occasionally you might get a few scores that are lower than 20bp, but I personally wouldn't too worried about it as most downstream programs have their own means of dealing with low quality basecalls.
fkrueger is offline   Reply With Quote
Old 12-04-2015, 05:50 AM   #87
Alex852013
Member
 
Location: Germany

Join Date: Jan 2013
Posts: 17
Default Thanks a lot

Thanks a lot, i guess i can go on on my own now!
Alex852013 is offline   Reply With Quote
Old 12-21-2015, 12:56 PM   #88
whargrea
Junior Member
 
Location: Guelph, Ontario

Join Date: Dec 2015
Posts: 1
Default

Hi,

I've been going through the documentation and searching forum threads etc. looking to see if trim_galore can be run in a multi-core multi-thread manner. So far the total lack of information in this regard seems to point towards it not having such a capability.

I'm not sure if this is the appropriate place to ask but I was wondering why this is the case? I have 48 files of ~120mil reads each that I need to perform trimming on and being able to parallelize would greatly boost the speed at which this could be done. It seems to me that since each read is trimmed independently trimming software should easily scale to any number of cores. Am I correct in this assumption or am I missing something?

Cheers.
whargrea is offline   Reply With Quote
Old 12-22-2015, 05:02 AM   #89
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

Hi whargrea, the absence of documentation for parallelization does indeed mean that reads are trimmed by calling a single instance of Cutadapt at a time. Since trimming is a one-off process that doesn't really take that long (a matter of hours) compared to the data collection process (often a matter of several days) or other downstream operations (up to several weeks?) we don't tend to bother about it very much. The easiest solution would probably to run all your 48 trims in parallel (even though this might be quite intense on the disc I/O part), or try to find another trimmer that supports parallel trimming natively.
fkrueger is offline   Reply With Quote
Old 01-19-2016, 07:09 PM   #90
Rob Weeks
Junior Member
 
Location: New Zealand

Join Date: Jan 2016
Posts: 2
Default

I have only just begun to look at RRBS data. I am trying to use trim_galore to quality trim and adaptor trim my sequences. I am doing this in OS X.
Now when I run 'trim_adaptor filename.fastq.gz' it returns an error due to 'zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory".
This is apparently a problem only in OS X, but it is not clear to me how I can get around this problem.

Any ideas would be appreciated

Cheers
Rob Weeks is offline   Reply With Quote
Old 01-19-2016, 10:38 PM   #91
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

Quote:
Originally Posted by Rob Weeks View Post
I have only just begun to look at RRBS data. I am trying to use trim_galore to quality trim and adaptor trim my sequences. I am doing this in OS X.
Now when I run 'trim_adaptor filename.fastq.gz' it returns an error due to 'zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory".
This is apparently a problem only in OS X, but it is not clear to me how I can get around this problem.

Any ideas would be appreciated

Cheers
This problem is indeed known and has to do with the version of zcat which is installed on your system. On very old versions it would alter the filename passed to it to append a .Z if there wasn’t one there, so although you’re trying to read a file called Sample1_PE_R1.fastq.gz it’s actually trying to read Sample1_PE_R1.fastq.gz.Z (which doesn’t exist).

I have changed the way Trim Galore reads from files from using a cat stream to using gunzip -c and it seems to work well. I can send you a copy of this tonight as I am at the Festival of Genomics in London all day if you send me an email. Alternatively you could try to change the filename of your input to end in .gz.Z and try that?
Good luck, Felix
fkrueger is offline   Reply With Quote
Old 01-19-2016, 10:41 PM   #92
Rob Weeks
Junior Member
 
Location: New Zealand

Join Date: Jan 2016
Posts: 2
Default

Never mind, I realised that decompressing my file and then running trim_galore will bypass zcat. It then works.

Cheers

Rob

EDIT: thanks Felix. I wrote this before I saw your response - I was "never minding" my question not your response!

I have a solution which works; I will decompress before running trim_galore

Last edited by Rob Weeks; 01-19-2016 at 10:46 PM. Reason: crossed responses
Rob Weeks is offline   Reply With Quote
Old 01-19-2016, 10:45 PM   #93
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

That would work, too, yes. Let me know if you've got any further questions.
fkrueger is offline   Reply With Quote
Old 02-06-2016, 10:16 AM   #94
rakarnik
Junior Member
 
Location: Cambridge, MA, USA

Join Date: Feb 2014
Posts: 1
Default *_trimmed vs. *_val

Hi,
I am using trim_galore in "--paired" mode. I know that trim_galore runs a second validation step after first running through the first and second reads separately, and is creating *val files as a result. Do these get renamed back to the *trimmed files afterwards, or should downstream analyses use the *val files?
Thanks for the help!
rakarnik is offline   Reply With Quote
Old 02-06-2016, 11:15 AM   #95
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

The temporarily created trimmed.fq files get deleted in the end, leaving only the _val_ files left for further analysis.
fkrueger is offline   Reply With Quote
Old 03-20-2016, 03:03 PM   #96
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Default

Dear Felix and fellow trim_galore users,

I am having a problem/confusion regarding trimming of Illumina adapters from paired-end bisulfite (and some non-bisulfite) samples of a whole-genome bisulfite sequencing project.
I am not sure what is going on here, and how I should process my reads.

First, running FastQC on my raw read pairs, I see both:
  • adapter contamination increasing toward the 3' of both forward and reverse reads (link (forward))
  • K-mer content clearly indicating over-representation of the reverse complemented Illumina adapter sequence in both forward and reverse reads (link (forward))

When I use trim_galore using either auto-detect or --illumina options:
  • trim_galore detects the adapter sequence in ~53% of the reads, making FastQC extremely happy with a perfect absence of adapter contamination in the trimmed files
  • reverse complemented adapter sequence still show up in the K-mer content
Basically, trim_galore cleanly removes any trace of the adapter sequence (AGATCGGAAGAGC), and ignores its reverse complement (well, it did its job!)

Without much surprise, when I told trim_galore to trim the reverse complement of the adapter sequence, the result of FastQC is:
  • trim_galore detects/removes the sequence in ~27% of the reads
  • Adapter contamination still detected toward the 3' of the reads
  • K-mer content does not show any more sign of reverse complemented adapter over-representation (link (forward))

A final piece of information that might help is that "grep"ing the adapter and rev_comp_adapter sequences in my raw reads, I observed the following:
  • the adapter sequence is typically found a unique time per read (when found), generally toward the 3' end
  • the reverse complemented sequence can be found multiple times within a single read (typically 1-3 times), generally toward the 5' of the read
Respective examples:
AGAATTCTGGTCTGTCCTATGGTTGTATGGCCCAGAACCATCCTCTTTTTGCTTTTCGTCAGCTCTCAGTTGGTTTCTCAGGGCCAGGTATTTGAAGGGCCTTGGGGTGATGTCATCCTTTAGTAATGACAGATCGGAAGAGCACACGTC
and
AATAGGGTGTGCTCTTCCGATCTATTCTGGTGTGCTCTTCCGATCTTGGCCGGTGTGCTCTTCCGACTTCTCTGTGACCCCAAGATCGGAAGAGCACACCGTGAAGCCCTCAAGAATGGGATACTGCCCTTTAAAAGAGGCCCCCGGGAG


In a nutshell, my question is: "What procedure is recommended is this case?"

I tried to do my homework, and saw earlier in this thread that trim_galore cannot be given multiple adapter sequences (I was thinking of giving both adapter and rev_com).

Any piece of advice would be greatly appeciated !
Kind regards,
Kevin
kevinrue is offline   Reply With Quote
Old 03-21-2016, 02:41 AM   #97
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Default

Hi Kevin,

thanks for providing so many details for your question, you don't get this very often...

As a general recommendation I would suggest that you simply run Trim Galore in its default mode (which you already did and which seems to have worked just fine), and just ignore the reverse complement of the adapter.

If you draw out a sketch of how libraries are constructed you will see that both reads (R1 and R2 of a library) should have a single copy of the sequence AGATCGGAAGAGC which then further extends into the rest of either the R1 or R2 adapter sequence (the sequence will then divert for R1 and R2, but for trimming purposes it doesn't matter how the sequence continues). You will further see that no sequence should have the reverse complement of the adapter sequence as a result of read-through adapter contamination. I would expect to find this reverse complemented sequence only as part of some weird processes that happened during library preparation, e.g. adapter or primer dimerisation or concatenation etc. (it should not be present in mammalian genomes at least).

The only thing that slightly caught me out is the occurrence of 27% which seems very high. This might mean that either you've got quite a lot dimerisation happening, or could it be that this is an aggregate number of all kinds of shorter sequences that have been removed, like G, GC, GCT, GCTC,....? In this case the sequences are probably just some random genomic sequences. If you run
Code:
grep -c GCTCTTCCGATCT your_file
is the number really that high?

In any case, I would recommend you just run Trim Galore in default mode and then proceed straight to mapping, I am pretty sure that it is the right thing to do.
Best wishes, Felix
fkrueger is offline   Reply With Quote
Old 03-21-2016, 03:42 AM   #98
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Default

Dear Felix,

Thanks for the quick answer, always appreciated!

I ran "grep -c GCTCTTCCGATCT" on the first million forward reads and got 5597 (0.56%)
To be comprehensive, I ran it on the first million reverse reads and got 10258 (1%)

By the way, you certainly understood that the 53% and 27% values in my previous post are those reported by trim_galore.

Now, I wonder where that the massive difference (27% reported by trim_galore; 0.56% exact matches grep-ed) comes from: is it due to partial rev_comp_adapter detected... although this seems a huge difference (see below for a comparison with the original adapter sequence), especially considering that the reverse complemented sequence tends to appear at the 5' of the read and therefore truncation should not be an issue for detection.
Illustration of trim_galore report for the rev-comp sequence:

Adapter sequence: 'GCTCTTCCGATCT' ()
Maximum trimming error rate: 0.1 (default)
Optional adapter 2 sequence (only used for read 2 of paired-end files): 'GCTCTTCCGATCT'
Minimum required adapter overlap (stringency): 1 bp
...
Total reads processed: 9,812,098
Reads with adapters: 2,662,297 (27.1%)
Reads written (passing filters): 9,812,098 (100.0%)


Also, running "grep -c AGATCGGAAGAGC" on the first million forward and reverse reads, respectively, returns:
243583 (24.4%)
238858 (23.9%)
In this case, I am confident that the difference between 53% (trim_galore) and 25% (grep) is due to partially adapter sequence.


I am looking at TruSeq® DNA Methylation Kit and can clearly how the adapter sequence, ligated to the 3' of ssDNA fragments, can appear by reading through short fragments.
Conversely, the rev_comp should clearly not appear in this context. I am not fully familiar with the chemistry of adapters, but the only way I can picture the rev_comp being sequenced is if the adapter gets ligated to the 5' of a fragment (or another adapter).

I guess I will do as you suggest:
  • align the adapter-trimmed paired reads
  • FastQC the aligned read mates, hoping that the reads showing rev_comp sequences did not map, and therefore do not appear in the K-mer content plot

By the way: we were already in touch about this project a while ago, because I had surprisingly low mapping efficiency (link). That might have been the issue. I just had to park the whole project for a while, and only got back to it recently.

Thanks!
kevinrue is offline   Reply With Quote
Old 03-21-2016, 03:58 AM   #99
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 614
Arrow

Hi Kevin,

The 27% of sequences that get trimmed is an aggregate value, which is calculated by the sum of sequences that had just anything trimmed, even if it as low as 1, 2, 3 bp etc. The Trim Galore report would give you the whole picture of what has been trimmed, but it normally looks something like this:

Code:
sequence trimmed
G             1000000
GC             600000
GCT             56671
GCTC             4232
...
GCTCTTCCGATCT       4
Good luck with resuming your projects!

By the way we have started aggregating sequencing related Fails at the QC Fail website:, might be worth visiting every now and then to see if there is anything useful that might explain problems you are seeing. Best, Felix
fkrueger is offline   Reply With Quote
Old 04-14-2016, 06:41 AM   #100
bowen
Junior Member
 
Location: SouthEast

Join Date: Sep 2009
Posts: 9
Question

Dear Dr. Krueger,
Thanks for a wonderful piece of stoftware.
I am running trim_galore on paired end reads, new version on mac os x, seems that python stops after 30M seqs processed sometimes? i have 64GB of RAM. it will stop and not finish. i get an output for the first fq of the pair but never a second one. i have gotten one set of PE reads to finish and create two trimmed .fq files. so I thought I had it all working. not sure why it may have stopped. maybe just a simple command error. not sure. thanks again.
command pasted below
best,
Nathan Bowen

CHR1:RNASeq_working nbowen$ trim_galore index21_GTTTCG_L001-L002_R1_001.fastq index21_GTTTCG_L001-L002_R2_001.fastq --path_to_cutadapt /Users/nbowen/Library/Python/2.7/bin/cutadapt --paired
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: '/Users/nbowen/Library/Python/2.7/bin/cutadapt' (user defined)
1.9.1
Cutadapt seems to be working fine (tested command '/Users/nbowen/Library/Python/2.7/bin/cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> index21_GTTTCG_L001-L002_R1_001.fastq <<)

Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Illumina 239964 AGATCGGAAGAGC 1000000 24.00
smallRNA 22 TGGAATTCTCGG 1000000 0.00
Nextera 12 CTGTCTCTTATA 1000000 0.00
Using Illumina adapter for trimming (count: 239964). Second best hit was smallRNA (count: 22)

Writing report to 'index21_GTTTCG_L001-L002_R1_001.fastq_trimming_report.txt'

SUMMARISING RUN PARAMETERS
==========================
Input filename: index21_GTTTCG_L001-L002_R1_001.fastq
Trimming mode: paired-end
Trim Galore version: 0.4.1
Cutadapt version: 1.9.1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp

Writing final adapter and quality trimmed output to index21_GTTTCG_L001-L002_R1_001_trimmed.fq


>>> Now performing quality (cutoff 20) and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file index21_GTTTCG_L001-L002_R1_001.fastq <<<
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
bowen is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:39 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO