SEQanswers

01-22-2016, 06:06 AM   #1
kostask
Junior Member
 
Location: Greece

Join Date: Sep 2015
Posts: 8
Adapter_and_kmer_trimming

Hi everyone,

I am using publicly available 51 bp paired-end RNA-seq data, and I have some questions about quality trimming the data before passing them to TopHat2 for mapping.

Specifically, I do not know which adapters were used, so I ran FastQC and then used trim_galore to remove the default Illumina adapter "AGATCGGAAGAGC" and one overrepresented sequence "CTTTGTGTTTGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT".

It is really important to remove as much adapter contamination as possible, because my analysis aims to discover variants that may correspond to RNA editing rather than to study gene expression.

So my questions are:

1) FastQC still reports 3 kmers, with counts ranging from 500 to 1800, that can be found within Illumina adapters. They are reported in the middle of the reads (positions 20, 34 and 41). Each one occurs in a different adapter and in an RNA PCR Index primer.

Should I use trim_galore to remove these kmers from my reads?

Kmers reported by FastQC that are found in Illumina adapters, marked in red boxes:

SRR1524292_1_val_1_fastqc_kmers.png

2) I have already removed these kmers with trim_galore and trimmed the reads to improve the Per base sequence content.

However, the kmer GTACGTA still appears in my FastQC report, and this kmer can be found in the TruSeq Adapter, Index 22. This adapter begins with GATCGGAAGAGC and should have been removed during the first step, trim_galore --illumina.

Should this kmer be removed as well?

In general, is it possible for kmers to be found within Illumina adapters by chance?

3) After applying trim_galore --illumina, the Per base sequence content at the 3' end of the reads starts to diverge, and it gets worse every time I remove a sequence.

Is this because the reads now differ in length, since some are trimmed more than others? (Read length was 51 before trimming and ranges from 20 to 51 after.)

Should I trim the 3 prime end of the reads in this case?

Data before trimming: Per_base_sequence_content_SRR1524292_1.png

Data after trim_galore --illumina: Per_base_sequence_content_SRR1524292_1_after_trim_galore--illumina.png

Data after trim_galore a) --illumina, b) kmers and c) overrepresented sequence: Per_base_sequence_content_after_removing_--illumina_kmers_overrepresented_seq.png

4) My last question: is TopHat2 going to have a problem aligning paired-end reads with lengths ranging from 20 to 32?
01-22-2016, 06:33 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,814

I would recommend trying BBDuk from BBMap. @Brian includes all the common adapters you are likely to run into in the "resources" directory of the BBMap download, so they can all be scanned at the same time without you having to provide them ad hoc.

I would not worry about the kmers (unless you see an issue after alignment), since they may be a real part of the data.

While you are using TopHat, go ahead and try BBMap (it is splice-aware) as an alternative aligner.
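To put the "kmers may be real data" point in rough numbers, here is a minimal Python sketch, assuming uniformly random sequence (which real transcriptomes are not). It estimates what fraction of all possible kmers occur in a set of adapter sequences, and how often a single kmer would be expected in the reads by chance. The second adapter string and the read count in the usage lines are made-up placeholders, not values from this dataset.

```python
def kmers(seq, k):
    """Return the set of all k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def adapter_kmer_fraction(adapters, k):
    """Fraction of all 4**k possible k-mers found in at least one adapter."""
    covered = set()
    for adapter in adapters:
        covered |= kmers(adapter.upper(), k)
    return len(covered) / 4 ** k


def expected_random_count(n_reads, read_len, k):
    """Expected occurrences of one specific k-mer in uniformly random reads."""
    return n_reads * (read_len - k + 1) / 4 ** k


# Placeholder inputs: the first adapter is the one quoted in this thread,
# the second is an invented sequence used only to make the example run.
adapters = ["AGATCGGAAGAGC", "ACGTTGCAACGTTAGGCATC"]
print(adapter_kmer_fraction(adapters, 7))

# Hypothetical library of 1,000,000 reads of 51 bp.
print(expected_random_count(1_000_000, 51, 7))
```

With k=7 there are only 4**7 = 16,384 possible kmers, so a large adapter collection covers a noticeable share of them, and a 7-mer seen in real data can coincide with an adapter without being adapter-derived.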
01-22-2016, 01:51 PM   #3
kostask
Junior Member
 
Location: Greece

Join Date: Sep 2015
Posts: 8

Thank you GenoMax, I will be sure to check out BBMap.

Quote:
I would not worry about the kmers (unless you see an issue after alignment), since they may be a real part of the data.
So your opinion is that the presence of at least some of these kmers in Illumina adapters is random? Or that it will pose no problem?
01-22-2016, 06:58 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

TrimGalore appears to be overly aggressive in trimming the ends of the reads, trimming on a match of only a few bp at the very end, or something similar. BBDuk's recommended setting of "mink=11" avoids this by requiring a sequence match of at least 11 bp at the end. The histogram of the raw data showed no evidence of adapter contamination, but I still recommend trimming, since there is always some; just not with such aggressive settings, as they will introduce bias.
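The effect of the minimum match length can be illustrated with a small Python sketch. This is a simplified exact-match trimmer written for illustration only; it is not the algorithm used by TrimGalore or BBDuk (BBDuk is kmer-based and can tolerate mismatches). The function name `trim_3prime` and the example reads are made up.

```python
def trim_3prime(read, adapter, min_overlap):
    """Trim an adapter from the 3' end of a read using exact matching.

    A full adapter occurrence anywhere in the read is removed together with
    everything after it. Otherwise, a prefix of the adapter sitting at the
    very end of the read is removed if it is at least min_overlap bases long.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Partial adapter at the 3' end: try the longest possible overlap first.
    max_len = min(len(adapter) - 1, len(read))
    for n in range(max_len, min_overlap - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read


ADAPTER = "AGATCGGAAGAGC"  # adapter sequence quoted earlier in the thread

# Made-up read with no adapter that happens to end in "AGA".
genomic = "TTGACCGTAGGCTAACGTTAGA"
loose = trim_3prime(genomic, ADAPTER, min_overlap=1)    # loses "AGA"
strict = trim_3prime(genomic, ADAPTER, min_overlap=11)  # left untouched

# Made-up read carrying 12 real adapter bases at its 3' end.
contaminated = "TTGACCGTAGGC" + ADAPTER[:12]
cleaned = trim_3prime(contaminated, ADAPTER, min_overlap=11)
```

With a minimum overlap of 1, any read that merely ends in "A" loses that base, which under a uniform-base assumption is roughly a quarter of all reads. That kind of systematic removal would skew the base composition at the 3' end, while a longer required match still removes genuine adapter remnants like the one in the second read.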
Tags
adapter remove, fastqc, kmers, tophat 2, trim galore

All times are GMT -8.