SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Using Geneious to analyse Whole Genome Sequencing data Morgane_AUS Bioinformatics 5 08-27-2014 12:27 AM
copy number changes from whole genome sequencing data adrian Bioinformatics 1 07-14-2014 02:41 PM
problem with low coverage genome sequencing wrch General 4 05-02-2014 11:56 AM
unique kmers in genome equation? JueFish Bioinformatics 0 10-10-2013 01:52 PM
how to analyze whole genome sequencing data (new genome assembling)? metheuse Genomic Resequencing 2 04-19-2013 06:06 PM

Reply
 
Thread Tools
Old 08-28-2017, 01:16 AM   #1
DeondeJager
Junior Member
 
Location: Pretoria, South Africa

Join Date: Jun 2017
Posts: 3
Default 5' Kmers problem with whole genome re-sequencing data

Hi all,

I am having some issues with getting rid of kmers in my sequencing data sets. We have re-sequenced several African buffalo genomes on an Illumina HiSeq X machine, 150bp paired-end reads. I ran FastQC on the raw reads and all samples pass the adapters and overrepresented sequences tests, but some fail the kmer test (length = 7 nucleotides). The samples either have kmers overrepresented at the 5' or 3' end. I can see that at least the 3' kmers are actually part of the Illumina adapters used (but FastQC is not picking these up in the adapter test), but the 5' kmers do not seem to be part of the adapters, as far as I can tell.

Trimmomatic successfully removes the 3' kmers of one sample, but not the 5' kmers of another sample. The same goes for Trim Galore (cutadapt), which can only remove adapter sequences from the 3' end of the reads (as far as I can tell from the cutadapt documentation). The per base quality is almost always above phred=20. This is the relevant part of the code I used for Trimmomatic and Trim Galore:

Trimmomatic: ILLUMINACLIP:/apps/chpc/bio/trimmomatic/0.36/bin/adapters/TruSeq3-PE-2.fa:2:30:10:1:true SLIDINGWINDOW:4:20 MINLEN:36

Trim Galore: trim_galore -q 20 --phred33 --stringency 1 -e 0.1 --length 36 -o /mnt/lustre/users/djager/trim_galore_out --paired --retain_unpaired -r1 37 -r2 37

Is there any way I can remove the 5' kmers without having to trim the 5' nucleotides of all the reads? Are these 5' kmers perhaps biological sequences?

The FastQC results are attached. The examples are all for the forward reads.
Attached Images
File Type: png Sample1_raw_Kmer profiles.png (98.2 KB, 6 views)
File Type: jpg Sample2_raw_Kmer profiles.jpg (74.9 KB, 7 views)
File Type: png Sample2_raw_Adapter content.png (106.6 KB, 4 views)
File Type: jpg Sample2_trimmomatic_Kmer profiles.jpg (77.7 KB, 6 views)
File Type: jpg Sample2_trim_galore_Kmer profiles.jpg (75.5 KB, 3 views)
DeondeJager is offline   Reply With Quote
Old 08-28-2017, 11:39 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

5' is most likely biological, a known artifact of fragmentation site bias and should be ignored. Nextera libraries in particular demonstrate this bias very strongly but it does not seem to have much impact downstream, and trimming it won't accomplish anything.
Brian Bushnell is offline   Reply With Quote
Old 08-28-2017, 11:56 AM   #3
DeondeJager
Junior Member
 
Location: Pretoria, South Africa

Join Date: Jun 2017
Posts: 3
Default

Quote:
Originally Posted by Brian Bushnell View Post
5' is most likely biological, a known artifact of fragmentation site bias and should be ignored. Nextera libraries in particular demonstrate this bias very strongly but it does not seem to have much impact downstream, and trimming it won't accomplish anything.
Thanks for the reply, Brian! I remember reading something about this fragmentation bias, now that you mention it, but I wasn't too sure if it applied to TruSeq as well. So I thought I'd rather be as thorough as possible and get some thoughts from the community before moving onto the next step.
DeondeJager is offline   Reply With Quote
Old 08-28-2017, 12:45 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

With TruSeq the shearing and adapters are independent, so I suppose it depends on your shearing process. I don't have any data for this but I would expect any enzymatic process to incur bias but sonication much less so. You can always test whether the uneven base composition is genomic or synthetic by mapping and calculating the mismatch rate by read position. If it is no higher at the spiky 3' end then the spikes are genomic and just caused by bias. If there is a high mismatch rate they are probably synthetic and should be trimmed, but I have never seen that happen in a normal fragment library.
Brian Bushnell is offline   Reply With Quote
Old 09-11-2017, 04:49 AM   #5
DeondeJager
Junior Member
 
Location: Pretoria, South Africa

Join Date: Jun 2017
Posts: 3
Default

Quote:
Originally Posted by Brian Bushnell View Post
You can always test whether the uneven base composition is genomic or synthetic by mapping and calculating the mismatch rate by read position. If it is no higher at the spiky 3' end then the spikes are genomic and just caused by bias. If there is a high mismatch rate they are probably synthetic and should be trimmed, but I have never seen that happen in a normal fragment library.
Hi again Brian,

I calculated the mismatch (or actually the match) rate from the .bam file using the mhist function in bbduk (mapping was done with bwa and of course the .bam files were produced with samtools).

Code:
~/programs/bbmap/bbmap/bbduk.sh in=B98_289_DSW37627_HJMJMALXX_L2_clean_aln-PE_sorted.bam mhist=mhist.txt qhist=qhist.txt out=bbdukhist.txt
For the sample with overrepresented kmers at the 5' end (sample2 in fastqc plots above = sample B98_289 in histograms below) there is a slight decrease in the match rate for the 5' and 3' ends of the reads, compared to the middle of the reads. However, this is not different from a sample that had no overrepresented kmers anywhere (sample B98_579). Therefore, it seems that the overrepresented kmers are genomic and not synthetic. Would you agree?
Attached Images
File Type: png mhist_B98_289.png (21.8 KB, 5 views)
File Type: png mhist_B98_579.png (31.9 KB, 4 views)
DeondeJager is offline   Reply With Quote
Old 09-11-2017, 11:04 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by DeondeJager View Post
For the sample with overrepresented kmers at the 5' end (sample2 in fastqc plots above = sample B98_289 in histograms below) there is a slight decrease in the match rate for the 5' and 3' ends of the reads, compared to the middle of the reads. However, this is not different from a sample that had no overrepresented kmers anywhere (sample B98_579). Therefore, it seems that the overrepresented kmers are genomic and not synthetic. Would you agree?
Yes, it sounds like it's probably fine to me... no trimming needed.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
fastqc, genome sequencing, kmers, trim galore, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:43 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO