SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   5' Kmers problem with whole genome re-sequencing data (http://seqanswers.com/forums/showthread.php?t=77779)

DeondeJager 08-28-2017 12:16 AM

5' Kmers problem with whole genome re-sequencing data
 
5 Attachment(s)
Hi all,

I am having some issues with getting rid of kmers in my sequencing data sets. We have re-sequenced several African buffalo genomes on an Illumina HiSeq X machine, 150bp paired-end reads. I ran FastQC on the raw reads and all samples pass the adapters and overrepresented sequences tests, but some fail the kmer test (length = 7 nucleotides). The samples either have kmers overrepresented at the 5' or 3' end. I can see that at least the 3' kmers are actually part of the Illumina adapters used (but FastQC is not picking these up in the adapter test), but the 5' kmers do not seem to be part of the adapters, as far as I can tell.

Trimmomatic successfully removes the 3' kmers of one sample, but not the 5' kmers of another sample. The same goes for Trim Galore (cutadapt), which can only remove adapter sequences from the 3' end of the reads (as far as I can tell from the cutadapt documentation). The per base quality is almost always above phred=20. This is the relevant part of the code I used for Trimmomatic and Trim Galore:

Trimmomatic: ILLUMINACLIP:/apps/chpc/bio/trimmomatic/0.36/bin/adapters/TruSeq3-PE-2.fa:2:30:10:1:true SLIDINGWINDOW:4:20 MINLEN:36

Trim Galore: trim_galore -q 20 --phred33 --stringency 1 -e 0.1 --length 36 -o /mnt/lustre/users/djager/trim_galore_out --paired --retain_unpaired -r1 37 -r2 37

Is there any way I can remove the 5' kmers without having to trim the 5' nucleotides of all the reads? Are these 5' kmers perhaps biological sequences?

The FastQC results are attached. The examples are all for the forward reads.

Brian Bushnell 08-28-2017 10:39 AM

5' is most likely biological, a known artifact of fragmentation site bias and should be ignored. Nextera libraries in particular demonstrate this bias very strongly but it does not seem to have much impact downstream, and trimming it won't accomplish anything.

DeondeJager 08-28-2017 10:56 AM

Quote:

Originally Posted by Brian Bushnell (Post 210476)
5' is most likely biological, a known artifact of fragmentation site bias and should be ignored. Nextera libraries in particular demonstrate this bias very strongly but it does not seem to have much impact downstream, and trimming it won't accomplish anything.

Thanks for the reply, Brian! I remember reading something about this fragmentation bias, now that you mention it, but I wasn't too sure if it applied to TruSeq as well. So I thought I'd rather be as thorough as possible and get some thoughts from the community before moving onto the next step.

Brian Bushnell 08-28-2017 11:45 AM

With TruSeq the shearing and adapters are independent, so I suppose it depends on your shearing process. I don't have any data for this but I would expect any enzymatic process to incur bias but sonication much less so. You can always test whether the uneven base composition is genomic or synthetic by mapping and calculating the mismatch rate by read position. If it is no higher at the spiky 3' end then the spikes are genomic and just caused by bias. If there is a high mismatch rate they are probably synthetic and should be trimmed, but I have never seen that happen in a normal fragment library.

DeondeJager 09-11-2017 03:49 AM

2 Attachment(s)
Quote:

Originally Posted by Brian Bushnell (Post 210479)
You can always test whether the uneven base composition is genomic or synthetic by mapping and calculating the mismatch rate by read position. If it is no higher at the spiky 3' end then the spikes are genomic and just caused by bias. If there is a high mismatch rate they are probably synthetic and should be trimmed, but I have never seen that happen in a normal fragment library.

Hi again Brian,

I calculated the mismatch (or actually the match) rate from the .bam file using the mhist function in bbduk (mapping was done with bwa and of course the .bam files were produced with samtools).

Code:

~/programs/bbmap/bbmap/bbduk.sh in=B98_289_DSW37627_HJMJMALXX_L2_clean_aln-PE_sorted.bam mhist=mhist.txt qhist=qhist.txt out=bbdukhist.txt
For the sample with overrepresented kmers at the 5' end (sample2 in fastqc plots above = sample B98_289 in histograms below) there is a slight decrease in the match rate for the 5' and 3' ends of the reads, compared to the middle of the reads. However, this is not different from a sample that had no overrepresented kmers anywhere (sample B98_579). Therefore, it seems that the overrepresented kmers are genomic and not synthetic. Would you agree?

Brian Bushnell 09-11-2017 10:04 AM

Quote:

Originally Posted by DeondeJager (Post 210845)
For the sample with overrepresented kmers at the 5' end (sample2 in fastqc plots above = sample B98_289 in histograms below) there is a slight decrease in the match rate for the 5' and 3' ends of the reads, compared to the middle of the reads. However, this is not different from a sample that had no overrepresented kmers anywhere (sample B98_579). Therefore, it seems that the overrepresented kmers are genomic and not synthetic. Would you agree?

Yes, it sounds like it's probably fine to me... no trimming needed.


All times are GMT -8. The time now is 02:34 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.