SEQanswers

Old 08-26-2015, 06:34 AM   #1
henriettevdz
Junior Member
 
Location: South Africa, Potchefstroom

Join Date: Aug 2015
Posts: 3
Question Advice needed on De novo sequences Kmer content

Good day,

I need some advice on the Kmer content of my de novo project. I've sequenced the genome of a lovebird (parrot) species. Here are some details:

- We sequenced the offspring at 100x coverage and its parents at 30x coverage on an Illumina HiSeq 2500
- The offspring had 3 PE libraries of 300, 550 and 750 bp, the parents 2 PE libraries of 300 and 550bp
- The offspring had 2 LJD MP libraries of 3 and 8 kb
- The read lengths were 125bp but after trimming by the service providers they were 30-125bp long
- The genome has a GC content of around 43%
- Overall the FastQC files look good and the only problem is the Kmer content

Here is the problem... There seems to be a Kmer bias around positions 42-54 bp in all 3 samples.

It looks as if it is part of the Illumina TruSeq adapter, but it isn't reported as an over-represented sequence. The adapter sequence is:
5' GATCGGAAGAGCACACGTCTGAACTCCAGTCAC-NNNNNN-ATCTCGTATGCCGTCTTCTGCTTG 3'

I have attached two screenshots of the Kmer content here. Most of the FastQC reports look like this, for all 3 birds.

We have discussed it with the service provider, but they feel we don't have to worry at all.

Has anybody experienced anything like this before? Can you offer some help please?

Thank you in advance!
Henriette
Attached Images
File Type: png Screenshot 2015-08-03 09.09.01(2).png (122.8 KB, 16 views)
File Type: png Screenshot 2015-08-03 09.10.06.png (144.4 KB, 12 views)
Old 08-26-2015, 10:44 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Since your reads are variable length, they should have already been trimmed. But perhaps the trimming did not work very well, due to (for example) low quality. If the reads have adapter contamination, you can find it using BBMerge:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt outadapter=adapter.fa reads=4m

The insert size histogram will also be informative (insert sizes shorter than read length indicate adapter contamination). Ideally (in this case), very few of your reads will even overlap, so they won't merge. Once this finishes, you can try trimming the sequences with BBDuk like this:

bbduk.sh in1=read1.fq in2=read2.fq ref=adapter.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe

...which will report the number of reads with adapter sequence. You can alternatively (or additionally) use the adapter sequence file distributed with BBDuk since it has all standard Illumina adapters, but you never know what a random provider used.

Since your target fragment lengths were, at a minimum, 300bp, there should be virtually zero adapter sequence present in 125bp reads. If there is, it indicates that your target insert sizes were probably not hit, or short fragments were not correctly removed. If you see adapter contamination in these trimmed reads, there was probably a serious problem upstream and you may need to request the sequencing to be redone, or else trim them correctly starting with the raw, untrimmed reads.
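As a rough illustration of the insert-size check described above, here is a minimal Python sketch that reads an insert size histogram and reports what fraction of the counted pairs have an insert shorter than the read length. It assumes the histogram is tab-delimited with '#'-prefixed header lines followed by "size, count" rows; the function name and the example numbers are made up for illustration. Note that the fraction is relative to the pairs that appear in the histogram, not to all pairs sequenced.

```python
import io


def short_insert_fraction(ihist_lines, read_length):
    """Return the fraction of counted pairs whose insert is shorter than read_length."""
    total = 0
    short = 0
    for line in ihist_lines:
        line = line.strip()
        # Skip blank lines and '#'-prefixed header or summary lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        size, count = int(fields[0]), int(fields[1])
        total += count
        if size < read_length:
            short += count
    return short / total if total else 0.0


# Made-up histogram for illustration only (not real data).
example = io.StringIO(
    "#InsertSize\tCount\n"
    "60\t10\n"
    "100\t30\n"
    "150\t40\n"
    "240\t20\n"
)
print(short_insert_fraction(example, read_length=125))
```

With real data you would pass an open file handle for ihist.txt instead of the in-memory example. For libraries with 300bp or larger target inserts, a value well above zero would point to the kind of upstream problem described above.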
Old 08-26-2015, 09:33 PM   #3
henriettevdz
Junior Member
 
Location: South Africa, Potchefstroom

Join Date: Aug 2015
Posts: 3
Default

Thanks for your reply Brian! I really appreciate your help.

I forgot to add that the PE libraries are all fixed length of 125bp and it is only the LJD libraries that are variable length and trimmed by the service provider.

The biggest problem we have is that the over-represented Kmers are found in the middle of the read, and they cover only part of the adapter sequence, not the whole of it. The regions before and after the over-represented sequence are of good quality.

Thanks!
Henriette
Old 08-27-2015, 09:28 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Well, just because the over-represented kmers are reported as being shorter than the adapter sequence does not mean the entire adapter is not present. I suggest you try adapter-trimming the reads using the adapter set included with BBDuk and see if that resolves the problem.
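One quick way to look at this directly, independent of any trimming tool, is to search the reads for the start of the adapter and tally where it begins. The Python sketch below is only an illustration: it uses exact matching on a short seed (so it will miss adapters with sequencing errors, which BBDuk with hdist=1 would tolerate), and the function name, seed length, and example reads are assumptions. The adapter string is the portion quoted in the first post.

```python
from collections import Counter

# Adapter portion quoted in the first post (up to the index placeholder).
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def adapter_start_positions(reads, adapter=ADAPTER, seed_len=12):
    """Count, per 0-based read position, where the adapter's first seed_len bases occur."""
    seed = adapter[:seed_len]
    positions = Counter()
    for read in reads:
        pos = read.find(seed)  # exact match only; -1 when the seed is absent
        if pos != -1:
            positions[pos] += 1
    return positions


# Made-up reads for illustration only.
reads = [
    "ACGTACGT" + ADAPTER[:20],  # adapter starts at position 8
    "TTTT" + ADAPTER,           # adapter starts at position 4
    "ACGTACGTACGT",             # no adapter
]
for pos, n in sorted(adapter_start_positions(reads).items()):
    print(pos, n)
```

If the adapter really is present, the start position corresponds to the insert length of that fragment, and everything from that position to the end of the read is adapter rather than genome.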
Old 08-28-2015, 04:28 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,156
Default

Quote:
Originally Posted by henriettevdz View Post
The biggest problem we have is that the over represented Kmers are found in the middle of the read and it isn't the whole adapter sequence, but only a part of it.
Henriette
The FastQC Kmer plot shows only the top 6 most abundant kmers. It is very likely that all kmers from the full adapter are over-represented; it just so happens that the ones from that spot in the middle are the most abundant. Examine the full FastQC report (fastqc_data.txt) and you will likely be able to reconstruct all or most of the adapter from the full list of abundant kmers.
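A minimal Python sketch of that check is given here. It assumes the report has a module that starts with a ">>Kmer Content" line and ends with ">>END_MODULE", with the kmer in the first tab-separated column; the function names and the example rows are made up for illustration. It marks which bases of the adapter quoted in the first post are covered by at least one listed kmer.

```python
# Adapter portion quoted in the first post (up to the index placeholder).
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def read_kmer_module(lines):
    """Return the kmers listed in the '>>Kmer Content' module."""
    kmers = []
    in_module = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>Kmer Content"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            kmers.append(line.split("\t")[0])
    return kmers


def adapter_coverage(kmers, adapter):
    """Show adapter bases covered by any listed kmer; uncovered bases become '.'."""
    covered = [False] * len(adapter)
    for kmer in kmers:
        start = adapter.find(kmer)
        while start != -1:
            for i in range(start, start + len(kmer)):
                covered[i] = True
            start = adapter.find(kmer, start + 1)
    return "".join(base if hit else "." for base, hit in zip(adapter, covered))


# Made-up module contents for illustration only (not real FastQC output).
example = [
    ">>Kmer Content\twarn",
    "#Sequence\tCount\tPValue\tObs/Exp Max\tMax Obs/Exp Position",
    "GATCGGA\t1200\t0.0\t35.2\t45",
    "ATCGGAA\t1150\t0.0\t33.9\t46",
    "CTCCAGT\t900\t0.0\t20.1\t68",
    ">>END_MODULE",
]
print(adapter_coverage(read_kmer_module(example), ADAPTER))
```

With a real report you would pass the lines of fastqc_data.txt. If most of the adapter shows up as covered, that supports the idea that the whole adapter is present and only the strongest kmers made it into the plot.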
Old 08-30-2015, 08:23 AM   #6
henriettevdz
Junior Member
 
Location: South Africa, Potchefstroom

Join Date: Aug 2015
Posts: 3
Default BBMerge trim

Dear Brian,

We have trimmed the adapter sequences and I've attached the FastQC kmer content plots from before and after trimming for the same runs (one sample only, from the 300 and 550 bp libraries). The number of sequences was reduced from around 50 000 000 to 4 000 000. Is this what we could expect from the data?

Thanks!
Henriette
Attached Images
File Type: png Father 300 L003 R2 after trim.png (56.2 KB, 5 views)
File Type: png Father 300 L003 R2 before trim.png (53.3 KB, 2 views)
File Type: png Father 550 L003 R1 after trim.png (102.5 KB, 6 views)
File Type: png Father 550 L003 R1 before trim.png (65.2 KB, 2 views)
Old 08-31-2015, 08:35 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

To be honest, I tend to find the overrepresented kmer graphs fairly confusing and rely more on the base frequency by position. If the total number of sequences was reduced from 50 million to 4 million then you have a major problem with the raw data and it needs to be re-run (possibly with higher molecular weight input DNA). Or did you mean 40 million?

It would be helpful to see the mapping results to an assembly, but with 4m reads, you won't get much of an assembly. So...

First off, can you post the stderr (console) output of BBDuk?

Second, it would be useful if you could run BBMerge on the raw input and post the console output, and attach the insert size histogram, for the 300bp and 550bp libraries, like this:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt reads=4m

Also, the entire FastQC report from before trimming (in PDF, or, at least, the base composition histogram) would be useful. It seems like maybe you have a huge number of adapter-dimers, or very short inserts.
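For anyone who wants the base frequency by position without rerunning FastQC, a small Python sketch is given here. It works on plain read sequences (extracting them from a FASTQ file is left out), handles variable-length reads, and the function name and example reads are made up for illustration.

```python
from collections import Counter


def base_frequency_by_position(reads):
    """Return one dict per read position mapping each base to its fraction there."""
    counts = []
    for read in reads:
        for i, base in enumerate(read):
            # Grow the list the first time a read reaches a new position.
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({base: n / total for base, n in c.items()})
    return freqs


# Made-up reads for illustration only.
for pos, freq in enumerate(base_frequency_by_position(["ACGT", "AC", "AG"])):
    print(pos, freq)
```

In genomic reads the four base frequencies should stay roughly flat along the read. A shift toward a fixed composition from some position onward is the sort of pattern adapter read-through tends to produce.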
All times are GMT -8.