SEQanswers

Old 08-12-2015, 08:40 AM   #1
jov14
Member
 
Location: Germany

Join Date: Oct 2014
Posts: 16
Default kmer size and coverage cutoff for digital normalization using the khmer suite

Hi,

I want to use digital normalization on a set of single-cell sequencing data as well as metagenomic data from low-complexity communities. I'm probably missing some really obvious point, but I'm just not sure how to apply the recommended diginorm cutoffs to my relatively long MiSeq reads.

Both our single-cell sequencing data and our low-complexity metagenomic data were produced on a MiSeq, yielding several million paired-end reads of ~250-300 bp each.

The general recommendations in the khmer documentation state that you should normalize to a coverage of 1x to 5x using three-pass normalization and a k-mer size of 20.

My question is: are those recommendations really suited for modern "long-read" Illumina data? If I reduce the coverage of all k-mers of length 20 to 5x or less, won't that reduce the coverage of larger k-mers far too drastically?

Without diginorm, the optimal k-mer size with e.g. MetaVelvet is usually around k=81-101 for my datasets. How can enough k-mer coverage be left at that size for de Bruijn graph based assemblies if the k-mers of length 20 are already reduced to less than 5x coverage?

My version of khmer doesn't seem to support k-mers larger than 31, so apparently larger k-mer sizes are simply not needed for diginorm. I just don't understand why...
Old 08-13-2015, 04:20 AM   #2
titusbrown
Junior Member
 
Location: Midwest

Join Date: Aug 2013
Posts: 8
Default diginorm k-mer size/coverage doesn't directly correlate with assembly parameters

Hi jov14,

The short answer is that because khmer/diginorm retains or rejects entire reads, the k-mer size and coverage of that process are only weakly connected to what the assembler sees and does. That said, we are working on increasing the k size and on things like memory-efficient error correction, which would give you more choices.

A slightly longer answer: what diginorm is actually doing is aligning the reads to the de Bruijn graph, and while the alignment process depends on k, the alignment itself is not very sensitive to k. Diginorm then looks at the coverage of the alignment in the graph and decides whether to accept or reject the read. This changes the coverage from random/whole-genome shotgun to systematic/smooth, which has many (often good) effects on the resulting assembly. But it also tweaks the coverage distribution: while a coverage of 5 would be disastrous for whole-genome shotgun data (because you'd miss ~5% of bases!), the variance on the diginormed data is much lower, so you get a reduced set of reads that still contain all the information of the original set.
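To make the accept/reject rule concrete, here is a toy Python sketch of the normalize-by-median-style decision: a read is kept only while the median count of its k-mers is below the cutoff. The exact dictionary counter and the specific parameter values are illustrative only; khmer itself uses a memory-efficient probabilistic counting structure rather than a dict.

```python
import random
from collections import defaultdict
from statistics import median

K = 20      # k-mer size used for counting
CUTOFF = 5  # coverage cutoff C

# Toy exact k-mer counter; khmer uses a probabilistic,
# memory-efficient sketch instead of a dict.
counts = defaultdict(int)

def keep_read(seq):
    """Keep the read only while its median k-mer count is below CUTOFF."""
    kmers = [seq[i:i + K] for i in range(len(seq) - K + 1)]
    if median(counts[km] for km in kmers) < CUTOFF:
        for km in kmers:
            counts[km] += 1
        return True
    return False

# Feed 20 identical copies of a random 60 bp read: the first CUTOFF
# copies are kept, after which the median coverage reaches the cutoff
# and further copies are rejected.
random.seed(1)
read = "".join(random.choice("ACGT") for _ in range(60))
kept = sum(keep_read(read) for _ in range(20))
print(kept)  # 5
```

Note that because the decision is per-read (via the median), reads carrying mostly novel k-mers are always kept, so the surviving per-k-mer counts end up at or above the cutoff rather than below it.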

I hope that helps!
Old 08-13-2015, 04:50 AM   #3
titusbrown
Junior Member
 
Location: Midwest

Join Date: Aug 2013
Posts: 8
Default

Oh, sorry, to answer your original question:

I would suggest running a single pass with C=20/k=20, and only doing further error trimming etc. if you run into out-of-memory problems. We've found C=20/k=20 works pretty well for most sequence data.
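In khmer's command-line terms, that single pass might look roughly like the following. Flag spellings are khmer ~2.x style and have changed across releases (check `normalize-by-median.py --help` for your version); the file names are placeholders.

```shell
# Single-pass digital normalization to C=20 with k=20.
# -p treats the input as interleaved paired-end reads;
# -M caps the memory used by the k-mer counting table.
normalize-by-median.py -k 20 -C 20 -M 8e9 -p \
    -o reads.keep.fq.gz interleaved_reads.fq.gz
```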
Old 08-13-2015, 05:29 AM   #4
jov14
Member
 
Location: Germany

Join Date: Oct 2014
Posts: 16
Default

Thanks for your answer and suggestion!
After I posted this "problem" and had some more time to think, it came back to me:
Since, as you say, diginorm only excludes a read once its k-mers are already covered above the cutoff, and any read still contributing novel k-mers tends to be kept, the final coverage of each individual k-mer will of course be much higher than the cutoff. I simply forgot that, and my problem is really nonexistent.

Actually, I have already used three-pass normalization procedures on previous data (where I had read lengths of 100 bp), using C=20 in the first pass and C=5 in the third (I must have picked that up in one of your tutorials somewhere).
I usually then do two assemblies, one with the first-pass-normalized data and one with the third-pass-normalized data, and just pick the assembly that looks best (at least for single-cell data, both are usually far better than with non-normalized data).
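For reference, the three-pass procedure described above can be sketched with khmer's scripts roughly as follows. This assumes khmer ~2.x flag names (`--savegraph` was `--savetable` in older releases) and placeholder file names; consult the khmer documentation for your installed version.

```shell
# Pass 1: normalize to C=20 and save the k-mer count table.
normalize-by-median.py -k 20 -C 20 -M 8e9 \
    --savegraph counts.ct -o pass1.fq reads.fq

# Pass 2: trim low-abundance (likely erroneous) k-mers using the saved
# table; -V enables variable-coverage mode, appropriate for
# metagenome and single-cell data.
filter-abund.py -V counts.ct pass1.fq -o pass2.fq

# Pass 3: normalize the error-trimmed reads down to C=5.
normalize-by-median.py -k 20 -C 5 -M 8e9 -o pass3.fq pass2.fq
```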

However, would you say that for longer reads higher k-mer values would bring some advantage (I would expect the identification of unique k-mers in the k-mer-trimming/error-correction step to be more specific, at least), or are the values better left as they are?
Old 08-13-2015, 07:17 AM   #5
titusbrown
Junior Member
 
Location: Midwest

Join Date: Aug 2013
Posts: 8
Default

You can probably get slightly better performance on nasty, large, repetitive genomes with larger k-mers, for sure! In my lab I balance that against the fact that we feel very comfortable with k=20/C=20 for transcriptomes and metagenomes, based on our own experience.

Report back if you play around - I'd love to hear more!

Tags
diginorm, digital normalization, khmer suite, kmers, miseq v3 pe300
