SEQanswers

Old 04-14-2015, 04:57 AM   #1
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default SPAdes: selecting K-mer based on read length

Hello Members,
I need your guidance on selecting k-mer sizes for assembling paired-end bacterial genome data with SPAdes.

I have already read this section:
http://spades.bioinf.spbau.ru/releas...al.html#sec3.4

SPAdes is smart enough to select its own k-mer sizes and assemble with them.

My question is:
I'm automating assembly and downstream analysis for more than 1,000 isolates, and I'd like to understand how to select an optimal k-mer size based on read length.
If I were to set k-mer sizes based on read length, what should they be, and why?

Currently, I have Illumina 2x150 data.
K-mer choice may depend on coverage too, but let's say my data has less than 30X coverage.

I also came across the URL below.

https://www.biostars.org/p/58667/

Also, there are tools like KmerGenie and VelvetOptimiser (I have not tried them yet).

I know there's no fixed recipe for these (k-mer, coverage) situations, but is there a sensible way to choose that gives reasonable results?

Last edited by bio_informatics; 04-14-2015 at 05:08 AM.
Old 04-14-2015, 08:54 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The optimal kmer length should be less than the read length. Other than that there are not really any strict rules; generally, the longer the reads and the higher the coverage, the longer the kmer you can use. SPAdes, as far as I know, does not select a kmer length, but rather makes a combined assembly (by default) using multiple pre-selected kmer lengths of 21, 33, and 55. I think these values are low and a higher value, particularly for the max, would be better, at least for 150bp reads and good coverage.

We have not had a good experience with tools that automatically try to determine the best kmer length based on kmer frequency histograms, and I don't think there is any theoretical validity to that approach. Running multiple assemblies with different kmer lengths, and selecting the one with the best metrics, seems like the best approach - at least, if you are using a fast assembler, like Velvet. SPAdes is too slow for that approach.
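
The "less than read length" rule can be sketched in a few lines. This is only an illustration, not how SPAdes itself picks values: the ladder below is simply the set of k values mentioned in this thread (21, 33, 55, 77, 99, 127), and the function name `candidate_kmers` is hypothetical. SPAdes requires odd k values below 128, which every entry in the ladder satisfies.

```python
# Ladder of k values mentioned in this thread; all are odd and below 128.
KMER_LADDER = (21, 33, 55, 77, 99, 127)


def candidate_kmers(read_length, ladder=KMER_LADDER):
    """Return the ladder values that are strictly shorter than the reads.

    This is a simple filter, not a prediction of the best k: it only
    enforces the one hard rule (k must be less than the read length).
    """
    ks = [k for k in ladder if k < read_length]
    if not ks:
        raise ValueError("reads are shorter than the smallest k in the ladder")
    return ks


if __name__ == "__main__":
    # Format the list the way spades.py expects it after -k.
    print(",".join(str(k) for k in candidate_kmers(150)))
```

With low coverage the largest values may still fragment the assembly, so the filtered list is a starting point to test rather than a final answer.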
Old 04-14-2015, 09:10 AM   #3
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

You can restart SPAdes with additional k-mers if you think the assembly is not good enough. It combines the results from multiple k-mers, which to me seems like the best approach.
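
A minimal sketch of how such a restart could be scripted, assuming the `--restart-from` option described in the SPAdes manual (with checkpoint names such as `k55`). The function name `restart_command` and the output directory are hypothetical, and the option should be checked against the installed SPAdes version before relying on it.

```python
def restart_command(output_dir, kmers, last_k):
    """Build a spades.py command that resumes from an existing k and adds larger ones.

    output_dir: directory of the earlier SPAdes run (passed to -o)
    kmers:      full ascending k list, including the values already assembled
    last_k:     the k value of the finished stage to restart from
    """
    if last_k not in kmers:
        raise ValueError("last_k must be one of the values in kmers")
    return [
        "spades.py",
        "--restart-from", f"k{last_k}",
        "-k", ",".join(str(k) for k in kmers),
        "-o", output_dir,
    ]


if __name__ == "__main__":
    # Extend an earlier 21,33,55 run with 77 and 99 (illustrative values).
    print(" ".join(restart_command("out/isolate_001", [21, 33, 55, 77, 99], 55)))
```

The command is returned as a list so it can be handed to `subprocess.run` without shell quoting issues.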
Old 04-14-2015, 09:20 AM   #4
milw
Director NGS Services, Lucigen
 
Location: Madison WI USA

Join Date: Dec 2013
Posts: 12
Default

I've been doing a lot of bacterial assemblies with SPAdes 3.5, and in all the cases I've seen, using kmer 99 or 127 works best in terms of contig count and N50. This has been with 2x150 fragment PE plus a mate-pair library. I had one trial set of K12 data that had fewer misassemblies using K77 than with K99 or K127.
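
For comparing runs on contig count and N50, here is a minimal sketch of the standard N50 definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # toy values, not real data
    print(len(lengths), n50(lengths))
```

N50 alone rewards aggressive joining, so it is worth reading it next to a misassembly count, as in the K77 versus K99/K127 case above.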
Old 04-14-2015, 09:57 AM   #5
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default

Hi Milw,
Thank you for sharing your experience.
I do not have mate-pair data. I hope that will not make a huge difference to the k-mer results and the resulting assembly?

I'll definitely try k-mers 99 and 127 as you have.

Quote:
Originally Posted by milw View Post
I've been doing a lot of bacterial assemblies with SPAdes 3.5, and in all the cases I've seen, using kmer 99 or 127 works best in terms of contig count and N50. This has been with 2x150 fragment PE plus a mate-pair library. I had one trial set of K12 data that had fewer misassemblies using K77 than with K99 or K127.

Last edited by bio_informatics; 04-14-2015 at 10:02 AM.
Old 04-14-2015, 10:02 AM   #6
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default

Hi Chipper,
Oh, yes; I was unaware of this feature. Thank you for pointing it out.

Quote:
Originally Posted by Chipper View Post
You can restart SPAdes with additional k-mers if you think the assembly is not good enough. It combines the results from multiple k-mers, which to me seems like the best approach.
Old 04-14-2015, 10:17 AM   #7
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default

Hi Brian,
Thanks for your valuable points. Definitely, the k-mer won't be the read length, (un)fortunately :P
That's correct, SPAdes makes a combined assembly from the k-mers used.
But then again, the k-mers used are governed by the read length; hence my question.

I wanted to understand: should I let SPAdes choose its k-mers, which it picks based on read length, or should I check the read length myself and run it accordingly, as in this example from its documentation:

Quote:
spades.py -k 21,33,55,77 -
As milw and Chipper suggested, I will try the practices they mentioned.
Thanks.
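
For the automation side, a minimal sketch of building one explicit-k command per isolate. It assumes the standard `spades.py` flags `-1`, `-2`, `-k` and `-o`; the function name, the `<isolate>_R1.fastq.gz` / `<isolate>_R2.fastq.gz` file naming and the directory layout are hypothetical and would need to match the real data.

```python
from pathlib import Path


def spades_command(isolate, reads_dir, out_root, kmers):
    """Build (but do not run) a paired-end spades.py command for one isolate."""
    reads_dir = Path(reads_dir)
    return [
        "spades.py",
        "-1", str(reads_dir / f"{isolate}_R1.fastq.gz"),  # assumed naming scheme
        "-2", str(reads_dir / f"{isolate}_R2.fastq.gz"),  # assumed naming scheme
        "-k", ",".join(str(k) for k in kmers),
        "-o", str(Path(out_root) / isolate),
    ]


if __name__ == "__main__":
    # Print the commands for a few made-up isolate names.
    for name in ["isolate_001", "isolate_002"]:
        print(" ".join(spades_command(name, "reads", "assemblies", [21, 33, 55, 77])))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` or written to a job script, depending on how the 1,000+ runs are scheduled.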
Old 04-15-2015, 05:05 AM   #8
milw
Director NGS Services, Lucigen
 
Location: Madison WI USA

Join Date: Dec 2013
Posts: 12
Default

Quote:
Originally Posted by bio_informatics View Post
Hi Milw,
Thank you for sharing your experience.
I do not have mate-pair data. I hope that will not make a huge difference to the k-mer results and the resulting assembly?
It depends, of course, on what you're trying to assemble. I've had good luck with microbial BAC clones assembling completely with just fragment (paired-end) data; those are only ~100-150 kb. Whole microbial genomes will probably give you a bunch of contigs.

Here's an example with fragment data only for a 2.3 Mb microbial genome, showing bigger contigs with increasing kmer:

(final 'scaffolds' overlays final 'contigs' because there's no scaffolding without mate pair)

cheers- Scott

Last edited by milw; 04-15-2015 at 05:16 AM.
Old 04-20-2015, 04:32 AM   #9
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default

Thanks much Scott. :-)

Tags
assembly, k-mer, spades
