Old 05-31-2013, 07:03 AM   #1
Location: Sweden

Join Date: Jan 2012
Posts: 45
Estimating the bacterial genome size using k-mer frequency


How can I estimate the size of a GC-rich bacterial genome when there is no close reference genome?

At first I tried Jellyfish and generated histogram plots (for all the available k-mer sizes). I identified what I believe is the main peak in each histogram and calculated the genome size from it, but I get less than half of the genome size compared with the assembly produced by SOAPdenovo2.

Then I tried KmerGenie (for all the different k-mer sizes), and again I am not getting a proper estimate.

* Illumina HiSeq : Paired-end data : Read length 100 bp ;
* GC percent : 63 % ; (Read_1)
* Duplicates in fastq : ->48% (Read_1)
* Read_1 :10128605 (data from FastQC)
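
As a rough sanity check on figures like these, the expected k-mer depth for a candidate genome size can be worked out from the read count, which helps when deciding which histogram peak is the real one. The sketch below is only illustrative: it assumes Read_2 holds the same number of reads as Read_1, ignores sequencing errors and the reported duplicates, and the two candidate genome sizes are hypothetical values chosen for the example.

```python
def base_coverage(num_reads, read_length, genome_size):
    """Average per-base sequencing depth."""
    return num_reads * read_length / genome_size


def kmer_coverage(base_cov, read_length, k):
    """Expected k-mer depth: a read of length L contains L - k + 1 k-mers."""
    return base_cov * (read_length - k + 1) / read_length


# Read_1 count as reported by FastQC above; the factor of 2 is an
# assumption that Read_2 contains the same number of reads.
total_reads = 2 * 10128605

# Hypothetical candidate genome sizes in bp, for illustration only.
for genome_size in (3.0e6, 6.0e6):
    cov = base_coverage(total_reads, 100, genome_size)
    print(genome_size, round(cov), round(kmer_coverage(cov, 100, 31)))
```

If the peak picked from the histogram sits far from the depth expected for a plausible genome size, that is a hint the wrong peak (or the wrong total) went into the estimate.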

Any suggestions would be greatly appreciated.

Thank you very much..

Last edited by Krish_143; 06-05-2013 at 03:30 AM.
Krish_143 is offline   Reply With Quote
Old 05-31-2013, 08:10 AM   #2
Senior Member
Location: Boston area

Join Date: Nov 2007
Posts: 747

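The number of distinct k-mers divided by 2 should be the genome size (the division by 2 applies when a k-mer and its reverse complement are counted separately), with a few important caveats: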

1) Including erroneous k-mers will inflate the count, so you would typically count only those k-mers with a count of >=2

2) Repeat regions will be collapsed

3) Regions that just don't show up will be missed, again leading to an underestimate. With a high G+C genome, there may be regions that are simply missing from the Illumina data or have very low coverage.
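
The distinct k-mer estimate with the count >= 2 filter from caveat 1 can be sketched as follows. This is a minimal sketch, assuming a two-column histogram (multiplicity, number of distinct k-mers) such as the one `jellyfish histo` writes; the function names and the toy histogram are made up for illustration, and the `strands_counted_separately` switch is an assumption about how the k-mers were counted (with canonical counting, e.g. `jellyfish count -C`, the division by 2 should be skipped).

```python
def parse_histogram(text):
    """Parse 'multiplicity count' lines into (multiplicity, count) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            rows.append((int(fields[0]), int(fields[1])))
    return rows


def distinct_kmer_estimate(rows, min_count=2, strands_counted_separately=True):
    """Genome size estimate from the number of distinct k-mers.

    K-mers seen fewer than min_count times are treated as likely errors
    and ignored. If a k-mer and its reverse complement were counted as
    two different k-mers, the total is halved.
    """
    distinct = sum(n for mult, n in rows if mult >= min_count)
    return distinct // 2 if strands_counted_separately else distinct


# Toy histogram (hypothetical numbers, not from this dataset).
toy = "1 1000\n2 10\n3 20\n"
rows = parse_histogram(toy)
print(distinct_kmer_estimate(rows))
print(distinct_kmer_estimate(rows, strands_counted_separately=False))
```

Running this over the histograms for several values of k gives the range of estimates asked about below; caveats 2 and 3 still apply, so the result is best read as a lower bound.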

Ray produces the kmer statistics in a way that is easy to parse & generate these estimates.

Assemblies are often a bit too large due to missed overlaps. If you convert these histograms to genome size estimates, how big a range is covered?

Even without a reference, the taxonomy of the bug may suggest a range -- though you could well have something outside that range.
krobison is offline   Reply With Quote
Old 05-31-2013, 08:28 AM   #3
Location: Sweden

Join Date: Jan 2012
Posts: 45

Hi krobison,

When I estimated the genome size using k-mer information (histogram, k-mer peaks):
Estimated genome size: 2.8 Mb (at k = 31)
Assembled genome size using SOAPdenovo: 5.7 Mb (draft)
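
For reference, a peak-based estimate of this kind is commonly computed as the total number of non-error k-mers divided by the depth of the main peak. The sketch below shows that calculation under stated assumptions and is not necessarily how the figure above was obtained: the error cutoff is taken as the first valley of the histogram, the main peak as the most frequent multiplicity at or after that valley, and the toy histogram is invented for the example.

```python
def peak_estimate(rows):
    """Estimate genome size as (total solid k-mers) / (main peak depth).

    rows is a list of (multiplicity, number_of_distinct_kmers) pairs.
    Returns (estimated_genome_size, peak_depth).
    """
    rows = sorted(rows)
    # The first valley is the last bin before the counts start rising
    # again; bins below it are assumed to be mostly sequencing errors.
    for i in range(1, len(rows)):
        if rows[i][1] > rows[i - 1][1]:
            valley_idx = i - 1
            break
    else:
        raise ValueError("histogram has no valley; cannot locate a peak")
    solid = rows[valley_idx:]
    # Depth of the main peak: multiplicity with the most distinct k-mers.
    peak_depth = max(solid, key=lambda r: r[1])[0]
    # Total k-mer occurrences among the bins kept.
    total_kmers = sum(mult * n for mult, n in solid)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical numbers): error tail, then a peak at depth 5.
toy_rows = [(1, 1000), (2, 100), (3, 10), (4, 50), (5, 200), (6, 50), (7, 10)]
size, depth = peak_estimate(toy_rows)
print(size, depth)
```

Because the estimate is inversely proportional to the peak depth, choosing a peak at twice the true depth halves the result, so it is worth checking the chosen peak against the depth expected from the read count.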

I will check with Ray. Thank you very much, krobison, for the quick response.

Last edited by Krish_143; 06-02-2013 at 02:06 AM.
Krish_143 is offline   Reply With Quote
Old 06-02-2013, 03:13 PM   #4
Location: France

Join Date: Jan 2013
Posts: 13

I sometimes observe that SOAPdenovo contigs (not scaffolds) tend to assemble more than the genome size. Did you run a Velvet assembly, and if so, what was the assembly size?
rchikhi is offline   Reply With Quote
