View Full Version : Genome size estimation for paired end data

11-18-2014, 03:32 AM

I am working with Illumina paired-end data with read length (l) 101 to generate de novo assembly. I am trying to predict genome size as per given formula, M = N * (L-K+1)/L.
The M value I found using Jellyfish which is around 3591. But I am confused since I have paired end file with 80284125 reads in each file, i.e. 80284125*2. When I have calculated genome size using only 80284125*101, I got expected genome size around 1.78MB.

But I am not sure whether genome size should be calculated how I have calculated above or using (80284125*101)*2, since data is paired end.

Also, it would be of great help if you tell how to get error k-mer distribution value.


Brian Bushnell
11-18-2014, 09:54 AM
If you download the BBMap (https://sourceforge.net/projects/bbmap/) package and run kmercountexact, it will tell you the size of the single-copy genomic peak, which is roughly the genome size for bacteria.

kmercountexact.sh in1=read1.fastq in2=read2.fq khist=khist.txt peaks=peaks.txt

You can plot the khist.txt if you want to look at it visually to see how many error kmers there are. "peaks.txt" will give you the location and size of the largest peaks; the number indicating the volume of the first peak should be approximately the genome size, though you can make it more accurate by factoring in the repeat peaks.

You can do some of this with Jellyfish, too, but I don't know the specifics.

11-18-2014, 10:24 PM
Thanks Brian. It, BBMAP, did work well, since it gave volume in first row which is very much near to my specie's nearby species' genomic size unlike jellyfish. Only thing is I had to subsample my very huge data, as with large dataset it threw exceptions.

Only one question in this aspect which you have mentioned in your message " though you can make it more accurate by factoring in the repeat peaks".. If you could clarify this, it would be of great help.

Brian Bushnell
11-19-2014, 08:27 AM
If you plot the kmer frequency histogram of a single haploid organism, you will typically see a peak at 0 (the error peak), then the next peak is typically the most prominent and corresponds to the size of the genome that is unique.

Then there will typically be more, smaller peaks at higher kmer depths. Let's say the first peak is at 40, and contains 3 million kmers. That implies that you have roughly 40x coverage, and the genome contains 3Mbp of unique sequence. There will probably be another peak at 80, which is a lot smaller. This is the peak from 2-copy repeat regions of the genome. If it contains 100,000 kmers, then the expected genome size would actually be increased by 200,000bp - because each of those kmers occurs twice. The three-copy peak would be at 120, and if it contained 100,000 kmers, it would contribute 300,000bp to the expected genome size.

So - if your first peak P is at depth D, with volume V, then let's label subsequent peaks at depth N*D as PN with volume VN - e.g. P2 is the peak at 2*D, and has a volume of V2. The total genome size would be:


In each case, you can get the multiplier N by round(DN/D1) where DN is the depth (center) of the current peak and D1 is the depth of the first peak.

But, my program does not have a very sophisticated peak-calling algorithm, so it won't find all of the peaks - it will probably find the first 2, and then maybe some more. You'll have to do that manually from looking at the graph of the kmer histogram, or use a better peak-caller. For most small organisms, though, the vast majority of the genome is single-copy so the higher peaks can be ignored if you just want a rough estimate.

11-19-2014, 07:56 PM
Thanks alot Brian once again for your comprensive explanation. Further peaks, after first error and second actual, there are no apparent peaks. So, I assume whaterver first row volume is there, is my genome size.