SEQanswers Genome size estimation using BBMAP
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post mandar.bobade60 General 1 02-10-2015 11:09 PM mandar.bobade60 General 4 11-19-2014 08:56 PM bioman1 Bioinformatics 3 08-18-2014 12:14 PM moinul De novo discovery 9 04-04-2014 04:22 AM pcmount87 Bioinformatics 0 10-20-2013 08:38 AM

11-06-2015, 09:12 AM   #1
syintel87
Member

Location: Universe

Join Date: Dec 2012
Posts: 81
Genome size estimation using BBMAP

Hi all,
I have 21,005,534 x 2 paired-end genome sequencing data.
To estimate the genome size, I used BBMAP.
Is there anyone who could help to interpret the output?

The command:
Code:
`./kmercountexact.sh in1=read1.fastq in2=read2.fastq khist=khist.txt peaks=peaks.txt`
The 'peaks.txt' output:
Quote:
 #k 31 #unique_kmers 1432946339 #main_peak 31 #genome_size 285812914 #haploid_genome_size 95270971 #fold_coverage 31 #haploid_fold_coverage 112 #ploidy 3 #het_rate 0.03413 #percent_repeat 6.291 #start center stop max volume 13 31 58 6211343 100786874 58 65 71 73206 913749 71 112 170 1196232 41304813 170 224 285 19062 1199939 285 324 325 3981 138959 325 329 396 4084 201426 396 414 443 1692 75732 443 447 525 1481 92625 525 526 535 833 7869 535 539 541 822 4724 541 543 557 797 12164 557 565 1413 725 181119
Q1.
to calculate the genome size, 100786874 + 913749/2 + 41304813/3 + ... + 12164/11 + 181119/12 = 144,919,993.
However, on the 4th line of the output, the genome size is 285,812,914.
What is the real genome size?

Q2.
Also, what does each line of the output mean (e.g. unique_kmers, haploid_genome_size, haploid_fold_coverage, ploidy, het_rate, percent_repeat)?

 11-06-2015, 10:38 AM #2 Brian Bushnell Super Moderator   Location: Walnut Creek, CA Join Date: Jan 2014 Posts: 2,707 Hi! Unfortunately, it looks like ploidy was estimated wrong; I don't think anything has a ploidy of 3. If the peak calls are correct, it's probably a highly polymorphic tetraploid, or a highly repetitive diploid or haploid. The full genome size would be approximately: 1*100786874 + 2*913749 + 4*41304813 + 8*1199939 + 10*138959 ...etc Yielding 278822726, and so including all the peaks it would be around 285812914 as the program indicated. However the multipliers for the various peaks are hard to determine. Generally, the multiplier is the the peak's center divided by the first peak's center, so the first multiplier is 31/31=1, the second is 65/31=2. But the third is 112/31=3.6, which is hard to tell if it's 3 or 4 (they are all integers). It's probably 4. Thus the genome is around 285Mbp all told, but it's hard to figure out the ploidy which makes it hard to figure out the haploid size. If it's a haploid then the genome is ~285MB. If it's tetraploid then you divide it by 4 and get 4 copies of a 71Mbp genome. For example, a human cell has around 6 Gbp of DNA in it, but because humans are diploid, the haploid representation of the genome is around 3 Gbp long. The number of unique kmers is not important. Fold coverage is the amount of times you have sequenced each base, on average. Haploid genome size is the size of the genome divided by the ploidy. Haploid fold coverage is the number of times you have sequenced each base in the haploid representation, if the ploidy estimate was correct, which it isn't. Ploidy is the number of copies of each chromosome per cell. Het rate is an estimate of the fraction of the genome that is heterozygous (usually meaning a SNP); so 0.03413 indicates that 3.4% of the genome is heterozygous, meaning there is a SNP every 30 bases, which is very high. Percent repeat is the percent of the genome that occurs more than once as an exact copy; this number comes from analyzing the high-depth peaks, which come from repeat regions. However, if the ploidy estimate is off, then the het rate and percent repeat will also be off. Usually the ploidy estimate works well for haploid and diploid organisms, but tetraploid and higher are difficult. As for the post you pointed to, I was totally wrong about the formula, it should have been multiplication, not division. I will edit it. Last edited by Brian Bushnell; 11-06-2015 at 10:41 AM.
 07-10-2017, 09:43 AM #3 horvathdp Member   Location: Fargo Join Date: Dec 2011 Posts: 64 Hi All (and particularly Brian Bushnell), I am trying to estimate the genome size of my leafy spurge genome with BBMAP. The ouput says my genome size is a haploid of ~950MB. However, using a procedure described for Jellyfish kmer counts – (sum of (kmer count* the number of kmers))/the peak fold I get 4.6 Gb. The flow cytometry data we have for leafy spurge is 2.1Gb for the 1x genome. From the discussion above, it seems that the BBMAP program measures area under the curve for each peak to get it’s genome size estimate. Given the additional data from the flow cytometry results, I am inclined to believe the method described for the Jellyfish data, but I was wondering why the two methods give such different results, and if there was any reason to prefer the BBMAP method.
 07-10-2017, 09:54 AM #4 Brian Bushnell Super Moderator   Location: Walnut Creek, CA Join Date: Jan 2014 Posts: 2,707 The estimate is provided for convenience, but the accuracy is not guaranteed since it has to guess the peak locations, peak limits, and ploidy of the organism. It can be somewhat more accurate if you know the ploidy, in which case you can tell it the ploidy, but it still has to guess which peak is the homozygous peak, etc. I find that I can usually do a better job manually by eyeballing the peaks (in conjunction with the exact peak volumes from peaks.txt). Can you post an image of the graph, and the contents of peaks.txt?
 07-10-2017, 12:48 PM #5 horvathdp Member   Location: Fargo Join Date: Dec 2011 Posts: 64 Thanks!!! Here is a link to the data: https://drive.google.com/file/d/0Byp...ew?usp=sharing. I didn't know I could define the ploidy level. Can you share a hypothetical script with that option indicated?
 07-10-2017, 03:21 PM #6 Brian Bushnell Super Moderator   Location: Walnut Creek, CA Join Date: Jan 2014 Posts: 2,707 Visually, that looks like a very obvious haploid to me, with median coverage at ~16x and genome size of ~700Mbp for the unique stuff (using SUM(B8:B28) from your spreadsheet) and a couple hundred Mbp for repetitive stuff (sum of e.g. B29*\$A\$18/A31 for subsequent lines), so I'd say that KmerCountExact's result is correct for both ploidy and genome size. You can specify the ploidy by adding the flag "ploidy=1" or similar: kmercountexact.sh in=reads.fq poloidy=1 khist=khist.txt peaks=peaks.txt Can you describe the Jellyfish procedure in more detail, or link to it? Also, what kmer length are you using, how long are your reads, and what do you expect for the genome ploidy? Sometimes you can't tell apart a haploid from a diploid that has an extremely high het rate (SNPs are denser than kmer length) but I think that's uncommon in most organisms, and anyway, that would make the genome look bigger. It's also possible that you have a diploid with an extremely low het rate. Could it be that you have a super-inbred sample? If the organism has no (or very few) heterozygous sites, it would be impossible to tell a 1.9Gbp diploid (1.9Gbp per nucleus, meaning 950Mbp haploid size) from a 950Mbp haploid based on the kmer frequency histogram; they'd look identical. Last edited by Brian Bushnell; 07-10-2017 at 03:27 PM.
 07-12-2017, 06:15 AM #7 horvathdp Member   Location: Fargo Join Date: Dec 2011 Posts: 64 Interesting. The genome size as estimated by flow cytometry was 2.1 Gb. The organism has been reasonably well studied cytogenetically (it is one of the worst invasive weeds in the great plains). It is most definitely a hexaploid - though the data suggests that the duplication of one of the genomes was very recent and the third genome is from a very closely related species. I ran a SNP analyses, and granted that although I was being very conservative, I only identified about 150K SNPs after mapping the fragments back to the assembly. Thus my first thought was that there was almost no heterozygosity, and the three genomes were so closely related that the kmers were collapsing the sequence. The Jellyfish method as I understand it is described here: http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html Essentially you just sum the non-error kmers and divide by the coverage. The Jellyfish run was done with a kmer of 21, and the BBMAP with the default of 31(?). However, I am about to rerun the BBMAP using the same Kmer size (and with the ploidy level specified). It doesn't take long, so I will also run at a number of different Kmer sizes just to see what that does to the curve. I found that the genome size estimates from BBMAP were between 840 Mb (K=21, ploidy=6) to 1.1Gb (default). I guess my primary question is: Which method ((kmer count)/coverage ) or BBMAP) is the best for estimating genome size and why?
 07-12-2017, 10:00 AM #8 Brian Bushnell Super Moderator   Location: Walnut Creek, CA Join Date: Jan 2014 Posts: 2,707 If that's a hexaploid, then presuming the primary peak is the 6x homogeneous peak, the coverage is too low to see the haploid heterozygous peak, which would be at around 2.5x and indistinguishable from error kmers. Which makes it difficult to estimate the genome size. That said, if it's a hexaploid, and again presuming the primary peak is the 6x peak, that would indicate the total genome size (total number of DNA bp within each cell) is actually something like 6*950Mbp = 5.7Gbp which is much higher than your flow cytometry estimate. For a haploid or diploid, with sufficient coverage, kmer-based genome size estimation is extremely accurate. When the coverage is insufficient, or the ploidy becomes pretty high, or the peaks are indistinct, it's less accurate. But in this case I can't really resolve the discrepancies... if it were me, I'd try to unbundle the DNA and look at it with an electron microscope and get a pretty picture like this: Then it would be easy! But, more seriously - BBMap's KmerCountExact uses a default kmer length of 31. You can use longer or shorter kmers. Normally, the longer the kmer, the bigger the genome size estimate, since fewer short repeats are collapsed (collapsed short repeats often form a distinct peak which for genome size calculation will be multiplied by their multiplicity, but in this dataset there is only 1 peak). I guess the best thing I can do is just describe exactly how peaks.txt is generated. The kmer-frequency histogram is pretty straightforward. The choice of K is driven mainly by the length and error rate of your reads - given sufficient coverage, error-free reads, and perfectly uniform coverage distribution, the genome size estimate should theoretically be independent of K (though in practice it's obviously not). For low coverage as in this case a shorter K may be useful. The peaks file is generated by walking through the histogram and finding local maxima, then delimiting them by adjacent local minima. The volume is the sum of all unique kmers within that range. Once the peaks are defined (and in this case, there is only 1 clear peak... peaks.txt will probably report more peaks, but they are likely spurious and very small) the largest one is assumed to be the primary (homozygous) peak, though it depends on whether you set the ploidy flag. Assuming you don't set the ploidy flag, any peaks lower than the primary are assumed to be heterozygous peaks. The ploidy is then calculated by the ratio of the primary to the lowest peak - e.g., if the primary is at 60x and the lowest is 10x, the ploidy would be calculated as 6. Peaks higher than the primary are considered as replications. So, if the primary is at 60x and there is a smaller peak at 120x, and another at 180x, those would be considered 2x and 3x copy repeats. After 4x or so it all blurs together and the peaks cannot clearly be resolved. To calculate the genome size for a hexaploid, it uses: 1*(volume of the peak at 1/6 of the coverage of the primary) +2*(volume of the peak at 2/6 of the coverage of the primary) +3*(volume of the peak at 3/6 of the coverage of the primary) +4*(volume of the peak at 4/6 of the coverage of the primary) +5*(volume of the peak at 5/6 of the coverage of the primary) +6*(volume of the primary peak) +X*(volume of any higher peak that is X/6 of the coverage of the primary) That gives the total DNA in a cell. The haploid size would be that number divided by the ploidy. But in this case there is only 1 peak so the formula is not very useful. Also note that in practice, for a tetraploid with sufficient coverage (I have little experience with hexaploids), I typically see a peak at 1/4, 2/4, and 4/4, but never at 3/4. I'm not sure why; it should exist and be of equal magnitude to the 1/4 peak, assuming all variation is caused by SNPs. I guess this indicates insertion events. Last edited by Brian Bushnell; 07-12-2017 at 10:05 AM.
 07-12-2017, 10:15 AM #9 horvathdp Member   Location: Fargo Join Date: Dec 2011 Posts: 64 Your patience and kindness (not to mention the detailed response) is quite appreciated. Given all the assistance you have been over the years on this project, I feel I should list you as an author :-).
 07-12-2017, 10:43 AM #10 Brian Bushnell Super Moderator   Location: Walnut Creek, CA Join Date: Jan 2014 Posts: 2,707 I would appreciate it, but please don't feel obligated. As long as I'm cited, I'll be happy
 01-04-2018, 06:12 AM #11 niuyw Junior Member   Location: china Join Date: Feb 2016 Posts: 1 Dear all and Brian Bushnell, I'm using kmercountexact.sh to estimate the genome size of an insect. I tried 17 mer and 31 mer, but got different results. After reading all the discussions above, I still don't know how to do (perhaps due to my poor English...). Here are the codes and results: 1. k=17 Code: `kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq k=17 khist=270.his out=270 peaks=270.peaks` and the 270.peaks: HTML Code: ```#k 17 #unique_kmers 502145545 #main_peak 47 #genome_size 574333111 #haploid_genome_size 574333111 #fold_coverage 24 #haploid_fold_coverage 24 #ploidy 1 #percent_repeat 86.388 #start center stop max volume 10 24 32 4706167 78178249 32 47 118 5933924 195017237 1097 1103 1256 2449 337060 1256 1259 1300 1819 77545 1300 1303 1368 1756 108489 1368 1372 1374 1548 8989 1374 1376 1391 1491 24859 1391 1394 1421 1487 42110 1421 1423 1435 1424 18973 1435 1443 1498 1413 80666 1498 1500 1517 1188 22777 1517 1543 3853 1167 1077482``` 2. k=31 Code: `kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq khist=270.31mer.his out=270.31mer peaks=270.31mer.peaks` and the 270.31mer.peaks: HTML Code: ```#k 31 #unique_kmers 877006010 #main_peak 41 #genome_size 846756437 #haploid_genome_size 423378218 #fold_coverage 20 #haploid_fold_coverage 41 #ploidy 2 #het_rate 0.01748 #percent_repeat 13.342 #start center stop max volume 8 20 32 12998950 229457745 32 41 103 10571247 252163197 702 705 748 4040 175332 748 754 971 3523 588862 971 973 1002 2005 59786 1002 1007 1064 1870 108141 1064 1065 1121 1635 89229 1121 1123 1147 1515 37584 1147 1154 1189 1392 56941 1189 1198 1214 1327 32072 1214 1216 1295 1267 95219 1295 1304 3260 1131 896204``` As you can see, the genome size are very different and ploidy are different. The insect is diploid, and it seems that the ploidy predicted is more accurate when using 31 mer. I've also tried the Jellyfish way (http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html) using 17 mer and got 707955101 bp. But I'm little doublt, is it a haploid size or total size? Any suggestions would be appreciated. I'm totally stucked.
 01-04-2018, 06:30 AM #12 horvathdp Member   Location: Fargo Join Date: Dec 2011 Posts: 64 I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene. There would be a greater chance for two 31mers -each from a different allelic copies genes to show differences that could distinguish between the fragments. I would guess that Jellyfish might have done a better job of distinguishing differences between alleles, but it was still working with a fairly small kmer size. I'd bet that increasing the kmer size in Jellyfish will give a closer genome size estimate obtained from the 31kmer BBMAP run. I had a similar issue with my organism (a purported hexaploid weed), where the 3 genomes are so similar and had such little heterozygosity that even with a kmer of 35, the genomes all collapsed and I got a genome size estimate that was a bit over a third of what we got for a genome size for with flow cytometry, and it also came out as a haploid.
01-04-2018, 11:27 AM   #13
Brian Bushnell
Super Moderator

Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
 Originally Posted by horvathdp I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene.
I concur; 17 is really too short for this purpose. When trying to estimate genome size, it's important for most of the kmers to be unique (aside from long perfect repeats); so, kmer lengths greater than 20 are best. I normally choose 31 because it is the longest kmer that can be represented in a memory-efficient way.