syintel87 11-06-2015 09:12 AM

Genome size estimation using BBMAP

Hi all,
I have 21,005,534 x 2 paired-end genome sequencing data.
To estimate the genome size, I used BBMAP.
Is there anyone who could help to interpret the output?

The command:
Code:

`./kmercountexact.sh in1=read1.fastq in2=read2.fastq khist=khist.txt peaks=peaks.txt`
The 'peaks.txt' output:
Quote:
 #k 31 #unique_kmers 1432946339 #main_peak 31 #genome_size 285812914 #haploid_genome_size 95270971 #fold_coverage 31 #haploid_fold_coverage 112 #ploidy 3 #het_rate 0.03413 #percent_repeat 6.291 #start center stop max volume 13 31 58 6211343 100786874 58 65 71 73206 913749 71 112 170 1196232 41304813 170 224 285 19062 1199939 285 324 325 3981 138959 325 329 396 4084 201426 396 414 443 1692 75732 443 447 525 1481 92625 525 526 535 833 7869 535 539 541 822 4724 541 543 557 797 12164 557 565 1413 725 181119
Q1.
to calculate the genome size, 100786874 + 913749/2 + 41304813/3 + ... + 12164/11 + 181119/12 = 144,919,993.
However, on the 4th line of the output, the genome size is 285,812,914.
What is the real genome size?

Q2.
Also, what does each line of the output mean (e.g. unique_kmers, haploid_genome_size, haploid_fold_coverage, ploidy, het_rate, percent_repeat)?

 Brian Bushnell 11-06-2015 10:38 AM

Hi!

Unfortunately, it looks like ploidy was estimated wrong; I don't think anything has a ploidy of 3. If the peak calls are correct, it's probably a highly polymorphic tetraploid, or a highly repetitive diploid or haploid. The full genome size would be approximately:

1*100786874 + 2*913749 + 4*41304813 + 8*1199939 + 10*138959 ...etc

Yielding 278822726, and so including all the peaks it would be around 285812914 as the program indicated. However the multipliers for the various peaks are hard to determine. Generally, the multiplier is the the peak's center divided by the first peak's center, so the first multiplier is 31/31=1, the second is 65/31=2. But the third is 112/31=3.6, which is hard to tell if it's 3 or 4 (they are all integers). It's probably 4.

Thus the genome is around 285Mbp all told, but it's hard to figure out the ploidy which makes it hard to figure out the haploid size. If it's a haploid then the genome is ~285MB. If it's tetraploid then you divide it by 4 and get 4 copies of a 71Mbp genome. For example, a human cell has around 6 Gbp of DNA in it, but because humans are diploid, the haploid representation of the genome is around 3 Gbp long.

The number of unique kmers is not important. Fold coverage is the amount of times you have sequenced each base, on average. Haploid genome size is the size of the genome divided by the ploidy. Haploid fold coverage is the number of times you have sequenced each base in the haploid representation, if the ploidy estimate was correct, which it isn't. Ploidy is the number of copies of each chromosome per cell. Het rate is an estimate of the fraction of the genome that is heterozygous (usually meaning a SNP); so 0.03413 indicates that 3.4% of the genome is heterozygous, meaning there is a SNP every 30 bases, which is very high. Percent repeat is the percent of the genome that occurs more than once as an exact copy; this number comes from analyzing the high-depth peaks, which come from repeat regions. However, if the ploidy estimate is off, then the het rate and percent repeat will also be off.

Usually the ploidy estimate works well for haploid and diploid organisms, but tetraploid and higher are difficult.

As for the post you pointed to, I was totally wrong about the formula, it should have been multiplication, not division. I will edit it.

 horvathdp 07-10-2017 09:43 AM

Hi All (and particularly Brian Bushnell),
I am trying to estimate the genome size of my leafy spurge genome with BBMAP. The ouput says my genome size is a haploid of ~950MB. However, using a procedure described for Jellyfish kmer counts – (sum of (kmer count* the number of kmers))/the peak fold I get 4.6 Gb. The flow cytometry data we have for leafy spurge is 2.1Gb for the 1x genome. From the discussion above, it seems that the BBMAP program measures area under the curve for each peak to get it’s genome size estimate. Given the additional data from the flow cytometry results, I am inclined to believe the method described for the Jellyfish data, but I was wondering why the two methods give such different results, and if there was any reason to prefer the BBMAP method.

 Brian Bushnell 07-10-2017 09:54 AM

The estimate is provided for convenience, but the accuracy is not guaranteed since it has to guess the peak locations, peak limits, and ploidy of the organism. It can be somewhat more accurate if you know the ploidy, in which case you can tell it the ploidy, but it still has to guess which peak is the homozygous peak, etc.

I find that I can usually do a better job manually by eyeballing the peaks (in conjunction with the exact peak volumes from peaks.txt). Can you post an image of the graph, and the contents of peaks.txt?

 horvathdp 07-10-2017 12:48 PM

Thanks!!!

Here is a link to the data:
https://drive.google.com/file/d/0Byp...ew?usp=sharing. I didn't know I could define the ploidy level. Can you share a hypothetical script with that option indicated?

 Brian Bushnell 07-10-2017 03:21 PM

Visually, that looks like a very obvious haploid to me, with median coverage at ~16x and genome size of ~700Mbp for the unique stuff (using SUM(B8:B28) from your spreadsheet) and a couple hundred Mbp for repetitive stuff (sum of e.g. B29*\$A\$18/A31 for subsequent lines), so I'd say that KmerCountExact's result is correct for both ploidy and genome size.

You can specify the ploidy by adding the flag "ploidy=1" or similar:

Can you describe the Jellyfish procedure in more detail, or link to it? Also, what kmer length are you using, how long are your reads, and what do you expect for the genome ploidy? Sometimes you can't tell apart a haploid from a diploid that has an extremely high het rate (SNPs are denser than kmer length) but I think that's uncommon in most organisms, and anyway, that would make the genome look bigger. It's also possible that you have a diploid with an extremely low het rate. Could it be that you have a super-inbred sample? If the organism has no (or very few) heterozygous sites, it would be impossible to tell a 1.9Gbp diploid (1.9Gbp per nucleus, meaning 950Mbp haploid size) from a 950Mbp haploid based on the kmer frequency histogram; they'd look identical.

 horvathdp 07-12-2017 06:15 AM

Interesting. The genome size as estimated by flow cytometry was 2.1 Gb. The organism has been reasonably well studied cytogenetically (it is one of the worst invasive weeds in the great plains). It is most definitely a hexaploid - though the data suggests that the duplication of one of the genomes was very recent and the third genome is from a very closely related species. I ran a SNP analyses, and granted that although I was being very conservative, I only identified about 150K SNPs after mapping the fragments back to the assembly. Thus my first thought was that there was almost no heterozygosity, and the three genomes were so closely related that the kmers were collapsing the sequence. The Jellyfish method as I understand it is described here: http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html

Essentially you just sum the non-error kmers and divide by the coverage.

The Jellyfish run was done with a kmer of 21, and the BBMAP with the default of 31(?). However, I am about to rerun the BBMAP using the same Kmer size (and with the ploidy level specified). It doesn't take long, so I will also run at a number of different Kmer sizes just to see what that does to the curve.

I found that the genome size estimates from BBMAP were between 840 Mb (K=21, ploidy=6) to 1.1Gb (default).

I guess my primary question is: Which method ((kmer count)/coverage ) or BBMAP) is the best for estimating genome size and why?

 Brian Bushnell 07-12-2017 10:00 AM

If that's a hexaploid, then presuming the primary peak is the 6x homogeneous peak, the coverage is too low to see the haploid heterozygous peak, which would be at around 2.5x and indistinguishable from error kmers. Which makes it difficult to estimate the genome size. That said, if it's a hexaploid, and again presuming the primary peak is the 6x peak, that would indicate the total genome size (total number of DNA bp within each cell) is actually something like 6*950Mbp = 5.7Gbp which is much higher than your flow cytometry estimate.

For a haploid or diploid, with sufficient coverage, kmer-based genome size estimation is extremely accurate. When the coverage is insufficient, or the ploidy becomes pretty high, or the peaks are indistinct, it's less accurate. But in this case I can't really resolve the discrepancies... if it were me, I'd try to unbundle the DNA and look at it with an electron microscope and get a pretty picture like this:

Then it would be easy!

But, more seriously -

BBMap's KmerCountExact uses a default kmer length of 31. You can use longer or shorter kmers. Normally, the longer the kmer, the bigger the genome size estimate, since fewer short repeats are collapsed (collapsed short repeats often form a distinct peak which for genome size calculation will be multiplied by their multiplicity, but in this dataset there is only 1 peak).

I guess the best thing I can do is just describe exactly how peaks.txt is generated. The kmer-frequency histogram is pretty straightforward. The choice of K is driven mainly by the length and error rate of your reads - given sufficient coverage, error-free reads, and perfectly uniform coverage distribution, the genome size estimate should theoretically be independent of K (though in practice it's obviously not). For low coverage as in this case a shorter K may be useful.

The peaks file is generated by walking through the histogram and finding local maxima, then delimiting them by adjacent local minima. The volume is the sum of all unique kmers within that range. Once the peaks are defined (and in this case, there is only 1 clear peak... peaks.txt will probably report more peaks, but they are likely spurious and very small) the largest one is assumed to be the primary (homozygous) peak, though it depends on whether you set the ploidy flag. Assuming you don't set the ploidy flag, any peaks lower than the primary are assumed to be heterozygous peaks. The ploidy is then calculated by the ratio of the primary to the lowest peak - e.g., if the primary is at 60x and the lowest is 10x, the ploidy would be calculated as 6. Peaks higher than the primary are considered as replications. So, if the primary is at 60x and there is a smaller peak at 120x, and another at 180x, those would be considered 2x and 3x copy repeats. After 4x or so it all blurs together and the peaks cannot clearly be resolved.

To calculate the genome size for a hexaploid, it uses:

1*(volume of the peak at 1/6 of the coverage of the primary)
+2*(volume of the peak at 2/6 of the coverage of the primary)
+3*(volume of the peak at 3/6 of the coverage of the primary)
+4*(volume of the peak at 4/6 of the coverage of the primary)
+5*(volume of the peak at 5/6 of the coverage of the primary)
+6*(volume of the primary peak)
+X*(volume of any higher peak that is X/6 of the coverage of the primary)

That gives the total DNA in a cell. The haploid size would be that number divided by the ploidy. But in this case there is only 1 peak so the formula is not very useful. Also note that in practice, for a tetraploid with sufficient coverage (I have little experience with hexaploids), I typically see a peak at 1/4, 2/4, and 4/4, but never at 3/4. I'm not sure why; it should exist and be of equal magnitude to the 1/4 peak, assuming all variation is caused by SNPs. I guess this indicates insertion events.

 horvathdp 07-12-2017 10:15 AM

Your patience and kindness (not to mention the detailed response) is quite appreciated. Given all the assistance you have been over the years on this project, I feel I should list you as an author :-).

 Brian Bushnell 07-12-2017 10:43 AM

I would appreciate it, but please don't feel obligated. As long as I'm cited, I'll be happy :)

 niuyw 01-04-2018 06:12 AM

Dear all and Brian Bushnell,

I'm using kmercountexact.sh to estimate the genome size of an insect. I tried 17 mer and 31 mer, but got different results. After reading all the discussions above, I still don't know how to do (perhaps due to my poor English...). Here are the codes and results:

1. k=17
Code:

`kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq k=17 khist=270.his out=270 peaks=270.peaks`
and the 270.peaks:
HTML Code:

```#k        17 #unique_kmers        502145545 #main_peak        47 #genome_size        574333111 #haploid_genome_size        574333111 #fold_coverage        24 #haploid_fold_coverage        24 #ploidy        1 #percent_repeat        86.388 #start        center        stop        max        volume 10        24        32        4706167        78178249        32        47        118        5933924        195017237        1097        1103        1256        2449        337060        1256        1259        1300        1819        77545        1300        1303        1368        1756        108489        1368        1372        1374        1548        8989        1374        1376        1391        1491        24859        1391        1394        1421        1487        42110        1421        1423        1435        1424        18973        1435        1443        1498        1413        80666        1498        1500        1517        1188        22777        1517        1543        3853        1167        1077482```
2. k=31
Code:

`kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq khist=270.31mer.his out=270.31mer peaks=270.31mer.peaks`
and the 270.31mer.peaks:
HTML Code:

```#k        31 #unique_kmers        877006010 #main_peak        41 #genome_size        846756437 #haploid_genome_size        423378218 #fold_coverage        20 #haploid_fold_coverage        41 #ploidy        2 #het_rate        0.01748 #percent_repeat        13.342 #start        center        stop        max        volume 8        20        32        12998950        229457745        32        41        103        10571247        252163197        702        705        748        4040        175332        748        754        971        3523        588862        971        973        1002        2005        59786        1002        1007        1064        1870        108141        1064        1065        1121        1635        89229        1121        1123        1147        1515        37584        1147        1154        1189        1392        56941        1189        1198        1214        1327        32072        1214        1216        1295        1267        95219        1295        1304        3260        1131        896204```
As you can see, the genome size are very different and ploidy are different. The insect is diploid, and it seems that the ploidy predicted is more accurate when using 31 mer.

I've also tried the Jellyfish way (http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html) using 17 mer and got 707955101 bp. But I'm little doublt, is it a haploid size or total size?

Any suggestions would be appreciated. I'm totally stucked.

 horvathdp 01-04-2018 06:30 AM

I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene. There would be a greater chance for two 31mers -each from a different allelic copies genes to show differences that could distinguish between the fragments. I would guess that Jellyfish might have done a better job of distinguishing differences between alleles, but it was still working with a fairly small kmer size. I'd bet that increasing the kmer size in Jellyfish will give a closer genome size estimate obtained from the 31kmer BBMAP run. I had a similar issue with my organism (a purported hexaploid weed), where the 3 genomes are so similar and had such little heterozygosity that even with a kmer of 35, the genomes all collapsed and I got a genome size estimate that was a bit over a third of what we got for a genome size for with flow cytometry, and it also came out as a haploid.

 Brian Bushnell 01-04-2018 11:27 AM

Quote:
 Originally Posted by horvathdp (Post 213830) I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene.
I concur; 17 is really too short for this purpose. When trying to estimate genome size, it's important for most of the kmers to be unique (aside from long perfect repeats); so, kmer lengths greater than 20 are best. I normally choose 31 because it is the longest kmer that can be represented in a memory-efficient way.

 All times are GMT -8. The time now is 08:36 PM.