Seqanswers Leaderboard Ad

**Brian Bushnell** · 11-06-2015, 10:38 AM

Hi!

Unfortunately, it looks like ploidy was estimated wrong; I don't think anything has a ploidy of 3. If the peak calls are correct, it's probably a highly polymorphic tetraploid, or a highly repetitive diploid or haploid. The full genome size would be approximately:

1*100786874 + 2*913749 + 4*41304813 + 8*1199939 + 10*138959 ...etc

Yielding 278822726, and so including all the peaks it would be around 285812914 as the program indicated. However the multipliers for the various peaks are hard to determine. Generally, the multiplier is the the peak's center divided by the first peak's center, so the first multiplier is 31/31=1, the second is 65/31=2. But the third is 112/31=3.6, which is hard to tell if it's 3 or 4 (they are all integers). It's probably 4.

Thus the genome is around 285Mbp all told, but it's hard to figure out the ploidy which makes it hard to figure out the haploid size. If it's a haploid then the genome is ~285MB. If it's tetraploid then you divide it by 4 and get 4 copies of a 71Mbp genome. For example, a human cell has around 6 Gbp of DNA in it, but because humans are diploid, the haploid representation of the genome is around 3 Gbp long.

The number of unique kmers is not important. Fold coverage is the amount of times you have sequenced each base, on average. Haploid genome size is the size of the genome divided by the ploidy. Haploid fold coverage is the number of times you have sequenced each base in the haploid representation, if the ploidy estimate was correct, which it isn't. Ploidy is the number of copies of each chromosome per cell. Het rate is an estimate of the fraction of the genome that is heterozygous (usually meaning a SNP); so 0.03413 indicates that 3.4% of the genome is heterozygous, meaning there is a SNP every 30 bases, which is very high. Percent repeat is the percent of the genome that occurs more than once as an exact copy; this number comes from analyzing the high-depth peaks, which come from repeat regions. However, if the ploidy estimate is off, then the het rate and percent repeat will also be off.

Usually the ploidy estimate works well for haploid and diploid organisms, but tetraploid and higher are difficult.

As for the post you pointed to, I was totally wrong about the formula, it should have been multiplication, not division. I will edit it.

**horvathdp** · 07-10-2017, 08:43 AM

Hi All (and particularly Brian Bushnell),
I am trying to estimate the genome size of my leafy spurge genome with BBMAP. The ouput says my genome size is a haploid of ~950MB. However, using a procedure described for Jellyfish kmer counts – (sum of (kmer count* the number of kmers))/the peak fold I get 4.6 Gb. The flow cytometry data we have for leafy spurge is 2.1Gb for the 1x genome. From the discussion above, it seems that the BBMAP program measures area under the curve for each peak to get it’s genome size estimate. Given the additional data from the flow cytometry results, I am inclined to believe the method described for the Jellyfish data, but I was wondering why the two methods give such different results, and if there was any reason to prefer the BBMAP method.

**Brian Bushnell** · 07-10-2017, 08:54 AM

The estimate is provided for convenience, but the accuracy is not guaranteed since it has to guess the peak locations, peak limits, and ploidy of the organism. It can be somewhat more accurate if you know the ploidy, in which case you can tell it the ploidy, but it still has to guess which peak is the homozygous peak, etc.

I find that I can usually do a better job manually by eyeballing the peaks (in conjunction with the exact peak volumes from peaks.txt). Can you post an image of the graph, and the contents of peaks.txt?

**horvathdp** · 07-10-2017, 11:48 AM

Thanks!!!

Here is a link to the data:
https://drive.google.com/file/d/0Byp...ew?usp=sharing. I didn't know I could define the ploidy level. Can you share a hypothetical script with that option indicated?

**Brian Bushnell** · 07-10-2017, 02:21 PM

Visually, that looks like a very obvious haploid to me, with median coverage at ~16x and genome size of ~700Mbp for the unique stuff (using SUM(B8:B28) from your spreadsheet) and a couple hundred Mbp for repetitive stuff (sum of e.g. B29*$A$18/A31 for subsequent lines), so I'd say that KmerCountExact's result is correct for both ploidy and genome size.

You can specify the ploidy by adding the flag "ploidy=1" or similar:

kmercountexact.sh in=reads.fq poloidy=1 khist=khist.txt peaks=peaks.txt

Can you describe the Jellyfish procedure in more detail, or link to it? Also, what kmer length are you using, how long are your reads, and what do you expect for the genome ploidy? Sometimes you can't tell apart a haploid from a diploid that has an extremely high het rate (SNPs are denser than kmer length) but I think that's uncommon in most organisms, and anyway, that would make the genome look bigger. It's also possible that you have a diploid with an extremely low het rate. Could it be that you have a super-inbred sample? If the organism has no (or very few) heterozygous sites, it would be impossible to tell a 1.9Gbp diploid (1.9Gbp per nucleus, meaning 950Mbp haploid size) from a 950Mbp haploid based on the kmer frequency histogram; they'd look identical.

**horvathdp** · 07-12-2017, 05:15 AM

Interesting. The genome size as estimated by flow cytometry was 2.1 Gb. The organism has been reasonably well studied cytogenetically (it is one of the worst invasive weeds in the great plains). It is most definitely a hexaploid - though the data suggests that the duplication of one of the genomes was very recent and the third genome is from a very closely related species. I ran a SNP analyses, and granted that although I was being very conservative, I only identified about 150K SNPs after mapping the fragments back to the assembly. Thus my first thought was that there was almost no heterozygosity, and the three genomes were so closely related that the kmers were collapsing the sequence. The Jellyfish method as I understand it is described here: http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html

Essentially you just sum the non-error kmers and divide by the coverage.

The Jellyfish run was done with a kmer of 21, and the BBMAP with the default of 31(?). However, I am about to rerun the BBMAP using the same Kmer size (and with the ploidy level specified). It doesn't take long, so I will also run at a number of different Kmer sizes just to see what that does to the curve.

I found that the genome size estimates from BBMAP were between 840 Mb (K=21, ploidy=6) to 1.1Gb (default).

I guess my primary question is: Which method ((kmer count)/coverage ) or BBMAP) is the best for estimating genome size and why?

**Brian Bushnell** · 07-12-2017, 09:00 AM

If that's a hexaploid, then presuming the primary peak is the 6x homogeneous peak, the coverage is too low to see the haploid heterozygous peak, which would be at around 2.5x and indistinguishable from error kmers. Which makes it difficult to estimate the genome size. That said, if it's a hexaploid, and again presuming the primary peak is the 6x peak, that would indicate the total genome size (total number of DNA bp within each cell) is actually something like 6*950Mbp = 5.7Gbp which is much higher than your flow cytometry estimate.

For a haploid or diploid, with sufficient coverage, kmer-based genome size estimation is extremely accurate. When the coverage is insufficient, or the ploidy becomes pretty high, or the peaks are indistinct, it's less accurate. But in this case I can't really resolve the discrepancies... if it were me, I'd try to unbundle the DNA and look at it with an electron microscope and get a pretty picture like this:

Then it would be easy!

But, more seriously -

BBMap's KmerCountExact uses a default kmer length of 31. You can use longer or shorter kmers. Normally, the longer the kmer, the bigger the genome size estimate, since fewer short repeats are collapsed (collapsed short repeats often form a distinct peak which for genome size calculation will be multiplied by their multiplicity, but in this dataset there is only 1 peak).

I guess the best thing I can do is just describe exactly how peaks.txt is generated. The kmer-frequency histogram is pretty straightforward. The choice of K is driven mainly by the length and error rate of your reads - given sufficient coverage, error-free reads, and perfectly uniform coverage distribution, the genome size estimate should theoretically be independent of K (though in practice it's obviously not). For low coverage as in this case a shorter K may be useful.

The peaks file is generated by walking through the histogram and finding local maxima, then delimiting them by adjacent local minima. The volume is the sum of all unique kmers within that range. Once the peaks are defined (and in this case, there is only 1 clear peak... peaks.txt will probably report more peaks, but they are likely spurious and very small) the largest one is assumed to be the primary (homozygous) peak, though it depends on whether you set the ploidy flag. Assuming you don't set the ploidy flag, any peaks lower than the primary are assumed to be heterozygous peaks. The ploidy is then calculated by the ratio of the primary to the lowest peak - e.g., if the primary is at 60x and the lowest is 10x, the ploidy would be calculated as 6. Peaks higher than the primary are considered as replications. So, if the primary is at 60x and there is a smaller peak at 120x, and another at 180x, those would be considered 2x and 3x copy repeats. After 4x or so it all blurs together and the peaks cannot clearly be resolved.

To calculate the genome size for a hexaploid, it uses:

1*(volume of the peak at 1/6 of the coverage of the primary)
+2*(volume of the peak at 2/6 of the coverage of the primary)
+3*(volume of the peak at 3/6 of the coverage of the primary)
+4*(volume of the peak at 4/6 of the coverage of the primary)
+5*(volume of the peak at 5/6 of the coverage of the primary)
+6*(volume of the primary peak)
+X*(volume of any higher peak that is X/6 of the coverage of the primary)

That gives the total DNA in a cell. The haploid size would be that number divided by the ploidy. But in this case there is only 1 peak so the formula is not very useful. Also note that in practice, for a tetraploid with sufficient coverage (I have little experience with hexaploids), I typically see a peak at 1/4, 2/4, and 4/4, but never at 3/4. I'm not sure why; it should exist and be of equal magnitude to the 1/4 peak, assuming all variation is caused by SNPs. I guess this indicates insertion events.

**horvathdp** · 07-12-2017, 09:15 AM

Your patience and kindness (not to mention the detailed response) is quite appreciated. Given all the assistance you have been over the years on this project, I feel I should list you as an author :-).

**Brian Bushnell** · 07-12-2017, 09:43 AM

I would appreciate it, but please don't feel obligated. As long as I'm cited, I'll be happy

**niuyw** · 01-04-2018, 06:12 AM

Dear all and Brian Bushnell,

I'm using kmercountexact.sh to estimate the genome size of an insect. I tried 17 mer and 31 mer, but got different results. After reading all the discussions above, I still don't know how to do (perhaps due to my poor English...). Here are the codes and results:

1. k=17

Code:

kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq k=17 khist=270.his out=270 peaks=270.peaks

and the 270.peaks:

HTML Code:

#k	17
#unique_kmers	502145545
#main_peak	47
#genome_size	574333111
#haploid_genome_size	574333111
#fold_coverage	24
#haploid_fold_coverage	24
#ploidy	1
#percent_repeat	86.388
#start	center	stop	max	volume
10	24	32	4706167	78178249	
32	47	118	5933924	195017237	
1097	1103	1256	2449	337060	
1256	1259	1300	1819	77545	
1300	1303	1368	1756	108489	
1368	1372	1374	1548	8989	
1374	1376	1391	1491	24859	
1391	1394	1421	1487	42110	
1421	1423	1435	1424	18973	
1435	1443	1498	1413	80666	
1498	1500	1517	1188	22777	
1517	1543	3853	1167	1077482

2. k=31

Code:

kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq khist=270.31mer.his out=270.31mer peaks=270.31mer.peaks

and the 270.31mer.peaks:

HTML Code:

#k	31
#unique_kmers	877006010
#main_peak	41
#genome_size	846756437
#haploid_genome_size	423378218
#fold_coverage	20
#haploid_fold_coverage	41
#ploidy	2
#het_rate	0.01748
#percent_repeat	13.342
#start	center	stop	max	volume
8	20	32	12998950	229457745	
32	41	103	10571247	252163197	
702	705	748	4040	175332	
748	754	971	3523	588862	
971	973	1002	2005	59786	
1002	1007	1064	1870	108141	
1064	1065	1121	1635	89229	
1121	1123	1147	1515	37584	
1147	1154	1189	1392	56941	
1189	1198	1214	1327	32072	
1214	1216	1295	1267	95219	
1295	1304	3260	1131	896204

As you can see, the genome size are very different and ploidy are different. The insect is diploid, and it seems that the ploidy predicted is more accurate when using 31 mer.

I've also tried the Jellyfish way (http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html) using 17 mer and got 707955101 bp. But I'm little doublt, is it a haploid size or total size?

Any suggestions would be appreciated. I'm totally stucked.

**horvathdp** · 01-04-2018, 06:30 AM

I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene. There would be a greater chance for two 31mers -each from a different allelic copies genes to show differences that could distinguish between the fragments. I would guess that Jellyfish might have done a better job of distinguishing differences between alleles, but it was still working with a fairly small kmer size. I'd bet that increasing the kmer size in Jellyfish will give a closer genome size estimate obtained from the 31kmer BBMAP run. I had a similar issue with my organism (a purported hexaploid weed), where the 3 genomes are so similar and had such little heterozygosity that even with a kmer of 35, the genomes all collapsed and I got a genome size estimate that was a bit over a third of what we got for a genome size for with flow cytometry, and it also came out as a haploid.

**Brian Bushnell** · 01-04-2018, 11:27 AM

Originally posted by horvathdp View Post

I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene.

I concur; 17 is really too short for this purpose. When trying to estimate genome size, it's important for most of the kmers to be unique (aside from long perfect repeats); so, kmer lengths greater than 20 are best. I normally choose 31 because it is the longest kmer that can be represented in a memory-efficient way.

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Genome size estimation using BBMAP

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News