Seqanswers Leaderboard Ad

**Brian Bushnell** · 02-19-2015, 10:55 AM

You can do that with BBTools like this:

kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt

You can examine the histogram manually, or use the "peaks" file which tells you the number of unique kmers in each peak on the histogram. For a diploid, the first peak will be the het peak, the second will be the homozygous peak, and the rest will be repeat peaks. The peak caller is not perfect, though, so particularly with noisy data I would only rely on it for the first two peaks, and try to quantify the higher-order peaks manually if you need to (which you generally don't).

I think AllPaths-LG will report het rate, also.

**cvargasc** · 07-08-2015, 05:05 AM

Dear Brian,

Thank you for developing such an useful an versatile tool. I was wondering how to interpretate both files to gain insight into the heterozygosity. My peaks file was generated using information from a highly heterozygous diploid organism, however there is a third peak with a greater volume than any other, why could this be happening?

Thanks a lot!

#start center stop max volume
10 14 18 4633863 36855138
18 29 36 4913455 85947645
36 59 103 12178207 347426111
103 105 106 324995 972461
106 114 228 331479 18667384
228 229 230 56194 111376
230 231 234 55293 217949
234 235 99931 54052 12688071

**Brian Bushnell** · 07-08-2015, 10:00 AM

There are several possible explanations for this, such as organelles, contamination, a symbiont, or mozaicism. But in this case, noting that the peaks are centered at 14, 29, and 59, it very much looks like the organism is tetraploid, or was tetraploid in the past and is now diploid but each pair of chromosomes has a very similar additional pair. How confident are you that it's diploid?

Organelles is unlikely because they generally have a small, very high copy peak. You can check for contamination/symbionts by plotting a gc-content histogram of the raw reads (reformat.sh in=reads.fq gchist=gchist.txt gcbins=auto) or assembling, mapping, and plotting gc versus coverage (this can be generated by BBMap with the covstats output) or plotting a PCA of tetramer frequencies of the contigs.

The latest version of BBTools (35.14) does smoothing of the kmer histogram prior to calling peaks, which greatly increases the accuracy, particularly for high-copy-count peaks.

**cvargasc** · 07-08-2015, 10:45 AM

Thank you for your prompt reply! A karyotype was performed and it was determined that the organism is diploid, however the possibility of a recent duplication is not difficult to imagine.
Regarding symbionts or organelles, we would only expect the mitochondria, we are quite confident that this sample is symbiont free.
Could you please elaborate into why you think this sample is tetraploid? Also, how could I quantify heterozygosity using the output?
Thanks!

**Brian Bushnell** · 07-08-2015, 11:01 AM

Typically, for a polyploid organism, the peak with the largest volume (or the major peak with the greatest depth) represents the homozygous portion of the genome. For a diploid organism with 60x coverage, you would expect two peaks - one at 60x (the homozygous peak) and one at 30x (the heterozygous peak). As long as the het rate is not super high, the homozygous peak will be bigger; though with an extremely high het rate, the heterozygous peak could be bigger. But, instead of homozygous and heterozygous peaks, let's call them the 1-copy and 2-copy peaks.

With a tetraploid, you would expect 4 peaks - a 1-copy, 2-copy, 3-copy, and 4-copy. The 4-copy will generally be the largest. If you have 60x coverage (of the haploid genome), then the peaks will be at 15, 30, 45, and 60. Presumably, if you look at the kmer histogram, this dataset should have a 3-copy peak at depth 45, but it was probably merged into the 60-depth peak.

So, generally, if you ever have the biggest peak at depth N, and have smaller peaks at N/4, 2N/4, and ideally 3N/4, you are probably looking at a tetraploid.

Oh... as for calculating the het rate, you can estimate that the genome size is 1*(volume of 1-copy peak)+2*(volume of 2-copy peak), etc. The latest version of BBMap will do this automatically and report the genome size in the header of the peaks file. That's the number of base pairs in each cell of the organism; if you want the haploid genome size, then for a tetraploid, you'd divide that number by 4. So:

(1*36855138+2*85947645+4*347426111)/4=399,613,718 - roughly 400Mbp.

To get the het rate, if you assume that all the differences between ploidy is just SNPs (a useful simplifying assumption), then since KmerCountExact uses 31-mers, each 1-copy SNP would be expected to produce 31 kmers in the 1-copy peak. Therefore, the number of single-copy SNPs would be the volume of the 1-copy peak divided by 31, which is 1188875. The number of 2-copy SNPs would be 2-copy peak over 31, or 2772504. So the total number of SNP locations is probably around 3,961,379. Therefore the het rate is something like 1/100.

**cvargasc** · 07-09-2015, 06:52 AM

Thanks a lot for your explanations, it is all clear now! And once more thank you for developing BBTools.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Heterozygous rate from "de novo" sequencing

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News