SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Genome size estimation for paired end and mate paired data mandar.bobade60 General 1 02-10-2015 11:09 PM
Genome size estimation for paired end data mandar.bobade60 General 4 11-19-2014 08:56 PM
Genome size estimation-jellyfish bioman1 Bioinformatics 3 08-18-2014 12:14 PM
Genome size estimation moinul De novo discovery 9 04-04-2014 04:22 AM
Inquiry regarding input file for genome size estimation by Jellyfish pcmount87 Bioinformatics 0 10-20-2013 08:38 AM

Reply
 
Thread Tools
Old 11-06-2015, 09:12 AM   #1
syintel87
Member
 
Location: Universe

Join Date: Dec 2012
Posts: 81
Default Genome size estimation using BBMAP

Hi all,
I have 21,005,534 x 2 paired-end genome sequencing data.
To estimate the genome size, I used BBMAP.
Is there anyone who could help to interpret the output?

The command:
Code:
./kmercountexact.sh in1=read1.fastq in2=read2.fastq khist=khist.txt peaks=peaks.txt
The 'peaks.txt' output:
Quote:
#k 31
#unique_kmers 1432946339
#main_peak 31
#genome_size 285812914
#haploid_genome_size 95270971
#fold_coverage 31
#haploid_fold_coverage 112
#ploidy 3
#het_rate 0.03413
#percent_repeat 6.291
#start center stop max volume
13 31 58 6211343 100786874
58 65 71 73206 913749
71 112 170 1196232 41304813
170 224 285 19062 1199939
285 324 325 3981 138959
325 329 396 4084 201426
396 414 443 1692 75732
443 447 525 1481 92625
525 526 535 833 7869
535 539 541 822 4724
541 543 557 797 12164
557 565 1413 725 181119
Q1.
According to this post (http://seqanswers.com/forums/archive...p/t-48375.html),
to calculate the genome size, 100786874 + 913749/2 + 41304813/3 + ... + 12164/11 + 181119/12 = 144,919,993.
However, on the 4th line of the output, the genome size is 285,812,914.
What is the real genome size?

Q2.
Also, what does each line of the output mean (e.g. unique_kmers, haploid_genome_size, haploid_fold_coverage, ploidy, het_rate, percent_repeat)?

Any advice would be appreciated.
syintel87 is offline   Reply With Quote
Old 11-06-2015, 10:38 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi!

Unfortunately, it looks like ploidy was estimated wrong; I don't think anything has a ploidy of 3. If the peak calls are correct, it's probably a highly polymorphic tetraploid, or a highly repetitive diploid or haploid. The full genome size would be approximately:

1*100786874 + 2*913749 + 4*41304813 + 8*1199939 + 10*138959 ...etc

Yielding 278822726, and so including all the peaks it would be around 285812914 as the program indicated. However the multipliers for the various peaks are hard to determine. Generally, the multiplier is the the peak's center divided by the first peak's center, so the first multiplier is 31/31=1, the second is 65/31=2. But the third is 112/31=3.6, which is hard to tell if it's 3 or 4 (they are all integers). It's probably 4.

Thus the genome is around 285Mbp all told, but it's hard to figure out the ploidy which makes it hard to figure out the haploid size. If it's a haploid then the genome is ~285MB. If it's tetraploid then you divide it by 4 and get 4 copies of a 71Mbp genome. For example, a human cell has around 6 Gbp of DNA in it, but because humans are diploid, the haploid representation of the genome is around 3 Gbp long.

The number of unique kmers is not important. Fold coverage is the amount of times you have sequenced each base, on average. Haploid genome size is the size of the genome divided by the ploidy. Haploid fold coverage is the number of times you have sequenced each base in the haploid representation, if the ploidy estimate was correct, which it isn't. Ploidy is the number of copies of each chromosome per cell. Het rate is an estimate of the fraction of the genome that is heterozygous (usually meaning a SNP); so 0.03413 indicates that 3.4% of the genome is heterozygous, meaning there is a SNP every 30 bases, which is very high. Percent repeat is the percent of the genome that occurs more than once as an exact copy; this number comes from analyzing the high-depth peaks, which come from repeat regions. However, if the ploidy estimate is off, then the het rate and percent repeat will also be off.

Usually the ploidy estimate works well for haploid and diploid organisms, but tetraploid and higher are difficult.

As for the post you pointed to, I was totally wrong about the formula, it should have been multiplication, not division. I will edit it.

Last edited by Brian Bushnell; 11-06-2015 at 10:41 AM.
Brian Bushnell is offline   Reply With Quote
Old 07-10-2017, 09:43 AM   #3
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 60
Default

Hi All (and particularly Brian Bushnell),
I am trying to estimate the genome size of my leafy spurge genome with BBMAP. The ouput says my genome size is a haploid of ~950MB. However, using a procedure described for Jellyfish kmer counts – (sum of (kmer count* the number of kmers))/the peak fold I get 4.6 Gb. The flow cytometry data we have for leafy spurge is 2.1Gb for the 1x genome. From the discussion above, it seems that the BBMAP program measures area under the curve for each peak to get it’s genome size estimate. Given the additional data from the flow cytometry results, I am inclined to believe the method described for the Jellyfish data, but I was wondering why the two methods give such different results, and if there was any reason to prefer the BBMAP method.
horvathdp is offline   Reply With Quote
Old 07-10-2017, 09:54 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

The estimate is provided for convenience, but the accuracy is not guaranteed since it has to guess the peak locations, peak limits, and ploidy of the organism. It can be somewhat more accurate if you know the ploidy, in which case you can tell it the ploidy, but it still has to guess which peak is the homozygous peak, etc.

I find that I can usually do a better job manually by eyeballing the peaks (in conjunction with the exact peak volumes from peaks.txt). Can you post an image of the graph, and the contents of peaks.txt?
Brian Bushnell is offline   Reply With Quote
Old 07-10-2017, 12:48 PM   #5
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 60
Default

Thanks!!!

Here is a link to the data:
https://drive.google.com/file/d/0Byp...ew?usp=sharing. I didn't know I could define the ploidy level. Can you share a hypothetical script with that option indicated?
horvathdp is offline   Reply With Quote
Old 07-10-2017, 03:21 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Visually, that looks like a very obvious haploid to me, with median coverage at ~16x and genome size of ~700Mbp for the unique stuff (using SUM(B8:B28) from your spreadsheet) and a couple hundred Mbp for repetitive stuff (sum of e.g. B29*$A$18/A31 for subsequent lines), so I'd say that KmerCountExact's result is correct for both ploidy and genome size.

You can specify the ploidy by adding the flag "ploidy=1" or similar:

kmercountexact.sh in=reads.fq poloidy=1 khist=khist.txt peaks=peaks.txt

Can you describe the Jellyfish procedure in more detail, or link to it? Also, what kmer length are you using, how long are your reads, and what do you expect for the genome ploidy? Sometimes you can't tell apart a haploid from a diploid that has an extremely high het rate (SNPs are denser than kmer length) but I think that's uncommon in most organisms, and anyway, that would make the genome look bigger. It's also possible that you have a diploid with an extremely low het rate. Could it be that you have a super-inbred sample? If the organism has no (or very few) heterozygous sites, it would be impossible to tell a 1.9Gbp diploid (1.9Gbp per nucleus, meaning 950Mbp haploid size) from a 950Mbp haploid based on the kmer frequency histogram; they'd look identical.

Last edited by Brian Bushnell; 07-10-2017 at 03:27 PM.
Brian Bushnell is offline   Reply With Quote
Old 07-12-2017, 06:15 AM   #7
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 60
Default

Interesting. The genome size as estimated by flow cytometry was 2.1 Gb. The organism has been reasonably well studied cytogenetically (it is one of the worst invasive weeds in the great plains). It is most definitely a hexaploid - though the data suggests that the duplication of one of the genomes was very recent and the third genome is from a very closely related species. I ran a SNP analyses, and granted that although I was being very conservative, I only identified about 150K SNPs after mapping the fragments back to the assembly. Thus my first thought was that there was almost no heterozygosity, and the three genomes were so closely related that the kmers were collapsing the sequence. The Jellyfish method as I understand it is described here: http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html

Essentially you just sum the non-error kmers and divide by the coverage.

The Jellyfish run was done with a kmer of 21, and the BBMAP with the default of 31(?). However, I am about to rerun the BBMAP using the same Kmer size (and with the ploidy level specified). It doesn't take long, so I will also run at a number of different Kmer sizes just to see what that does to the curve.

I found that the genome size estimates from BBMAP were between 840 Mb (K=21, ploidy=6) to 1.1Gb (default).

I guess my primary question is: Which method ((kmer count)/coverage ) or BBMAP) is the best for estimating genome size and why?
horvathdp is offline   Reply With Quote
Old 07-12-2017, 10:00 AM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

If that's a hexaploid, then presuming the primary peak is the 6x homogeneous peak, the coverage is too low to see the haploid heterozygous peak, which would be at around 2.5x and indistinguishable from error kmers. Which makes it difficult to estimate the genome size. That said, if it's a hexaploid, and again presuming the primary peak is the 6x peak, that would indicate the total genome size (total number of DNA bp within each cell) is actually something like 6*950Mbp = 5.7Gbp which is much higher than your flow cytometry estimate.

For a haploid or diploid, with sufficient coverage, kmer-based genome size estimation is extremely accurate. When the coverage is insufficient, or the ploidy becomes pretty high, or the peaks are indistinct, it's less accurate. But in this case I can't really resolve the discrepancies... if it were me, I'd try to unbundle the DNA and look at it with an electron microscope and get a pretty picture like this:



Then it would be easy!

But, more seriously -

BBMap's KmerCountExact uses a default kmer length of 31. You can use longer or shorter kmers. Normally, the longer the kmer, the bigger the genome size estimate, since fewer short repeats are collapsed (collapsed short repeats often form a distinct peak which for genome size calculation will be multiplied by their multiplicity, but in this dataset there is only 1 peak).

I guess the best thing I can do is just describe exactly how peaks.txt is generated. The kmer-frequency histogram is pretty straightforward. The choice of K is driven mainly by the length and error rate of your reads - given sufficient coverage, error-free reads, and perfectly uniform coverage distribution, the genome size estimate should theoretically be independent of K (though in practice it's obviously not). For low coverage as in this case a shorter K may be useful.

The peaks file is generated by walking through the histogram and finding local maxima, then delimiting them by adjacent local minima. The volume is the sum of all unique kmers within that range. Once the peaks are defined (and in this case, there is only 1 clear peak... peaks.txt will probably report more peaks, but they are likely spurious and very small) the largest one is assumed to be the primary (homozygous) peak, though it depends on whether you set the ploidy flag. Assuming you don't set the ploidy flag, any peaks lower than the primary are assumed to be heterozygous peaks. The ploidy is then calculated by the ratio of the primary to the lowest peak - e.g., if the primary is at 60x and the lowest is 10x, the ploidy would be calculated as 6. Peaks higher than the primary are considered as replications. So, if the primary is at 60x and there is a smaller peak at 120x, and another at 180x, those would be considered 2x and 3x copy repeats. After 4x or so it all blurs together and the peaks cannot clearly be resolved.

To calculate the genome size for a hexaploid, it uses:

1*(volume of the peak at 1/6 of the coverage of the primary)
+2*(volume of the peak at 2/6 of the coverage of the primary)
+3*(volume of the peak at 3/6 of the coverage of the primary)
+4*(volume of the peak at 4/6 of the coverage of the primary)
+5*(volume of the peak at 5/6 of the coverage of the primary)
+6*(volume of the primary peak)
+X*(volume of any higher peak that is X/6 of the coverage of the primary)

That gives the total DNA in a cell. The haploid size would be that number divided by the ploidy. But in this case there is only 1 peak so the formula is not very useful. Also note that in practice, for a tetraploid with sufficient coverage (I have little experience with hexaploids), I typically see a peak at 1/4, 2/4, and 4/4, but never at 3/4. I'm not sure why; it should exist and be of equal magnitude to the 1/4 peak, assuming all variation is caused by SNPs. I guess this indicates insertion events.

Last edited by Brian Bushnell; 07-12-2017 at 10:05 AM.
Brian Bushnell is offline   Reply With Quote
Old 07-12-2017, 10:15 AM   #9
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 60
Default

Your patience and kindness (not to mention the detailed response) is quite appreciated. Given all the assistance you have been over the years on this project, I feel I should list you as an author :-).
horvathdp is offline   Reply With Quote
Old 07-12-2017, 10:43 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I would appreciate it, but please don't feel obligated. As long as I'm cited, I'll be happy
Brian Bushnell is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:33 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO