Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Genome size estimation using BBMAP

    Hi all,
    I have 21,005,534 x 2 paired-end genome sequencing data.
    To estimate the genome size, I used BBMAP.
    Is there anyone who could help to interpret the output?

    The command:
    Code:
    ./kmercountexact.sh in1=read1.fastq in2=read2.fastq khist=khist.txt peaks=peaks.txt
    The 'peaks.txt' output:
    #k 31
    #unique_kmers 1432946339
    #main_peak 31
    #genome_size 285812914
    #haploid_genome_size 95270971
    #fold_coverage 31
    #haploid_fold_coverage 112
    #ploidy 3
    #het_rate 0.03413
    #percent_repeat 6.291
    #start center stop max volume
    13 31 58 6211343 100786874
    58 65 71 73206 913749
    71 112 170 1196232 41304813
    170 224 285 19062 1199939
    285 324 325 3981 138959
    325 329 396 4084 201426
    396 414 443 1692 75732
    443 447 525 1481 92625
    525 526 535 833 7869
    535 539 541 822 4724
    541 543 557 797 12164
    557 565 1413 725 181119
    Q1.
    According to this post (http://seqanswers.com/forums/archive...p/t-48375.html),
    to calculate the genome size, 100786874 + 913749/2 + 41304813/3 + ... + 12164/11 + 181119/12 = 144,919,993.
    However, on the 4th line of the output, the genome size is 285,812,914.
    What is the real genome size?

    Q2.
    Also, what does each line of the output mean (e.g. unique_kmers, haploid_genome_size, haploid_fold_coverage, ploidy, het_rate, percent_repeat)?

    Any advice would be appreciated.

  • #2
    Hi!

    Unfortunately, it looks like ploidy was estimated wrong; I don't think anything has a ploidy of 3. If the peak calls are correct, it's probably a highly polymorphic tetraploid, or a highly repetitive diploid or haploid. The full genome size would be approximately:

    1*100786874 + 2*913749 + 4*41304813 + 8*1199939 + 10*138959 ...etc

    Yielding 278822726, and so including all the peaks it would be around 285812914 as the program indicated. However the multipliers for the various peaks are hard to determine. Generally, the multiplier is the the peak's center divided by the first peak's center, so the first multiplier is 31/31=1, the second is 65/31=2. But the third is 112/31=3.6, which is hard to tell if it's 3 or 4 (they are all integers). It's probably 4.

    Thus the genome is around 285Mbp all told, but it's hard to figure out the ploidy which makes it hard to figure out the haploid size. If it's a haploid then the genome is ~285MB. If it's tetraploid then you divide it by 4 and get 4 copies of a 71Mbp genome. For example, a human cell has around 6 Gbp of DNA in it, but because humans are diploid, the haploid representation of the genome is around 3 Gbp long.

    The number of unique kmers is not important. Fold coverage is the amount of times you have sequenced each base, on average. Haploid genome size is the size of the genome divided by the ploidy. Haploid fold coverage is the number of times you have sequenced each base in the haploid representation, if the ploidy estimate was correct, which it isn't. Ploidy is the number of copies of each chromosome per cell. Het rate is an estimate of the fraction of the genome that is heterozygous (usually meaning a SNP); so 0.03413 indicates that 3.4% of the genome is heterozygous, meaning there is a SNP every 30 bases, which is very high. Percent repeat is the percent of the genome that occurs more than once as an exact copy; this number comes from analyzing the high-depth peaks, which come from repeat regions. However, if the ploidy estimate is off, then the het rate and percent repeat will also be off.

    Usually the ploidy estimate works well for haploid and diploid organisms, but tetraploid and higher are difficult.

    As for the post you pointed to, I was totally wrong about the formula, it should have been multiplication, not division. I will edit it.
    Last edited by Brian Bushnell; 11-06-2015, 10:41 AM.

    Comment


    • #3
      Hi All (and particularly Brian Bushnell),
      I am trying to estimate the genome size of my leafy spurge genome with BBMAP. The ouput says my genome size is a haploid of ~950MB. However, using a procedure described for Jellyfish kmer counts – (sum of (kmer count* the number of kmers))/the peak fold I get 4.6 Gb. The flow cytometry data we have for leafy spurge is 2.1Gb for the 1x genome. From the discussion above, it seems that the BBMAP program measures area under the curve for each peak to get it’s genome size estimate. Given the additional data from the flow cytometry results, I am inclined to believe the method described for the Jellyfish data, but I was wondering why the two methods give such different results, and if there was any reason to prefer the BBMAP method.

      Comment


      • #4
        The estimate is provided for convenience, but the accuracy is not guaranteed since it has to guess the peak locations, peak limits, and ploidy of the organism. It can be somewhat more accurate if you know the ploidy, in which case you can tell it the ploidy, but it still has to guess which peak is the homozygous peak, etc.

        I find that I can usually do a better job manually by eyeballing the peaks (in conjunction with the exact peak volumes from peaks.txt). Can you post an image of the graph, and the contents of peaks.txt?

        Comment


        • #5
          Thanks!!!

          Here is a link to the data:
          https://drive.google.com/file/d/0Byp...ew?usp=sharing. I didn't know I could define the ploidy level. Can you share a hypothetical script with that option indicated?

          Comment


          • #6
            Visually, that looks like a very obvious haploid to me, with median coverage at ~16x and genome size of ~700Mbp for the unique stuff (using SUM(B8:B28) from your spreadsheet) and a couple hundred Mbp for repetitive stuff (sum of e.g. B29*$A$18/A31 for subsequent lines), so I'd say that KmerCountExact's result is correct for both ploidy and genome size.

            You can specify the ploidy by adding the flag "ploidy=1" or similar:

            kmercountexact.sh in=reads.fq poloidy=1 khist=khist.txt peaks=peaks.txt

            Can you describe the Jellyfish procedure in more detail, or link to it? Also, what kmer length are you using, how long are your reads, and what do you expect for the genome ploidy? Sometimes you can't tell apart a haploid from a diploid that has an extremely high het rate (SNPs are denser than kmer length) but I think that's uncommon in most organisms, and anyway, that would make the genome look bigger. It's also possible that you have a diploid with an extremely low het rate. Could it be that you have a super-inbred sample? If the organism has no (or very few) heterozygous sites, it would be impossible to tell a 1.9Gbp diploid (1.9Gbp per nucleus, meaning 950Mbp haploid size) from a 950Mbp haploid based on the kmer frequency histogram; they'd look identical.
            Last edited by Brian Bushnell; 07-10-2017, 02:27 PM.

            Comment


            • #7
              Interesting. The genome size as estimated by flow cytometry was 2.1 Gb. The organism has been reasonably well studied cytogenetically (it is one of the worst invasive weeds in the great plains). It is most definitely a hexaploid - though the data suggests that the duplication of one of the genomes was very recent and the third genome is from a very closely related species. I ran a SNP analyses, and granted that although I was being very conservative, I only identified about 150K SNPs after mapping the fragments back to the assembly. Thus my first thought was that there was almost no heterozygosity, and the three genomes were so closely related that the kmers were collapsing the sequence. The Jellyfish method as I understand it is described here: http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html

              Essentially you just sum the non-error kmers and divide by the coverage.

              The Jellyfish run was done with a kmer of 21, and the BBMAP with the default of 31(?). However, I am about to rerun the BBMAP using the same Kmer size (and with the ploidy level specified). It doesn't take long, so I will also run at a number of different Kmer sizes just to see what that does to the curve.

              I found that the genome size estimates from BBMAP were between 840 Mb (K=21, ploidy=6) to 1.1Gb (default).

              I guess my primary question is: Which method ((kmer count)/coverage ) or BBMAP) is the best for estimating genome size and why?

              Comment


              • #8
                If that's a hexaploid, then presuming the primary peak is the 6x homogeneous peak, the coverage is too low to see the haploid heterozygous peak, which would be at around 2.5x and indistinguishable from error kmers. Which makes it difficult to estimate the genome size. That said, if it's a hexaploid, and again presuming the primary peak is the 6x peak, that would indicate the total genome size (total number of DNA bp within each cell) is actually something like 6*950Mbp = 5.7Gbp which is much higher than your flow cytometry estimate.

                For a haploid or diploid, with sufficient coverage, kmer-based genome size estimation is extremely accurate. When the coverage is insufficient, or the ploidy becomes pretty high, or the peaks are indistinct, it's less accurate. But in this case I can't really resolve the discrepancies... if it were me, I'd try to unbundle the DNA and look at it with an electron microscope and get a pretty picture like this:



                Then it would be easy!

                But, more seriously -

                BBMap's KmerCountExact uses a default kmer length of 31. You can use longer or shorter kmers. Normally, the longer the kmer, the bigger the genome size estimate, since fewer short repeats are collapsed (collapsed short repeats often form a distinct peak which for genome size calculation will be multiplied by their multiplicity, but in this dataset there is only 1 peak).

                I guess the best thing I can do is just describe exactly how peaks.txt is generated. The kmer-frequency histogram is pretty straightforward. The choice of K is driven mainly by the length and error rate of your reads - given sufficient coverage, error-free reads, and perfectly uniform coverage distribution, the genome size estimate should theoretically be independent of K (though in practice it's obviously not). For low coverage as in this case a shorter K may be useful.

                The peaks file is generated by walking through the histogram and finding local maxima, then delimiting them by adjacent local minima. The volume is the sum of all unique kmers within that range. Once the peaks are defined (and in this case, there is only 1 clear peak... peaks.txt will probably report more peaks, but they are likely spurious and very small) the largest one is assumed to be the primary (homozygous) peak, though it depends on whether you set the ploidy flag. Assuming you don't set the ploidy flag, any peaks lower than the primary are assumed to be heterozygous peaks. The ploidy is then calculated by the ratio of the primary to the lowest peak - e.g., if the primary is at 60x and the lowest is 10x, the ploidy would be calculated as 6. Peaks higher than the primary are considered as replications. So, if the primary is at 60x and there is a smaller peak at 120x, and another at 180x, those would be considered 2x and 3x copy repeats. After 4x or so it all blurs together and the peaks cannot clearly be resolved.

                To calculate the genome size for a hexaploid, it uses:

                1*(volume of the peak at 1/6 of the coverage of the primary)
                +2*(volume of the peak at 2/6 of the coverage of the primary)
                +3*(volume of the peak at 3/6 of the coverage of the primary)
                +4*(volume of the peak at 4/6 of the coverage of the primary)
                +5*(volume of the peak at 5/6 of the coverage of the primary)
                +6*(volume of the primary peak)
                +X*(volume of any higher peak that is X/6 of the coverage of the primary)

                That gives the total DNA in a cell. The haploid size would be that number divided by the ploidy. But in this case there is only 1 peak so the formula is not very useful. Also note that in practice, for a tetraploid with sufficient coverage (I have little experience with hexaploids), I typically see a peak at 1/4, 2/4, and 4/4, but never at 3/4. I'm not sure why; it should exist and be of equal magnitude to the 1/4 peak, assuming all variation is caused by SNPs. I guess this indicates insertion events.
                Last edited by Brian Bushnell; 07-12-2017, 09:05 AM.

                Comment


                • #9
                  Your patience and kindness (not to mention the detailed response) is quite appreciated. Given all the assistance you have been over the years on this project, I feel I should list you as an author :-).

                  Comment


                  • #10
                    I would appreciate it, but please don't feel obligated. As long as I'm cited, I'll be happy

                    Comment


                    • #11
                      Dear all and Brian Bushnell,

                      I'm using kmercountexact.sh to estimate the genome size of an insect. I tried 17 mer and 31 mer, but got different results. After reading all the discussions above, I still don't know how to do (perhaps due to my poor English...). Here are the codes and results:

                      1. k=17
                      Code:
                      kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq k=17 khist=270.his out=270 peaks=270.peaks
                      and the 270.peaks:
                      HTML Code:
                      #k	17
                      #unique_kmers	502145545
                      #main_peak	47
                      #genome_size	574333111
                      #haploid_genome_size	574333111
                      #fold_coverage	24
                      #haploid_fold_coverage	24
                      #ploidy	1
                      #percent_repeat	86.388
                      #start	center	stop	max	volume
                      10	24	32	4706167	78178249	
                      32	47	118	5933924	195017237	
                      1097	1103	1256	2449	337060	
                      1256	1259	1300	1819	77545	
                      1300	1303	1368	1756	108489	
                      1368	1372	1374	1548	8989	
                      1374	1376	1391	1491	24859	
                      1391	1394	1421	1487	42110	
                      1421	1423	1435	1424	18973	
                      1435	1443	1498	1413	80666	
                      1498	1500	1517	1188	22777	
                      1517	1543	3853	1167	1077482
                      2. k=31
                      Code:
                      kmercountexact.sh in=270B.0.fastq in2=270B.1.fastq khist=270.31mer.his out=270.31mer peaks=270.31mer.peaks
                      and the 270.31mer.peaks:
                      HTML Code:
                      #k	31
                      #unique_kmers	877006010
                      #main_peak	41
                      #genome_size	846756437
                      #haploid_genome_size	423378218
                      #fold_coverage	20
                      #haploid_fold_coverage	41
                      #ploidy	2
                      #het_rate	0.01748
                      #percent_repeat	13.342
                      #start	center	stop	max	volume
                      8	20	32	12998950	229457745	
                      32	41	103	10571247	252163197	
                      702	705	748	4040	175332	
                      748	754	971	3523	588862	
                      971	973	1002	2005	59786	
                      1002	1007	1064	1870	108141	
                      1064	1065	1121	1635	89229	
                      1121	1123	1147	1515	37584	
                      1147	1154	1189	1392	56941	
                      1189	1198	1214	1327	32072	
                      1214	1216	1295	1267	95219	
                      1295	1304	3260	1131	896204
                      As you can see, the genome size are very different and ploidy are different. The insect is diploid, and it seems that the ploidy predicted is more accurate when using 31 mer.

                      I've also tried the Jellyfish way (http://koke.asrc.kanazawa-u.ac.jp/HO...enomesize.html) using 17 mer and got 707955101 bp. But I'm little doublt, is it a haploid size or total size?

                      Any suggestions would be appreciated. I'm totally stucked.

                      Comment


                      • #12
                        I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene. There would be a greater chance for two 31mers -each from a different allelic copies genes to show differences that could distinguish between the fragments. I would guess that Jellyfish might have done a better job of distinguishing differences between alleles, but it was still working with a fairly small kmer size. I'd bet that increasing the kmer size in Jellyfish will give a closer genome size estimate obtained from the 31kmer BBMAP run. I had a similar issue with my organism (a purported hexaploid weed), where the 3 genomes are so similar and had such little heterozygosity that even with a kmer of 35, the genomes all collapsed and I got a genome size estimate that was a bit over a third of what we got for a genome size for with flow cytometry, and it also came out as a haploid.

                        Comment


                        • #13
                          Originally posted by horvathdp View Post
                          I am not an expert, but I would bet that when you used a kmer of 17, that it collapsed a bit of the genome because the program was not able to easily distinguish between kmers coming from 2 different alleles of the same gene.
                          I concur; 17 is really too short for this purpose. When trying to estimate genome size, it's important for most of the kmers to be unique (aside from long perfect repeats); so, kmer lengths greater than 20 are best. I normally choose 31 because it is the longest kmer that can be represented in a memory-efficient way.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Advancing Precision Medicine for Rare Diseases in Children
                            by seqadmin




                            Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                            12-16-2024, 07:57 AM
                          • seqadmin
                            Recent Advances in Sequencing Technologies
                            by seqadmin



                            Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                            Long-Read Sequencing
                            Long-read sequencing has seen remarkable advancements,...
                            12-02-2024, 01:49 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 12-17-2024, 10:28 AM
                          0 responses
                          33 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-13-2024, 08:24 AM
                          0 responses
                          48 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-12-2024, 07:41 AM
                          0 responses
                          34 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-11-2024, 07:45 AM
                          0 responses
                          46 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X