
  • Genome size estimation-jellyfish

    I came across the paper "The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes" (http://www.nature.com/ncomms/2014/14...ncomms4930.pdf).

    In this paper they used Jellyfish to estimate genome size. Please help me understand how to estimate genome size using Jellyfish.

    Normally Jellyfish outputs a 17-mer histogram, with multiplicity on the x-axis and the number of distinct k-mers at each multiplicity on the y-axis,
    as shown in https://banana-slug.soe.ucsc.edu/bio...ools:jellyfish.

    But in this paper they plotted depth (x-axis) against frequency % (y-axis) to estimate the genome size of Brassica from 17-mers (attached image link below).




    Please let me know how to calculate it.

    For example, given the Jellyfish output below, how do I convert multiplicity to depth and the number of distinct k-mers to frequency (%)?

    Multiplicity    No. of distinct k-mers
    1               677679866
    2               243735232
    3               148239594
    4               161663928
    Last edited by bioman1; 08-14-2014, 08:41 PM.

  • #2
    Assume a haploid genome, for simplicity. In the picture provided, the first peak at ~31x depth indicates the amount of 1-copy content (in other words, the genome has exactly 1 copy of that kmer, so it is unique). The weak peak at ~62x indicates the amount of 2-copy content. Everything under ~11x can be assumed to be error kmers, unrelated to genome size.

    So, to estimate manually: take the sum of the counts of unique kmers under the first peak and multiply by 1; add 2 times the sum of the counts of unique kmers under the peak at 2x the depth of the first peak; and so on for all peaks. This will give you the haploid genome size. So if your genome is tetraploid, the actual size will be 1/4 of your result, since the first peak corresponds to variants present on only one of the four haplotypes (a 1/0/0/0 genotype).
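
    For concreteness, here is a rough Python sketch of that manual calculation (not from the paper; the histogram file name and the peak boundaries below are placeholders that you would pick by eye from your own plot):

    # Manual genome-size estimate from a k-mer histogram, as described above.
    # Assumes a whitespace-separated file with two columns: depth, and the
    # number of distinct k-mers at that depth (e.g. the output of "jellyfish histo").

    def load_histogram(path):
        hist = {}
        with open(path) as fh:
            for line in fh:
                if line.startswith("#"):
                    continue                      # skip header lines, if any
                depth, count = line.split()[:2]
                hist[int(depth)] = int(float(count))
        return hist

    def genome_size(hist, peaks):
        # peaks: list of (low_depth, high_depth, copy_number) tuples chosen by eye.
        total = 0
        for low, high, copies in peaks:
            total += copies * sum(c for d, c in hist.items() if low <= d <= high)
        return total

    hist = load_histogram("histo.txt")            # placeholder file name
    # Hypothetical boundaries for a plot like the one above: error k-mers below
    # ~11x, 1-copy peak centered near 31x, 2-copy peak centered near 62x.
    peaks = [(11, 46, 1), (47, 93, 2)]
    print("Estimated haploid genome size:", genome_size(hist, peaks))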

    You can make this more accurate by modelling the peaks as a sum of Gaussian curves, but that probably won't change the result much. Of course, this method is subjective because calling peaks is subjective.
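
    As an illustration of the Gaussian idea, here is a sketch assuming SciPy is available and that only the 1-copy and 2-copy peaks matter (the initial guesses and the error cutoff are hypothetical, read off a plot like the one above):

    # Fit the histogram with a sum of two Gaussians and use the area under each
    # curve as the number of distinct k-mers in the corresponding peak.
    import numpy as np
    from scipy.optimize import curve_fit

    def two_gaussians(x, a1, mu1, s1, a2, mu2, s2):
        return (a1 * np.exp(-(x - mu1) ** 2 / (2 * s1 ** 2))
                + a2 * np.exp(-(x - mu2) ** 2 / (2 * s2 ** 2)))

    data = np.loadtxt("histo.txt")                # placeholder two-column histogram
    depths, counts = data[:, 0], data[:, 1]
    keep = depths > 11                            # discard error k-mers (hypothetical cutoff)
    p0 = [counts[keep].max(), 31, 5, counts[keep].max() / 10, 62, 8]
    (a1, mu1, s1, a2, mu2, s2), _ = curve_fit(two_gaussians, depths[keep], counts[keep], p0=p0)
    # Area of a Gaussian = amplitude * sigma * sqrt(2*pi); weight by copy number.
    size = np.sqrt(2 * np.pi) * (a1 * abs(s1) * 1 + a2 * abs(s2) * 2)
    print("Gaussian-fit estimate of haploid genome size:", int(size))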

    Please note - I think 17-mers are too short for this kind of analysis. I prefer 31-mers because they are the longest computationally-efficient kmers. Also, FYI, BBNorm is faster than Jellyfish and can also generate kmer-frequency histograms:

    khist.sh in=reads.fq hist=khist.txt

    Also, it makes more sense to plot these things as log-log rather than linear-linear; and the Y-axis should be count, not frequency, which is useless for the purpose of genome-size estimation.
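
    A quick way to make such a log-log plot (a sketch assuming matplotlib and a two-column depth/count histogram like the khist.txt above; adjust the column indices to your tool's actual output format):

    # Log-log plot of a k-mer histogram: depth on x, distinct k-mer count on y.
    import numpy as np
    import matplotlib.pyplot as plt

    data = np.loadtxt("khist.txt")                # lines starting with '#' are skipped
    plt.loglog(data[:, 0], data[:, 1])
    plt.xlabel("k-mer depth")
    plt.ylabel("number of distinct k-mers (count)")
    plt.savefig("khist.png")
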
    Last edited by Brian Bushnell; 08-15-2014, 05:18 PM.



    • #3
      Originally posted by Brian Bushnell
      You can make this more accurate by modelling the peaks as a sum of Gaussian curves, but that probably won't change the result much. Of course, this method is subjective because calling peaks is subjective.

      Please note - I think 17-mers are too short for this kind of analysis. I prefer 31-mers because they are the longest computationally-efficient kmers.
      What level of genome coverage do you find is necessary for this approach?



      • #4
        Originally posted by SES
        What level of genome coverage do you find is necessary for this approach?
        Enough to see the peaks clearly. This really depends on the evenness of the coverage (which affects the broadness of the peaks) and the error rate (which makes the 1-copy peak merge with the error kmers). It works fine at 30x with normal Illumina libraries, though the higher-order peaks start to run together and are hard to make out precisely beyond maybe 6-copy repeat content - the more coverage, the easier it is to resolve the repeat peaks, but normally the high-order ones don't constitute much of the genome anyway.

        I can't give a strict lower bound on coverage for this estimation technique, but I would expect it to work well with as little as 15x of normal Illumina data, though at 100x the estimate will be more accurate.

