
  • Genome size estimation-jellyfish

    I came across the paper titled "The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes" (http://www.nature.com/ncomms/2014/14...ncomms4930.pdf).

    In this paper they used Jellyfish to estimate genome size. Could someone explain how to estimate genome size using Jellyfish?

    Normally, a Jellyfish histogram for 17-mers gives multiplicity (x-axis) against the number of distinct k-mers with that multiplicity (y-axis),
    as shown in https://banana-slug.soe.ucsc.edu/bio...ools:jellyfish.

    But in this paper they plotted Depth (x-axis) against Frequency % (y-axis) to estimate the genome size of Brassica using 17-mers (attached image link below).




    Please let me know how to calculate it.

    For example, given the Jellyfish output below, how do I convert multiplicity to depth and the number of distinct k-mers to frequency (%)?

    Multiplicity    No. of distinct k-mers
    1               677679866
    2               243735232
    3               148239594
    4               161663928
    Last edited by bioman1; 08-14-2014, 08:41 PM.

  • #2
    Assume a haploid genome, for simplicity. In the picture provided, the first peak at depth ~31 indicates the amount of 1-copy content (in other words, the genome has exactly 1 copy of each of those kmers, so they are unique). The weak peak at ~62x indicates the amount of 2-copy content. Everything under ~11x can be assumed to be error kmers, unrelated to genome size.

    So, to estimate manually, take the sum of the counts of distinct kmers under the first peak and multiply by 1; add the sum of the counts of distinct kmers under the peak at 2x the depth of the first peak, multiplied by 2; and so on for all peaks. Under the haploid assumption, the total is the genome size. If your genome is actually tetraploid, the true haploid size will be 1/4 of your result, since the first peak will then correspond to mutations present on only 1 of the 4 copies (1/0/0/0 genotype).
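    A minimal Python sketch of the manual estimate described above. The function name, the (depth, count) input format, and the rule of assigning each depth to the nearest multiple of the first-peak depth are assumptions made for illustration; the description above only says to sum the counts under each peak. The toy histogram is made up, not real data.

```python
def estimate_genome_size(hist, peak_depth, error_cutoff):
    """Estimate haploid genome size from a kmer-depth histogram.

    hist:         iterable of (depth, number_of_distinct_kmers) pairs
    peak_depth:   depth of the 1-copy peak, read off the plot by eye
    error_cutoff: depths below this are treated as error kmers and ignored
    """
    total = 0
    for depth, count in hist:
        if depth < error_cutoff:
            continue  # assumed sequencing-error kmers, unrelated to genome size
        # Assumption: a depth belongs to the peak whose multiple of
        # peak_depth is nearest (1-copy, 2-copy, 3-copy, ...).
        copies = max(1, round(depth / peak_depth))
        total += copies * count
    return total


# Made-up toy histogram: a 1-copy peak around depth 31, a weak 2-copy peak at 62.
toy_hist = [(1, 1000), (5, 200), (30, 50), (31, 100), (32, 50), (62, 10)]
print(estimate_genome_size(toy_hist, peak_depth=31, error_cutoff=11))  # 220
```

    The peak depth and the error cutoff still have to be chosen by looking at the histogram, so the result is as subjective as the peak calling itself.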

    You can make this more accurate by modelling the peaks as a sum of Gaussian curves, but that probably won't change the result much. Of course, this method is subjective because calling peaks is subjective.

    Please note - I think 17-mers are too short for this kind of analysis. I prefer 31-mers because they are the longest computationally-efficient kmers. Also, FYI, BBNorm is faster than Jellyfish and can also generate kmer-frequency histograms:

    khist.sh in=reads.fq hist=khist.txt

    Also, it makes more sense to plot these things as log-log rather than linear-linear; and the Y-axis should be count, not frequency, which is useless for the purpose of genome-size estimation.
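    For anyone who still wants to reproduce the paper's axes from Jellyfish output, here is a rough Python sketch. It assumes that "Depth" is simply the multiplicity column and that "Frequency (%)" means each row's share of all distinct kmers in the histogram; that reading of the figure is an assumption, and the paper may define frequency differently (for example, weighted by depth). The function name is hypothetical.

```python
def to_frequency_percent(hist):
    """Convert (depth, distinct_kmer_count) rows to (depth, percent) rows.

    Assumption: frequency (%) = count at this depth / total distinct kmers * 100.
    """
    total = sum(count for _, count in hist)
    return [(depth, 100.0 * count / total) for depth, count in hist]


# The four rows quoted in the question. A real histogram has many more rows,
# so these percentages are relative to these four rows only.
rows = [(1, 677679866), (2, 243735232), (3, 148239594), (4, 161663928)]
for depth, pct in to_frequency_percent(rows):
    print(depth, round(pct, 2))
```

    Note that the percentages lose the absolute counts, which is why they cannot be used on their own for the genome-size sum.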
    Last edited by Brian Bushnell; 08-15-2014, 05:18 PM.



    • #3
      Originally posted by Brian Bushnell:
      You can make this more accurate by modelling the peaks as a sum of Gaussian curves, but that probably won't change the result much. Of course, this method is subjective because calling peaks is subjective.

      Please note - I think 17-mers are too short for this kind of analysis. I prefer 31-mers because they are the longest computationally-efficient kmers.
      What level of genome coverage do you find is necessary for this approach?



      • #4
        Originally posted by SES:
        What level of genome coverage do you find is necessary for this approach?
        Enough to see the peaks clearly. This really depends on the evenness of the coverage (which affects the broadness of the peaks) and the error rate (which makes the 1-copy peak merge with the error kmers). It works fine at 30x with normal Illumina libraries, though the higher-order peaks start to run together and are hard to make out precisely after maybe 6-copy repeat content. The more coverage, the easier it is to resolve the repeat peaks, but normally the high-order ones don't constitute much of the genome anyway.

        I can't give a strict lower bound on coverage for this estimation technique, but I would expect it to work well with as little as 15x of normal Illumina data, though at 100x the estimate will be more accurate.

