Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by aaronrjex View Post
    Hi yanij

    I don't know how useful this will be to you give the time since your post, but just in case....

    The BGI method is based around the observation that the coverage achieved for a genome is based on the size of the genome and the total amount of sequence data generated. So if you sequence 100 Mb of data for a 10 Mb genome, you should get ~10-fold coverage.

    Or as a simple equation: depth of coverage = total data / genome length.

    If you have any two of these parameters (i.e., you know the amount of data you generated and you know the genome size) obviously you can calculate the third.

    Usually when doing de novo genome sequencing you don't know the genome size, and since you don't have the genome, you don't know the coverage, but you do know how much data you've generated (i.e., the 'total sequence length' to use BGI's term). To estimate the genome size, you then need to estimate the coverage depth (N).

    To do this, you can calculate the kmer frequency within your read data (most people will do this for one of their small insert libraries for which they have the most information). Meaning you chop all of the reads you've generated up in to kmers (a kmer of 17 is the most common, as it is long enough to yield fairly specific sequences (meaning that its unlikely the kmer is repeated throughout the genome by chance), but short enough to give you lots of data). You then count the frequency with which each 17-mer represented by your data is found among all of the reads generated and create a frequency histogram of this information. For non-repetitive regions of the genome, this histogram should be normally distributed around a single peak (although in real data you will have a asymptote near 1 because of rare sequencing errors etc). This peak value (or peak depth) is the mean kmer coverage for your data.

    You can relate this value to the actual coverage of your genome using the formula M = N * (L – K + 1) / L, where M is the mean kmer coverage, N is the actual coverage of the genome, L is the mean read length and k is the kmer size.

    L -k +1 gives you the number of kmers created per read.

    So basically what the formula says is the kmer coverage for a genome is equal to the mean read coverage * the number of kmers per read divided by the read length.

    Because you know L (your mean read length) and k (the kmer you used to estimate peak kmer coverage) and you've calculated M (soapdenovo comes with a script called kmerfreq that will this), you simply solve the equation for N as:

    N = M/((L-k+1)/L) and calculate N.

    Once you have that, divide your total sequence data by N and you have your genome estimate.

    Hope that helps.
    This is a very helpful description of BGI's methods but I have one question. How do you determine M, what you call the mean k-mer coverage, from the k-mer frequency histogram produced by soapdenovo's pregraph program (the file with the *.kmerFreq extension), or with any other method? The first post in this thread quotes the methods from the Panda genome paper as saying M is the peak k-mer coverage (I guess on a normal distribution), and that is what is challenging to calculate.

    Comment


    • #17
      Hello folks,

      I am trying to understand the genome size estimation method, and here is the part that is not clear to me.

      BGI is estimating coverage depth by peak of frequency histogram, but in highly repetitive genomes the peak shifts down. E.g. check the second chart here, where we generated simulated reads from sea urchin genome at 50X coverage, but the histogram peak came at ~40 due to repeats.



      If that is true, how does BGI estimate genome size so precisely? In later section of bat paper, they claimed that the assembly size is close to the 'estimated size', which is puzzling given the impreciseness in genome size estimation.

      Maybe I am missing something. Please help !!

      ---------------

      Edit. I am rerunning the above simulation to make sure everything is done correctly. Results will be reported here.
      Last edited by samanta; 04-10-2013, 04:59 PM.
      http://homolog.us

      Comment


      • #18
        Lizhenyu from BGI explained where I made the error in thinking. When a read has 100 nucleotides and is split into 21-mers, the read will produce only 80 k-mers, not 100. Here is his full response, which agrees with aaronrjex's post above.



        "Hi,

        I think you mixed the concepts of base coverage depth and kmer coverage depth.
        When you refered 50X genome coverage, it meant base coverage which is obtained by

        total_base_num/genome_size=(read_num*read_length)/genome_size.

        Similarly, the kmer coverage depth, the peak value in kmer frequency curve, is calculated by

        total_kmer_num/genome_size=read_num*(read_length-kmer_size+1)/genome_size.

        So the relationship between base coverage depth and kmer coverage depth is:

        kmer_coverage_depth = base_coverage_depth*(read_length-kmer_size+1)/read_length.

        In your case, kmer_coverage_depth = 50 * (100 - 21 + 1)/100 = 40, which is exactly
        the peak value in you plot.

        best,"
        http://homolog.us

        Comment


        • #19
          Ok, thanks for the nice explanation
          Last edited by mflderks; 09-27-2013, 12:01 AM.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          12 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          52 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          68 views
          0 likes
          Last Post seqadmin  
          Working...
          X