  • How to estimate genome size

    Hi everyone, how can we actually estimate the genome size if there does not exist a reference genome or any genome that is significantly close enough to your sample?

    One genome size estimation method, used by BGI in assembling the giant panda genome, is based on 17-mer frequencies. I don't quite follow their idea; could anyone help explain it?

    From their supplementary, " Distribution of 17-mer frequency in the raw sequencing reads. We used all reads from the short insert-size libraries (<500bp). The peak depth is at 15X. The peak of 17-mer frequency (M) in reads is correlated with the real sequencing depth (N), read length (L), and kmer length (K), their relations can be expressed in a experienced formula: M = N * (L – K + 1) / L. Then, we divided the total sequence length by the real sequencing depth and obtained an estimated the genome size of 2.46 Gb."

    For reference, the paper is titled "The sequence and de novo assembly of the giant panda genome".

  • #2
    I would like to know this as well. K-mer distribution in short read data seems to be the key, but that's as far as my understanding of this goes. I did find a tool for k-mer counting and genome size estimation: bioinformatic_tools:jellyfish, JELLYFISH - Fast, Parallel k-mer Counting for DNA.

    The accompanying article for Jellyfish: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers


    • #3
      Originally posted by figure002 View Post
      I would like to know this as well. K-mer distribution in short read data seems to be the key, but that's as far as my understanding of this goes. I did find a tool for k-mer counting and genome size estimation: bioinformatic_tools:jellyfish, JELLYFISH - Fast, Parallel k-mer Counting for DNA.

      The accompanying article for Jellyfish: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
      Hey, thanks for the information. So far I have only gone through the program Tallymer.

      It works great, but if you have a very large amount of read data, the suffix tree file becomes tremendously large. There is also a paper from Waterman showing how to predict genome size from k-mers: "Estimating the Repeat Structure and Length of DNA Sequences Using ℓ-Tuples".

      Do you have any idea on the speed and memory consumption of jellyfish?


      • #4
        Thanks for pointing me to Tallymer. I was looking for more tools to test on our data.

        Another tool I found is GSP, but it doesn't look like a very decent tool. The source package is messy and I couldn't find it in any publication.

        I've only just finished compiling jellyfish (jellyfish 1.1 has some compilation issues, but we just received a patch from the developer that fixes these issues). So I can't tell you anything about performance right now. I'll report back as soon as I have some results.


        • #5
          Originally posted by figure002 View Post
          Thanks for pointing me to Tallymer. I was looking for more tools to test on our data.

          Another tool I found is GSP, but it doesn't look like a very decent tool. The source package is messy and I couldn't find it in any publication.

          I've only just finished compiling jellyfish (jellyfish 1.1 has some compilation issues, but we just received a patch from the developer that fixes these issues). So I can't tell you anything about performance right now. I'll report back as soon as I have some results.
          Hi figure002, I guess the compilation error you encountered is "warnings being treated as errors". In my case, simply removing every "-Werror" from the Makefiles fixed it.

          Hope this may help.


          • #6
            Originally posted by yanij View Post
            Hi figure002, I guess the compilation error you encountered is "warnings being treated as errors". In my case, simply removing every "-Werror" from the Makefiles fixed it.

            Hope this may help.
            True, that's what I did at first. But the developer was kind enough to fix the source, which makes removing -Werror unnecessary. He said he would upload the new package, which should at least make things easier.


            • #7
              Originally posted by yanij View Post
              Do you have any idea on the speed and memory consumption of jellyfish?
              I did some runs of both jellyfish and tallymer on test data, and I noticed that jellyfish is much faster (it was running with 32 threads) when it comes to k-mer counting. According to the Jellyfish paper, "Jellyfish offers a much faster and more memory-efficient solution" than suffix arrays, which are used in Tallymer I believe.

              At this moment I'm running "tallymer suffixerator" and "jellyfish count" next to each other on a machine with 32 cores. "jellyfish count" is using around 0.2% memory, while "tallymer suffixerator" is using around 3.0% memory.

              Thus so far I can confirm that Jellyfish is indeed faster and more memory efficient.


              • #8
                Originally posted by figure002 View Post
                I would like to know this as well. K-mer distribution in short read data seems to be the key, but that's as far as my understanding of this goes. I did find a tool for k-mer counting and genome size estimation: bioinformatic_tools:jellyfish, JELLYFISH - Fast, Parallel k-mer Counting for DNA.

                The accompanying article for Jellyfish: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
                I think that this estimation is based on a gamma distribution and is similar to the calculations made by Quake, Ray, ABySS, etc., and depends on something like 10-15X coverage. With low coverage, my experience is that the distribution of k-mer copies shows an exponential decay, with the rate of decay depending on the repeat content and k-mer length. Does anyone else have experience in trying to make these calculations? It would be great if there were a way to make a reasonable estimate from lower-coverage shotgun data.


                • #9
                  Originally posted by SES View Post
                  I think that this estimation is based on a gamma distribution and is similar to the calculations made by Quake, Ray, ABySS, etc. and depend on something like 10-15X coverage. With low coverage, my experience is that the distribution of k-mer copies will show an exponential decay with the rate of decay depending on the repeat content and k-mer length. Does anyone else have experience in trying to make these calculations? It would be great if there was a way to make a reasonable estimate from lower coverage shotgun data.
                  This paper presents a good way to estimate genome size from low-coverage shotgun data, though it requires an assembled transcriptome as a reference...


                  • #10
                    Originally posted by Qingl View Post
                    This paper presents a good way to estimate genome size from low-coverage shotgun data, though it requires an assembled transcriptome as a reference...
                    Our results provide the first global view of venom-duct transcription in any cone snail. A notable feature of Conus bullatus venoms is the breadth of A-superfamily peptides expressed in the venom duct, which are unprecedented in their structural diversity. We also find SNP rates within conopeptides …


                    • #11
                      This is an interesting approach that I had not seen. It is not clear how/if the efficacy of the method was evaluated but it is something to explore. Thanks for the response.


                      • #12
                        Originally posted by SES View Post
                        This is an interesting approach that I had not seen. It is not clear how/if the efficacy of the method was evaluated but it is something to explore. Thanks for the response.
                        Sure. The method includes a control sample that supports its efficacy.


                        • #13
                          Originally posted by Qingl View Post
                          Sure. The method includes a control sample that supports its efficacy.
                          Excellent. Do you mean control with known genome size? That is what I was wondering. I'll take a closer look at the paper and see if I can apply the methods to my system.


                          • #14
                            Originally posted by SES View Post
                            Excellent. Do you mean control with known genome size? That is what I was wondering. I'll take a closer look at the paper and see if I can apply the methods to my system.
                            Yes, I agree it's an excellent method~~~~


                            • #15
                              Hi yanij

                              I don't know how useful this will be to you, given the time since your post, but just in case...

                              The BGI method is based around the observation that the coverage achieved for a genome is based on the size of the genome and the total amount of sequence data generated. So if you sequence 100 Mb of data for a 10 Mb genome, you should get ~10-fold coverage.

                              Or as a simple equation: depth of coverage = total data / genome length.

                              If you have any two of these parameters (i.e., you know the amount of data you generated and you know the genome size) obviously you can calculate the third.
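As a minimal sketch of that relationship (the function names are illustrative and do not come from any tool mentioned in this thread):

```python
def coverage_depth(total_bases: float, genome_size: float) -> float:
    """Depth of coverage = total data / genome length."""
    return total_bases / genome_size


def genome_size_from_depth(total_bases: float, depth: float) -> float:
    """Same equation rearranged: genome length = total data / depth."""
    return total_bases / depth


# The example above: 100 Mb of data for a 10 Mb genome.
print(coverage_depth(100e6, 10e6))  # 10.0
```

The second function is the direction used for de novo work: the total data is known, so an estimate of the depth gives the genome size.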

                              Usually when doing de novo genome sequencing you don't know the genome size, and since you don't have the genome, you don't know the coverage, but you do know how much data you've generated (i.e., the 'total sequence length' to use BGI's term). To estimate the genome size, you then need to estimate the coverage depth (N).

                              To do this, you can calculate the kmer frequency within your read data (most people will do this for one of their small-insert libraries, for which they have the most information). That is, you chop all of the reads you've generated into kmers. A kmer size of 17 is the most common: it is long enough to yield fairly specific sequences (meaning it's unlikely that a kmer is repeated throughout the genome by chance), but short enough to give you lots of data. You then count how often each 17-mer in your data occurs among all of the reads and build a frequency histogram from these counts. For non-repetitive regions of the genome, this histogram should be roughly normally distributed around a single peak (although in real data you will also see a sharp rise toward a frequency of 1 because of rare sequencing errors etc.). This peak value (or peak depth) is the mean kmer coverage for your data.
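The counting step can be sketched in plain Python as below. This is only a toy illustration, not how Jellyfish, Tallymer or kmerfreq work internally: it holds every kmer in memory, counts kmers exactly as they appear in the read (no reverse-complement merging), and the `min_depth` cutoff used to skip the error kmers near 1 is an assumed parameter.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every kmer, as written in the read, across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length L yields L - k + 1 kmers (none if L < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map each frequency to the number of distinct kmers seen that often."""
    return Counter(counts.values())


def peak_depth(histogram, min_depth=2):
    """Return the frequency with the most distinct kmers, ignoring
    frequencies below min_depth, where sequencing-error kmers pile up."""
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no kmers at or above min_depth")
    return max(candidates, key=candidates.get)


# Toy data with k=4: one sequence covered three times, one covered once.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTTTGG"]
hist = kmer_histogram(count_kmers(reads, k=4))
print(peak_depth(hist))  # 3
```

For real data sets a dedicated counter is needed; the sketch only shows what the histogram and its peak represent.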

                              You can relate this value to the actual coverage of your genome using the formula M = N * (L - k + 1) / L, where M is the mean kmer coverage, N is the actual coverage of the genome, L is the mean read length, and k is the kmer size.

                              L - k + 1 gives you the number of kmers created per read.

                              So basically what the formula says is the kmer coverage for a genome is equal to the mean read coverage * the number of kmers per read divided by the read length.

                              Because you know L (your mean read length) and k (the kmer size you used to estimate peak kmer coverage), and you've calculated M (SOAPdenovo comes with a script called kmerfreq that will do this), you simply solve the equation for N:

                              N = M / ((L - k + 1) / L) = M * L / (L - k + 1)

                              Once you have that, divide your total sequence data by N and you have your genome estimate.
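Putting the steps together, a minimal sketch of the calculation might look like this. The function names are illustrative, and the numbers in the example are made up for easy arithmetic; they are not taken from the panda paper.

```python
def base_coverage(peak_kmer_depth, read_length, k):
    """Solve M = N * (L - k + 1) / L for N."""
    kmers_per_read = read_length - k + 1
    if kmers_per_read <= 0:
        raise ValueError("read length must be at least k")
    return peak_kmer_depth * read_length / kmers_per_read


def estimate_genome_size(total_bases, peak_kmer_depth, read_length, k):
    """Genome size = total sequence data / actual coverage N."""
    return total_bases / base_coverage(peak_kmer_depth, read_length, k)


# Hypothetical inputs: M = 42, L = 100 bp, k = 17, 5 Gb of reads.
n = base_coverage(42, 100, 17)              # 42 * 100 / 84
size = estimate_genome_size(5e9, 42, 100, 17)
print(n, size)  # 50.0 100000000.0
```

Note that N is always somewhat larger than M, because each read contributes L bases but only L - k + 1 kmers.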

                              Hope that helps.
                              Last edited by aaronrjex; 01-10-2013, 05:03 PM.
