SEQanswers: How to estimate the genome size


04-27-2011, 08:02 AM   #1
yanij
Junior Member

Location: Hong Kong

Join Date: Apr 2011
Posts: 3

How to estimate the genome size

Hi everyone, how can we estimate the genome size when there is no reference genome, nor any genome close enough to our sample? One estimation method, used by BGI in assembling the giant panda genome, is based on 17-mers. I don't quite get their idea; would anyone help explain it? From their supplementary material: "Distribution of 17-mer frequency in the raw sequencing reads. We used all reads from the short insert-size libraries (<500 bp). The peak depth is at 15X. The peak of 17-mer frequency (M) in reads is correlated with the real sequencing depth (N), read length (L), and k-mer length (K); their relation can be expressed by an empirical formula: M = N * (L - K + 1) / L. Then, we divided the total sequence length by the real sequencing depth and obtained an estimated genome size of 2.46 Gb." FYR, the paper is titled "The sequence and de novo assembly of the giant panda genome".
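The arithmetic in that supplementary passage can be sketched in a few lines. This is only an illustration of the quoted formula, not BGI's pipeline; the read length, k-mer size, and totals below are hypothetical numbers chosen so the result matches the 2.46 Gb figure.

```python
# Sketch of the estimation quoted above (illustrative values only).

def kmer_depth_to_base_depth(M, L, K):
    """Invert M = N * (L - K + 1) / L to recover the real (base)
    sequencing depth N from the k-mer frequency peak M, the read
    length L, and the k-mer size K."""
    return M * L / (L - K + 1)

def estimate_genome_size(total_bases, M, L, K):
    """Genome size = total sequenced bases / real sequencing depth."""
    return total_bases / kmer_depth_to_base_depth(M, L, K)

# Hypothetical numbers: a 40X k-mer peak from 100 bp reads and 21-mers
# corresponds to 50X base depth; 123 Gb of raw sequence then implies a
# ~2.46 Gb genome.
print(kmer_depth_to_base_depth(40, 100, 21))     # 50.0
print(estimate_genome_size(123e9, 40, 100, 21))  # 2460000000.0
```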
05-09-2011, 02:17 AM   #2
figure002
Member

Location: Rotterdam

Join Date: Jan 2011
Posts: 11

I would like to know this as well. The k-mer distribution in short-read data seems to be the key, but that's as far as my understanding goes. I did find a tool for k-mer counting and genome size estimation: JELLYFISH - Fast, Parallel k-mer Counting for DNA. The accompanying article for Jellyfish: "A fast, lock-free approach for efficient parallel counting of occurrences of k-mers".
05-09-2011, 02:41 AM   #3
yanij
Junior Member

Location: Hong Kong

Join Date: Apr 2011
Posts: 3

Quote:
Originally Posted by figure002: I would like to know this as well. The k-mer distribution in short-read data seems to be the key, but that's as far as my understanding goes. I did find a tool for k-mer counting and genome size estimation: JELLYFISH - Fast, Parallel k-mer Counting for DNA. The accompanying article for Jellyfish: "A fast, lock-free approach for efficient parallel counting of occurrences of k-mers".
Hey, thanks for the information. So far I have only tried the program Tallymer.

It works great, but with a very large read set the suffix-tree files become tremendously large. There is another paper, from Waterman's group, showing how to predict the genome size from k-mer counts: "Estimating the Repeat Structure and Length of DNA Sequences Using ℓ-Tuples".

Do you have any idea on the speed and memory consumption of jellyfish?

05-11-2011, 12:28 AM   #4
figure002
Member

Location: Rotterdam

Join Date: Jan 2011
Posts: 11

Thanks for pointing me to Tallymer. I was looking for more tools to test on our data. Another tool I found is GSP, but it doesn't look like a very decent tool: the source package is messy and I couldn't find an accompanying publication. I've only just finished compiling Jellyfish (version 1.1 has some compilation issues, but we just received a patch from the developer that fixes them), so I can't tell you anything about performance right now. I'll report back as soon as I have some results.
05-11-2011, 02:59 AM   #5
yanij
Junior Member

Location: Hong Kong

Join Date: Apr 2011
Posts: 3

Quote:
Originally Posted by figure002: Thanks for pointing me to Tallymer. I was looking for more tools to test on our data. Another tool I found is GSP, but it doesn't look like a very decent tool: the source package is messy and I couldn't find an accompanying publication. I've only just finished compiling Jellyfish (version 1.1 has some compilation issues, but we just received a patch from the developer that fixes them), so I can't tell you anything about performance right now. I'll report back as soon as I have some results.
Hi figure002, I guess the compilation error you've encountered is "warnings being treated as errors". In my case, simply removing all the "-Werror" flags from the Makefiles did the trick.

Hope this helps.

05-11-2011, 04:05 AM   #6
figure002
Member

Location: Rotterdam

Join Date: Jan 2011
Posts: 11

Quote:
Originally Posted by yanij: Hi figure002, I guess the compilation error you've encountered is "warnings being treated as errors". In my case, simply removing all the "-Werror" flags from the Makefiles did the trick. Hope this helps.
True, that's what I did at first. But the developer was kind enough to fix the source, which makes removing -Werror unnecessary. He said he would upload the new package; that should at least make things easier.

05-17-2011, 07:40 AM   #7
figure002
Member

Location: Rotterdam

Join Date: Jan 2011
Posts: 11

Quote:
 Originally Posted by yanij Do you have any idea on the speed and memory consumption of jellyfish?
I did some runs of both Jellyfish and Tallymer on test data, and I noticed that Jellyfish is much faster at k-mer counting (it was running with 32 threads). According to the Jellyfish paper, it offers "a much faster and more memory-efficient solution" than suffix arrays, which I believe Tallymer uses.

At the moment I'm running "tallymer suffixerator" and "jellyfish count" side by side on a machine with 32 cores: "jellyfish count" is using around 0.2% of memory, while "tallymer suffixerator" is using around 3.0%.

So far I can confirm that Jellyfish is indeed faster and more memory-efficient.

11-14-2012, 06:52 AM   #8
SES
Senior Member

Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275

Quote:
Originally Posted by figure002: I would like to know this as well. The k-mer distribution in short-read data seems to be the key, but that's as far as my understanding goes. I did find a tool for k-mer counting and genome size estimation: JELLYFISH - Fast, Parallel k-mer Counting for DNA. The accompanying article for Jellyfish: "A fast, lock-free approach for efficient parallel counting of occurrences of k-mers".
I think this estimation is based on a gamma distribution and is similar to the calculations made by Quake, Ray, ABySS, etc., and it depends on having something like 10-15X coverage. With low coverage, my experience is that the distribution of k-mer copies shows an exponential decay, with the rate of decay depending on the repeat content and the k-mer length. Does anyone else have experience with these calculations? It would be great if there were a way to make a reasonable estimate from lower-coverage shotgun data.
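The coverage-dependence described here can be illustrated with a toy model: treat each distinct k-mer's observed count as a Poisson draw around the mean k-mer depth. This is my own sketch (it ignores sequencing errors and repeats, which shape real histograms), not code from any of the tools mentioned:

```python
import math
import random
from collections import Counter

def poisson(lam, rng):
    """Sample a Poisson variate via Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

def kmer_histogram_mode(lam, n=100_000, seed=42):
    """Mode of a simulated k-mer frequency histogram at mean depth lam."""
    rng = random.Random(seed)
    counts = Counter(poisson(lam, rng) for _ in range(n))
    counts.pop(0, None)  # unseen k-mers never appear in a real histogram
    return max(counts, key=counts.get)

print(kmer_histogram_mode(15))  # a clear peak near the true depth, ~15
print(kmer_histogram_mode(3))   # peak collapses to 1-3: uninformative
```

At ~15X k-mer depth the histogram mode lands near the true depth, so the peak-based estimate works; at ~3X the mass piles up at the lowest counts and the mode no longer tracks depth, consistent with the decay described above.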

11-14-2012, 07:24 AM   #9
Qingl
Member

Location: salt lake city

Join Date: Sep 2012
Posts: 17

Quote:
Originally Posted by SES: I think this estimation is based on a gamma distribution and is similar to the calculations made by Quake, Ray, ABySS, etc., and it depends on having something like 10-15X coverage. With low coverage, my experience is that the distribution of k-mer copies shows an exponential decay, with the rate of decay depending on the repeat content and the k-mer length. Does anyone else have experience with these calculations? It would be great if there were a way to make a reasonable estimate from lower-coverage shotgun data.
This paper presents a good way to estimate genome size from low-coverage shotgun data, though it requires an assembled transcriptome as a reference...

11-14-2012, 07:25 AM   #10
Qingl
Member

Location: salt lake city

Join Date: Sep 2012
Posts: 17

Quote:
Originally Posted by Qingl: This paper presents a good way to estimate genome size from low-coverage shotgun data, though it requires an assembled transcriptome as a reference...
http://www.ncbi.nlm.nih.gov/pubmed/21266071

11-14-2012, 08:27 AM   #11
SES
Senior Member

Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275

Quote:
 Originally Posted by Qingl http://www.ncbi.nlm.nih.gov/pubmed/21266071
This is an interesting approach that I had not seen. It is not clear how/if the efficacy of the method was evaluated but it is something to explore. Thanks for the response.

11-14-2012, 10:21 AM   #12
Qingl
Member

Location: salt lake city

Join Date: Sep 2012
Posts: 17

Quote:
 Originally Posted by SES This is an interesting approach that I had not seen. It is not clear how/if the efficacy of the method was evaluated but it is something to explore. Thanks for the response.
Sure. The method has a control sample that supports its efficacy.

11-14-2012, 10:50 AM   #13
SES
Senior Member

Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275

Quote:
Originally Posted by Qingl: Sure. The method has a control sample that supports its efficacy.
Excellent. Do you mean control with known genome size? That is what I was wondering. I'll take a closer look at the paper and see if I can apply the methods to my system.

11-14-2012, 03:57 PM   #14
Qingl
Member

Location: salt lake city

Join Date: Sep 2012
Posts: 17

Quote:
 Originally Posted by SES Excellent. Do you mean control with known genome size? That is what I was wondering. I'll take a closer look at the paper and see if I can apply the methods to my system.
Yes, I agree, it's an excellent method.

01-22-2013, 01:25 PM   #16
SES
Senior Member

Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275

Quote:
This is a very helpful description of BGI's methods, but I have one question. How do you determine M (what you call the mean k-mer coverage) from the k-mer frequency histogram produced by SOAPdenovo's pregraph program (the file with the *.kmerFreq extension), or by any other method? The first post in this thread quotes the methods of the panda genome paper as saying M is the peak k-mer coverage (on a roughly normal distribution, I guess), and that is what is challenging to calculate.
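One simple way to read M off such a histogram is to skip the initial error-driven descending part and take the mode of what remains. A minimal sketch, assuming the file has already been parsed into a list where index i holds the number of distinct k-mers occurring exactly i times; this is a generic heuristic of mine, not SOAPdenovo's own code:

```python
def kmer_peak(hist):
    """Estimate the coverage peak M from a k-mer frequency histogram.

    hist[i] = number of distinct k-mers seen exactly i times
    (index 0 is unused). Sequencing errors produce a sharp spike at
    very low frequencies, so we first walk past that initial
    descending part, then take the maximum of what remains."""
    # Walk past the error spike: find the first local minimum.
    i = 1
    while i + 1 < len(hist) and hist[i + 1] < hist[i]:
        i += 1
    # The mode of the remaining distribution is the k-mer depth M.
    tail = hist[i:]
    return i + max(range(len(tail)), key=tail.__getitem__)

# Synthetic histogram: an error spike decaying from frequency 1,
# then a coverage bump peaking at frequency 9.
hist = [0, 900, 400, 120, 60, 50, 55, 70, 90, 100, 95, 80, 40]
print(kmer_peak(hist))  # 9
```

On very low-coverage data (as discussed above) the histogram may decay monotonically, in which case no meaningful local minimum exists and this heuristic breaks down.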

04-10-2013, 04:02 PM   #17
samanta
Senior Member

Location: Seattle

Join Date: Feb 2010
Posts: 109

Hello folks, I am trying to understand the genome size estimation method, and here is the part that is not clear to me. BGI estimates coverage depth from the peak of the frequency histogram, but in highly repetitive genomes the peak shifts down. E.g., check the second chart here, where we generated simulated reads from the sea urchin genome at 50X coverage, but the histogram peak came out at ~40 due to repeats: http://www.homolog.us/blogs/2011/09/...n-k-mer-world/

If that is true, how does BGI estimate genome size so precisely? In a later section of the bat paper, they claimed that the assembly size is close to the 'estimated size', which is puzzling given the imprecision of the genome size estimation. Maybe I am missing something. Please help!

Edit: I am rerunning the above simulation to make sure everything is done correctly. Results will be reported here.
__________________
http://homolog.us
Last edited by samanta; 04-10-2013 at 04:59 PM.
04-10-2013, 05:51 PM   #18
samanta
Senior Member

Location: Seattle

Join Date: Feb 2010
Posts: 109

Lizhenyu from BGI explained where I made the error in my thinking. When a 100-nucleotide read is split into 21-mers, it produces only 80 k-mers, not 100. Here is his full response, which agrees with aaronrjex's post above.

"Hi, I think you mixed up the concepts of base coverage depth and k-mer coverage depth. When you referred to 50X genome coverage, it meant base coverage, which is obtained by total_base_num/genome_size = (read_num*read_length)/genome_size. Similarly, the k-mer coverage depth, the peak value in the k-mer frequency curve, is calculated by total_kmer_num/genome_size = read_num*(read_length-kmer_size+1)/genome_size. So the relationship between base coverage depth and k-mer coverage depth is: kmer_coverage_depth = base_coverage_depth*(read_length-kmer_size+1)/read_length. In your case, kmer_coverage_depth = 50 * (100 - 21 + 1)/100 = 40, which is exactly the peak value in your plot. Best,"
__________________
http://homolog.us
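The point about k-mers per read is easy to verify directly (a toy sketch; the function name is mine):

```python
def kmers(read, k):
    """All overlapping k-mers in a read: a length-L read yields
    L - k + 1 of them, not L."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

read = "A" * 100             # a toy 100 bp read
print(len(kmers(read, 21)))  # 80, matching 100 - 21 + 1
```

Hence 50X base coverage becomes 50 * 80/100 = 40X k-mer coverage, the peak value in the plot.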
09-10-2013, 06:19 AM   #19
mflderks
Junior Member

Location: Nijmegen

Join Date: Apr 2012
Posts: 1

Ok, thanks for the nice explanation.
Last edited by mflderks; 09-27-2013 at 12:01 AM.