SEQanswers

04-27-2011, 08:02 AM   #1
yanij (Junior Member, Hong Kong)
How to estimate the genome size

Hi everyone, how can we estimate the genome size when there is no reference genome, or any genome close enough to our sample, available?

One genome-size estimation method, used by BGI in assembling the giant panda genome, is based on 17-mer frequencies. I don't quite get their idea; could anyone help explain it?

From their supplementary: "Distribution of 17-mer frequency in the raw sequencing reads. We used all reads from the short insert-size libraries (<500 bp). The peak depth is at 15X. The peak of the 17-mer frequency (M) in reads is correlated with the real sequencing depth (N), read length (L), and k-mer length (K); their relation can be expressed by the empirical formula M = N * (L - K + 1) / L. Then, we divided the total sequence length by the real sequencing depth and obtained an estimated genome size of 2.46 Gb."

For your reference, the paper is titled "The sequence and de novo assembly of the giant panda genome".

05-09-2011, 02:17 AM   #2
figure002 (Member, Rotterdam)

I would like to know this as well. K-mer distribution in short-read data seems to be the key, but that's as far as my understanding goes. I did find a tool for k-mer counting and genome size estimation: Jellyfish (JELLYFISH - Fast, Parallel k-mer Counting for DNA).

The accompanying article for Jellyfish: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
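
For anyone (like me) trying to get an intuition for what these tools actually compute, here is a minimal pure-Python sketch of k-mer counting and the resulting frequency histogram. It is purely illustrative (the toy reads are made up); on real data you would use Jellyfish or Tallymer, which are far faster and more memory-efficient.

Code:

from collections import Counter

def kmer_histogram(reads, k=17):
    """Count every k-mer across all reads, then histogram how many
    distinct k-mers occur 1x, 2x, 3x, ... times (the k-mer frequency curve)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:   # skip ambiguous bases
                counts[kmer] += 1
    return Counter(counts.values())   # occurrence depth -> number of distinct k-mers

# Toy usage with made-up reads; replace with a proper FASTQ/FASTA parser.
reads = ["ACGTACGTACGTACGTACGTACGT", "CGTACGTACGTACGTACGTACGTA"]
for depth, n_kmers in sorted(kmer_histogram(reads, k=17).items()):
    print(depth, n_kmers)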

05-09-2011, 02:41 AM   #3
yanij (Junior Member, Hong Kong)

Quote: Originally Posted by figure002
Hey, thanks for the information. So far I have only worked with the program Tallymer.

It works great, but if you have a very large amount of read data, the suffix tree files become tremendously large. There is another paper, from Waterman, showing how to predict genome size from k-mers: "Estimating the Repeat Structure and Length of DNA Sequences Using ℓ-Tuples".

Do you have any idea on the speed and memory consumption of jellyfish?

05-11-2011, 12:28 AM   #4
figure002 (Member, Rotterdam)

Thanks for pointing me to Tallymer. I was looking for more tools to test on our data.

Another tool I found is GSP, but it doesn't look like a very decent tool. The source package is messy and I couldn't find an accompanying publication.

I've only just finished compiling jellyfish (jellyfish 1.1 has some compilation issues, but we just received a patch from the developer that fixes these issues). So I can't tell you anything about performance right now. I'll report back as soon as I have some results.

05-11-2011, 02:59 AM   #5
yanij (Junior Member, Hong Kong)

Quote: Originally Posted by figure002
Hi figure002, I guess the compilation error you've encountered is "warnings being treated as errors". In my case, simply removing all the "-Werror" flags from the Makefiles did the trick.

Hope this may help.

05-11-2011, 04:05 AM   #6
figure002 (Member, Rotterdam)

Quote: Originally Posted by yanij
True, that's what I did at first. But the developer was kind enough to fix the source, which makes removing the -Werror flags unnecessary. He said he would upload the new package; that should at least make things easier.

05-17-2011, 07:40 AM   #7
figure002 (Member, Rotterdam)

Quote: Originally Posted by yanij
Do you have any idea on the speed and memory consumption of jellyfish?
I did some runs of both Jellyfish and Tallymer on test data, and I noticed that Jellyfish is much faster at k-mer counting (it was running with 32 threads). According to the Jellyfish paper, it "offers a much faster and more memory-efficient solution" than suffix arrays, which I believe are what Tallymer uses.

At the moment I'm running "tallymer suffixerator" and "jellyfish count" next to each other on a machine with 32 cores. "jellyfish count" is using around 0.2% of memory, while "tallymer suffixerator" is using around 3.0%.

So far, then, I can confirm that Jellyfish is indeed faster and more memory-efficient.

11-14-2012, 06:52 AM   #8
SES (Senior Member, Vancouver, BC)

Quote: Originally Posted by figure002
I think this estimation is based on a gamma distribution, is similar to the calculations made by Quake, Ray, ABySS, etc., and depends on something like 10-15X coverage. With low coverage, my experience is that the distribution of k-mer copies shows an exponential decay, with the rate of decay depending on the repeat content and the k-mer length. Does anyone else have experience with making these calculations? It would be great if there were a way to make a reasonable estimate from lower-coverage shotgun data.

11-14-2012, 07:24 AM   #9
Qingl (Member, Salt Lake City)

Quote: Originally Posted by SES
This paper presents a good way to estimate genome size with low-coverage shotgun data, though it requires an assembled transcriptome as a reference...

11-14-2012, 07:25 AM   #10
Qingl (Member, Salt Lake City)

Quote: Originally Posted by Qingl
http://www.ncbi.nlm.nih.gov/pubmed/21266071

11-14-2012, 08:27 AM   #11
SES (Senior Member, Vancouver, BC)

Quote: Originally Posted by Qingl
This is an interesting approach that I had not seen. It is not clear how/if the efficacy of the method was evaluated but it is something to explore. Thanks for the response.

11-14-2012, 10:21 AM   #12
Qingl (Member, Salt Lake City)

Quote: Originally Posted by SES
Sure. The method includes a control sample, which supports its efficacy.

11-14-2012, 10:50 AM   #13
SES (Senior Member, Vancouver, BC)

Quote: Originally Posted by Qingl
Excellent. Do you mean a control with a known genome size? That is what I was wondering. I'll take a closer look at the paper and see if I can apply the methods to my system.

11-14-2012, 03:57 PM   #14
Qingl (Member, Salt Lake City)

Quote: Originally Posted by SES
Yes. And I agree, it's an excellent method.

01-10-2013, 04:01 PM   #15
aaronrjex (Junior Member, Australia)

Hi yanij

I don't know how useful this will be to you given the time since your post, but just in case....

The BGI method is based on the observation that the coverage achieved for a genome depends on the size of the genome and the total amount of sequence data generated. So if you sequence 100 Mb of data for a 10 Mb genome, you should get ~10-fold coverage.

Or, as a simple equation: depth of coverage = total data / genome length.

If you know any two of these parameters (e.g., the amount of data you generated and the genome size), you can obviously calculate the third.

Usually when doing de novo genome sequencing you don't know the genome size, and since you don't have the genome, you don't know the coverage, but you do know how much data you've generated (i.e., the 'total sequence length' to use BGI's term). To estimate the genome size, you then need to estimate the coverage depth (N).

To do this, you can calculate the k-mer frequency within your read data (most people do this for one of their small-insert libraries, for which they have the most information). That means you chop all of the reads you've generated into k-mers (a k-mer of 17 is the most common choice, as it is long enough to yield fairly specific sequences, meaning it is unlikely that a given k-mer is repeated throughout the genome by chance, but short enough to give you lots of data). You then count how often each 17-mer represented in your data occurs among all of the reads and build a frequency histogram of this information. For non-repetitive regions of the genome, this histogram should be roughly normally distributed around a single peak (although in real data you will also see a large spike near a depth of 1, caused by rare sequencing errors, etc.). This peak value (or peak depth) is the mean k-mer coverage of your data.

You can relate this value to the actual coverage of your genome using the formula M = N * (L - K + 1) / L, where M is the mean k-mer coverage, N is the actual coverage of the genome, L is the mean read length, and K is the k-mer size.

L - K + 1 gives you the number of k-mers created per read.

So basically, the formula says that the k-mer coverage of a genome equals the mean read coverage multiplied by the number of k-mers per read, divided by the read length.

Because you know L (your mean read length) and K (the k-mer size you used to estimate the peak k-mer coverage), and you have calculated M (SOAPdenovo comes with a program called kmerfreq that will do this), you simply solve the equation for N:

N = M / ((L - K + 1) / L), i.e. N = M * L / (L - K + 1).

Once you have that, divide your total sequence data by N and you have your genome size estimate.

Hope that helps.
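
To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. All the numbers in the example are invented purely for illustration; they are not from any real dataset.

Code:

def estimate_genome_size(total_bases, peak_kmer_depth, read_length, k):
    """Genome size from the peak of the k-mer frequency histogram.
    M = N * (L - K + 1) / L  =>  N = M * L / (L - K + 1),
    then genome_size = total_bases / N."""
    n = peak_kmer_depth * read_length / (read_length - k + 1)  # base coverage depth N
    return total_bases / n

# Invented example: 200 Gb of reads, 100 bp reads, 17-mer peak at 60x.
size = estimate_genome_size(total_bases=200e9, peak_kmer_depth=60,
                            read_length=100, k=17)
print("Estimated genome size: %.2f Gb" % (size / 1e9))   # ~2.80 Gb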

Last edited by aaronrjex; 01-10-2013 at 04:03 PM.

01-22-2013, 01:25 PM   #16
SES (Senior Member, Vancouver, BC)

Quote: Originally Posted by aaronrjex
This is a very helpful description of BGI's methods but I have one question. How do you determine M, what you call the mean k-mer coverage, from the k-mer frequency histogram produced by soapdenovo's pregraph program (the file with the *.kmerFreq extension), or with any other method? The first post in this thread quotes the methods from the Panda genome paper as saying M is the peak k-mer coverage (I guess on a normal distribution), and that is what is challenging to calculate.
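
The only approach I can think of is to take the depth of the highest point of the histogram after skipping the low-depth spike caused by sequencing errors. A rough Python sketch of that idea is below, assuming a plain two-column (depth, count) histogram file such as the output of "jellyfish histo"; the file name and the error-tail cutoff are just placeholders to tune for your own data.

Code:

def peak_kmer_depth(histo_file, min_depth=5):
    """Depth of the highest peak in a k-mer frequency histogram,
    skipping depths below min_depth to avoid the sequencing-error spike."""
    best_depth, best_count = None, -1
    with open(histo_file) as fh:
        for line in fh:
            depth, count = (int(x) for x in line.split()[:2])
            if depth >= min_depth and count > best_count:
                best_depth, best_count = depth, count
    return best_depth

# Usage (the file name is hypothetical):
# M = peak_kmer_depth("reads_17mer.histo")
# print("Peak k-mer depth M =", M)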

04-10-2013, 04:02 PM   #17
samanta (Senior Member, Seattle)

Hello folks,

I am trying to understand the genome size estimation method, and here is the part that is not clear to me.

BGI estimates the coverage depth from the peak of the frequency histogram, but in highly repetitive genomes the peak shifts down. For example, check the second chart here, where we generated simulated reads from the sea urchin genome at 50X coverage, but the histogram peak came out at ~40 due to repeats.

http://www.homolog.us/blogs/2011/09/...n-k-mer-world/

If that is true, how does BGI estimate genome size so precisely? In a later section of the bat paper, they claimed that the assembly size is close to the 'estimated size', which is puzzling given the imprecision of the genome size estimation.

Maybe I am missing something. Please help!

---------------

Edit. I am rerunning the above simulation to make sure everything is done correctly. Results will be reported here.
__________________
http://homolog.us

Last edited by samanta; 04-10-2013 at 04:59 PM.

04-10-2013, 05:51 PM   #18
samanta (Senior Member, Seattle)

Lizhenyu from BGI explained where my thinking went wrong. When a read has 100 nucleotides and is split into 21-mers, the read produces only 80 k-mers, not 100. Here is his full response, which agrees with aaronrjex's post above.



"Hi,

I think you mixed up the concepts of base coverage depth and k-mer coverage depth.
When you referred to 50X genome coverage, it meant base coverage, which is obtained by

total_base_num/genome_size = (read_num*read_length)/genome_size.

Similarly, the k-mer coverage depth, the peak value in the k-mer frequency curve, is calculated by

total_kmer_num/genome_size = read_num*(read_length-kmer_size+1)/genome_size.

So the relationship between base coverage depth and k-mer coverage depth is:

kmer_coverage_depth = base_coverage_depth*(read_length-kmer_size+1)/read_length.

In your case, kmer_coverage_depth = 50 * (100 - 21 + 1)/100 = 40, which is exactly
the peak value in your plot.

best,"
__________________
http://homolog.us

09-10-2013, 06:19 AM   #19
mflderks (Junior Member, Nijmegen)

Ok, thanks for the nice explanation

Last edited by mflderks; 09-27-2013 at 12:01 AM.