SEQanswers How to calculate nucleotide diversity from SNP data from multiple samples
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post Genomics101 Bioinformatics 0 08-20-2013 12:06 PM yfwang Bioinformatics 1 04-30-2013 09:58 PM puggie RNA Sequencing 8 04-09-2013 04:23 AM andtill Bioinformatics 1 11-09-2012 01:49 PM shuang Bioinformatics 2 09-07-2011 03:06 PM

 09-12-2013, 01:32 AM #1 Sonderkar Junior Member   Location: Denmarrk, Aalborg Join Date: Jan 2009 Posts: 3 How to calculate nucleotide diversity from SNP data from multiple samples Hi all, I have a collection of SNP data from a tetraploid population, and I would like to calculate the nucleotide diversity. The generel forumula is: where xi and xj are the respective frequencies of the ith and jth sequences, (PI)ij is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample. This is fairly simple if we have sequence data (and not SNP data). If we have these 3 sequences in a (very small) population, and we have sampled them each 1 time, we need to compare 1vs2, 1vs3, and 2vs3 >1 ATGCGTTTTT >2 ATGGGTTTTT >3 ATGCGTTTTA We now get: (PI)12 = 1/10 (they differ at pos. 4) (PI)13 = 1/10 (they differ at pos. 10) (PI)23 = 2/10 (they differ at pos. 4 and 10) n = 3 The respective frequencies are always 1/3 in this case this means that: PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) ) => PI = 2*(1/3^2)*(1/10+1/10+2/10) => PI = (2/9)*(2/5) => PI = 4/45 = 0.0888 I have a (VERY small popoluation of 2), and I have measured 2 SNPs: Individual 1: a G/C SNP at position 3. Individual 2: a T/A SNP at position 10. Lets say the reference sequence is: >ref ATGCGTTTTT Q1: Now I want to calculate the nucleotide diversity, would it then be correct to state that I have actually 4 sequences (4 alleles), meaning that I can calculate PI as follows: Based on the SNP information: >1 Individual 1 - allele 1 ATGCGTTTTT >2 Individual 1 - allele 2 ATCCGTTTTT >3 Individual 2 - allele 1 ATGCGTTTTT >4 Individual 2 - allele 2 ATGCGTTTTA We now get: (PI)12 = 1/10 (they differ at pos. 3) (PI)13 = 0 (they do not differ) (PI)14 = 1/10 (they differ at pos. 10) (PI)23 = 1/10 (they differ at pos. 3) (PI)24 = 2/10 (they differ at pos. 3 and 10) (PI)34 = 1/10 (they differ at pos. 10) n = 4 The respective frequencies are always 1/4, and can be moved out of the parenthesis and squared (see above). in this case PI is: PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) ) => PI = 2*(1/4^2)*(1/10+0+1/10+1/10+2/10+1/10) => PI = (2/16)*(5/10) => PI = 1/16 The question is whether this splitting of SNP information into alleles, and hence sequences is correct? Now, as I stated I have a tetraploid population, and I therefore plan to do the same, but expand each SNP into 4 alleles: I have a (VERY small population of 2), and I have measured 2 SNPs: Individual 1: a G/G/C/C SNP at position 3. Individual 2: a T/T/T/A SNP at position 10. Lets say the reference sequence is still: >ref ATGCGTTTTT Based on the SNP information: >1 Individual 1 - allele 1 ATGCGTTTTT >2 Individual 1 - allele 2 ATGCGTTTTT >3 Individual 1 - allele 3 ATCCGTTTTT >4 Individual 1 - allele 4 ATCCGTTTTT >5 Individual 2 - allele 1 ATGCGTTTTT >6 Individual 2 - allele 2 ATGCGTTTTT >7 Individual 2 - allele 3 ATGCGTTTTT >8 Individual 2 - allele 4 ATGCGTTTTA n = 8 (PI)12 = 0 (they do not differ) (PI)13 = 1/10 (they differ at pos. 3) (PI)14 = 1/10 (they differ at pos. 3) (PI)15 = 0 (they do not differ) (PI)16 = 0 (they do not differ) (PI)17 = 0 (they do not differ) (PI)18 = 1/10 (they differ at pos. 10) (PI)23 = 1/10 (they differ at pos. 3) (PI)24 = 1/10 (they differ at pos. 3) (PI)25 = 0 (they do not differ) (PI)26 = 0 (they do not differ) (PI)27 = 0 (they do not differ) (PI)28 = 1/10 (they differ at pos. 10) (PI)34 = 0 (they do not differ) (PI)35 = 1/10 (they differ at pos. 3) (PI)36 = 1/10 (they differ at pos. 3) (PI)37 = 1/10 (they differ at pos. 3) (PI)38 = 2/10 (they differ at pos. 3 and 10) (PI)45 = 1/10 (they differ at pos. 3) (PI)46 = 1/10 (they differ at pos. 3) (PI)47 = 1/10 (they differ at pos. 3) (PI)48 = 2/10 (they differ at pos. 3 and 10) (PI)56 = 0 (they do not differ) (PI)57 = 0 (they do not differ) (PI)58 = 1/10 (they differ at pos. 10) (PI)67 = 0 (they do not differ) (PI)68 = 1/10 (they differ at pos. 10) (PI)78 = 1/10 (they differ at pos. 10) Now, the respective frequencies are always 1/8, and can be moved out of the parenthesis and squared (see above). in this case PI is: SUM( PI(XJ) ) = 19/10 PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) ) => PI = 2*(1/8^2)*(19/10) => PI = (1/32)*(19/10) => PI = 19/320 = 0.059375 The question is (again) whether this splitting of SNP information into alleles, and hence sequences is correct?
 11-12-2013, 02:50 AM #2 bondsergy Junior Member   Location: Italy Join Date: Nov 2013 Posts: 1 I think there is a mistake in your formula for pi, you are missing a factor n/n-1, which is negligible when n is large and often omitted, but this is not your case. The nucleotide diversity in your 1st example should be equal to 2/15 (not 4/45, total 4 differences out of 30 comparisons=4/30=2/15).
 11-19-2013, 11:34 AM #3 rcapper Member   Location: Austin, TX Join Date: Sep 2011 Posts: 20 I am doing a very similar analysis. However, my SNP data is based off RAD data and the number of individuals genotyped at any given base may change (i.e., the number of "chromosomes" considered per each site). Therefore, using the n/(n-1) factor can't work for my data; instead, I am going through each base and calculating pairwise differences, then averaging over my arbitrary window size. For example, if I have the following five sites, the first two may only be genotyped for four 'chromosomes' and the last three for six. site: 1 2 3 4 5 chr1 A A T G C chr2 A A T G C chr3 A A T C C chr4 A T T C C chr5 [ ][ ]T C G chr6 [ ][ ]A C G site 1: 0 pairwise differences (4 chr sampled) site 2: 3/6 differences (4 chr sampled) site 3: 5/15 (6 chr sampled) site 4: 8/15 (6 sampled) site 5: 8/15 (6 sampled) pi = 0.38 for this five-base window. Another thing that I figured out while working with pi and RAD data is that you have to be really careful if you're using a vcf file! vcfs obviously (usually) only contain variants; however, pi must consider the number of GENOTYPED bases, INCLUDING invariants! Other tools such as vcftools only use the pi positional data and assume all bases not included in the vcf are invariants. This will give you a biased estimate. As far as your specific questions, I think that yes, you are on the right track. I don't think that pi cares if your data is diploid or tetraploid; it simply samples the number of pairwise differences among chromosomes considered, correct? The only real difference between my example here and yours above are that you are considering differences among stretches of bases and I am walking along the chromosome. I think each method should give you the same answer as long as you have the same 'n' along the stretch of bases (and, importantly, you use the n/(n-1) factor as mentioned by bondsergy). I don't, therefore the base-by-base works best for me. You might also find this blog post useful: http://binhe.org/2011/12/29/calculat...mation-method/ I'm not sure this will help you, but I'd like to get it out on the forums in case it helps someone.
 02-23-2014, 11:48 PM #4 travc Junior Member   Location: Davis, CA Join Date: Aug 2013 Posts: 4 I thought I'd offer an improvement. sum( 2j(n-j) / n(n-1) ) only works for bi-allelic loci, but generalizing is easy. If you take a list of allele counts (Ji values) INCLUDING THE MAJOR ALLELE, sum( sum_i( Ji (n-Ji) ) / (n(n-1)) ) ) / L where L is the number of sites/positions compared. (BTW: sum_i is read as "sum over i", and Ji should really be "J sub i") I haven't rigorously done the proof, but it intuitively makes sense and agrees with the value you get by exhaustively looking at all the pair values. So I'm pretty certain it is correct. Ex: 3 A, 2 C, 2 G = (12+10+10)/42 = 16/21 Some python testing code: Code: #!/usr/bin/python from collections import Counter import itertools def calcPi(seq_list, transposed=True): """ seq_list is an iteratable of iterables containing alleles by [position][seq_number] if transposed=False, then seq_list is by [seq_number][position] this format is slightly less efficient """ if( transposed == False ): seq_list = itertools.izip(*seq_list) num_sites = 0 sum_pi = 0 for allele_list in seq_list: counts = Counter(allele_list) n = sum(counts.values()) num_sites += 1 sum_pi += sum(( j*(n-j) for j in counts.values() )) / (n*(n-1.0)) return sum_pi / float(num_sites) # sequence pi is just mean of per-position pi values # Test it! import random import time LEN = 1000 NUM_SEQ = 10 NUM_ITERATIONS = 10 NUCLEOTIDES = ['A','T','C','G'] NUCLEOTIDES.extend(['A']*8) # over-represent 'A' to make sequences more similar #random.seed(0) CMP_EPSILON = 1e-10 # how close do results have to be to call them equal tics = 0 exhaustive_tics = 0 for _ in range(NUM_ITERATIONS): # generate NUM_SEQ random sequences of length LEN (trasposed so it is by [position][seq_number] At = [ [ random.choice(NUCLEOTIDES) for _ in range(NUM_SEQ) ] for _ in range(LEN) ] tic = time.clock() seq_pi = calcPi(At, transposed=True) tics += time.clock()-tic # exhaustive (all-pairs) per-sequence method A = map(list, itertools.izip(*At)) # transpose At so it is [seq_num][pos] tic = time.clock() pair_pi = [] for i in range(len(A)): for j in range(i): t = 0 d = 0 for k in range(len(A[i])): t += 1 if( A[i][k] != A[j][k] ): d += 1 pair_pi.append( 2.0*((1.0/float(len(A)))**2)*(d/float(t)) ) #print i,j, A[i],A[j], d,t, pair_pi[-1] seq_pi_exhaustive = (len(A)/(len(A)-1.0))*sum(pair_pi) exhaustive_tics += time.clock()-tic # output if( abs(seq_pi-seq_pi_exhaustive) > CMP_EPSILON ): print "MISMATCH", seq_pi,'!=', seq_pi_exhaustive print NUM_ITERATIONS, "cycles" print "calcuation took",tics print "exhaustive took",exhaustive_tics
 03-30-2014, 05:22 AM #5 irma42 Junior Member   Location: Indonesia Join Date: Mar 2014 Posts: 1 Adenin base insertion Hi All, Iam Irma, new comer from Bogor Agriculture University, Indonesia. I have 180 DNA sequence samples from duck population. When i analyze the data, why insertion of Adenin base always uniform in all of DNA samples? Another case is, when there are C to A mutation, in another place we found A deletion. Its like an equilibrium to balancq A composition. Any one who can explain this phenomenon? Thank you All
 09-09-2014, 03:32 AM #6 Dmelnet Junior Member   Location: Barcelona, Spain Join Date: Sep 2014 Posts: 5 unbiased nucleotide diversity estimation from SNP data Thanks for all your useful discussion about the topic! The only doubt I have regarding your formulas is a small discrepancy in the denominator of the base pair summation method. Both Travis Collier and Bin He propose the following formula: 2j(n-j)/(n(n-1)) here and in the blog entry mentioned above (http://binhe.org/2011/12/29/calculat...mation-method/). However, Begun et al., 2007 uses a formula that can be transformed to 2j(n-j)/(n(n-j)) (if I didn't make any mistake in the process...). I don't understand where this discrepancy comes from and which formula is the correct one... Another point that is not clear for me is if j reffers to (i)the count of the minor allele, (ii) the count of the alternative allele (either major or minor) or (iii) the count of the derived allele (if ancestral state is available). I could not find any publication explicitly showing the applicability of the base pair summation method for estimating pi... Could you find any? or have you published your results so that I can cite you? The purpose of my analysis is to accurately estimate nucleotide diversity from a cohort of 43 genomes (in fact, two populations of 16 and 27 genomes). Thank you very much! Begun's paper: (http://www.plosbiology.org/article/i...bio.0050310#s3)
09-16-2014, 01:01 AM   #7
dschika
Member

Location: Zurich

Join Date: Mar 2010
Posts: 56

Hi Dmelnet,

I think there is a typo in the Begun paper. If you compare the first part of the formular with the second, I think it should be 2j(n-j)/(n(n-1)).

Quote:
 I could not find any publication explicitly showing the applicability of the base pair summation method for estimating pi... Could you find any? or have you published your results so that I can cite you?
This is something I'm also interested in!

 09-16-2014, 02:43 AM #8 Dmelnet Junior Member   Location: Barcelona, Spain Join Date: Sep 2014 Posts: 5 typo found Thank you dschika! I wanted to use vcf tools for calculating pi but I have realized that, not only treates missing positions in the vcf file as invariant sites but it also treats missing genotypes as invariant sites (hom ref)! I think the correct way would be using the total number of genotyped alleles per position instead of just the total number of alleles per position (genotyped or missing data). Do you know if this is already implemented in any other tool? I will have a look at this: "PopGenome: An Efficient Swiss Army Knife for Population Genomic Analyses in R" (http://mbe.oxfordjournals.org/conten.../molbev.msu136) and I will let you know. If you have experience with it, share it also!
 09-16-2014, 03:57 AM #9 dschika Member   Location: Zurich Join Date: Mar 2010 Posts: 56 I also use vcftools a lot, but, because of the reasons you mentioned, it is not a good solution for pi calculation in this case. Unfortunately, I have no experience with PopGenome. I also searched a bit, but could not find an existing tool for this task. I implemented now the Begun method (with c(c-1)) in the denominator of the nominator of the pi formula) in a python script using a vcf file and a 'coverage' file generated with bedtools for each individual. Those files report the coverage of each position to a given reference. From the bedtools files I can calculate how many bases in total were covered by c individuals and from the vcf file I get the variants with coverage c and the number of j (equals the number of 1's) in it. With this approach, I do not use missing data or uncovered positions.
 09-16-2014, 04:54 AM #10 Dmelnet Junior Member   Location: Barcelona, Spain Join Date: Sep 2014 Posts: 5 Me too! I cannot belive it! I have done exactly the same in perl!!! Well... a slightly different implementation calculating a pi per site that is stored in the INFO field of the vcf. Then, it can be used to calculate a pi per window (summation of pi per site and then division by the number of positions analysed). I was afraid I was wasting my time if this was already implemented somewhere else... but I have specific filtering criteria for the vcf calls and I also want to calculate other statistics such as Fst, watterson's theta, and all that stuff... and I have to finish all this in a couple of weeks!
09-16-2014, 05:19 AM   #11
dschika
Member

Location: Zurich

Join Date: Mar 2010
Posts: 56

Sounds good!

What exactly is your divisor in
Quote:
 (summation of pi per site and then division by the number of positions analysed)
?

The number of variant bases or a window size of e.g. 1000 bp?
If you divide by a certain window size: Do you adjust somewhere for different coverages in the non-variant sites?

This was one of my problems, as I had different coverages over my sample and could not assume that all non-variant sites had same coverage. I mean, if I have a position that is covered by 5 of 20 individuals, I cannot assume that the remaining 15 individuals all show the reference nucleotide at that position. Therefore, I used the Begun method which takes coverages into account.

 09-16-2014, 06:35 AM #12 Dmelnet Junior Member   Location: Barcelona, Spain Join Date: Sep 2014 Posts: 5 my calculation of pi per site is the following: sub PixSite(){ my \$c=\$_[0]; my \$j=\$_[1]; my \$pi=0; \$pi=(2*\$j*(\$c-\$j))/(\$c*(\$c-1)); return \$pi; } where c is the number of covered positions (e.g 15 out of 20 individuals would give 30 alleles at this site, it corresponds to AN field in GATK INFO column of your vcf file) and j is the number of alternate alleles (AC field in the GATK INFO column of your vcf file, e.g if 5 individuals are homozygous for the alternate allele, j would be 10). This comes from a small rearrangement of Begun's formula and from the definition itself. Once I have a pi per site adjusted for different coverages, I can calculate average pi per window. For doing this, I count the number of analysed positions in the window of 1000bp that pass certain filtering criteria taking into account both variant and invariant sites (e.g 990 out of 1000bp have good coverage in 5 or more genomes, and are not part of repetitive elements, and and and...). Then I divide the summatory of pi per site by the number of analysed positions in the window (990). It is not necessary to adjust for the heterogeneous coverage in the invariant sites because pi in those sites is 0 and it doesn't matter the value of the denominator c(c-1). I hope it helps! By the way, are you working on drosophila?
09-16-2014, 07:39 AM   #13
dschika
Member

Location: Zurich

Join Date: Mar 2010
Posts: 56

Quote:
 It is not necessary to adjust for the heterogeneous coverage in the invariant sites because pi in those sites is 0 and it doesn't matter the value of the denominator c(c-1).
Yes, pi is zero in those regions. But the coverages of invariant sites are important (as far as I understood it), as the coverage of the invariant sites influences the total amount of possible pairwise comparisions. Looking again at the denominator of the Begun method, they explain that:
Quote:
 n_c is the number of aligned base pairs in the genomic region
which I interpreted to be coverage depended. Maybe I miss something here?

Quote:
 By the way, are you working on drosophila?
I work with plants, mainly a conifer species and crops.

Have you published your complete script already somewhere? Or do you plan to do so?

 09-16-2014, 07:47 AM #14 Dmelnet Junior Member   Location: Barcelona, Spain Join Date: Sep 2014 Posts: 5 If I interpreted correctly the formula, you can expand the summatories and reorganize them such that you get what I posted above, where the coverage in invariant sites does not influence the final result. I did the calculations by hand for a few examples and it worked... When I finish editing the code I can upload it somewhere so that other people can use it (at their own risk). The paper still needs a lot of work to get published but hopefully it will be available in a mid-term future ;-) I will leave the lab where I am currently doing research on December and I will defend my thesis on July 2015, so I will try to have something more or less finished by then...
 01-21-2015, 02:37 PM #15 jsweagley Junior Member   Location: Minneapolis Join Date: Jan 2015 Posts: 1 I have used the formula suggested by Travis Collier to obtain Pi values on a per-site basis from a vcf in Python. I now want to filter the sites and calculate mean and median pi values in windows, similar to what Dmelnet did. I was wondering if anybody has any suggestions about how to efficiently obtain these windowed results. I am new to programming and bioinformatics, so your help is greatly appreciated. Also, I don't know if anyone here is analyzing multiple populations, or would like to calculate Dxy, but I think this formula works using the variables stored for the pi calculations and will give you per-site Dxy between two populations. ((n1*n2)-sum_i(J1_i*J2_i))/(n1*n2) for population 1 and population 2 where J_i is the allele count (# of 0,1,2, or 3's) at a given site in the vcf for either population. Code: if n_seq_POP_1 > 0 and n_seq_POP_2 >0 : POP_1_POP_2_dxy=((n_seq_POP_1 * n_seq_POP_2)- ((j0_POP_1 * j0_POP_2)+(j1_POP_1 * j1_POP_2)+(j2_POP_1 * j2_POP_2)+(j3_POP_1*j3_POP_2)))/(n_seq_POP_1*n_seq_POP_2) Thank you, Jimmy
 06-11-2015, 01:31 AM #16 ymc Senior Member   Location: Hong Kong Join Date: Mar 2010 Posts: 497 How do we calculate nucleotide diversity when there are pairwise insertions/deletions ?
06-11-2015, 02:13 AM   #17
travc
Junior Member

Location: Davis, CA

Join Date: Aug 2013
Posts: 4

Quote:
 Originally Posted by ymc How do we calculate nucleotide diversity when there are pairwise insertions/deletions ?
I think the most standard approach is to just ignore them (only use biallelic SNPs). This seem good enough for most purposes.

I'm not a all sure how one would properly incorporate them. I suppose it depends on what exactly you are trying to really measure.
One simple approach would be to treat both SNPs and indels as generic markers/alleles. You wouldn't be calculating "nucleotide diversity" as such, more like "allelic diversity". One potential complexity is the different mutation rates, which could mess up estimating stuff like rates of divergence using the metric.

BTW: No one really cares about nucleotide diversity... It is easy to measure and correlated with or a proxy for stuff we do care about though.

 Tags nucleotide diversity, snp, snp data, snv