
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
How to calculate nucleotide diversity (pi and theta) using NextGen Data?  Genomics101  Bioinformatics  0  08202013 12:06 PM 
How to calculate the concordance of NGS calls to SNP array data  yfwang  Bioinformatics  1  04302013 09:58 PM 
The way to normalize RNASEQ coverage data for multiple samples?  puggie  RNA Sequencing  8  04092013 04:23 AM 
How do I extract partial sequence data (Fasta) from multiple hits in NCBI nucleotide?  andtill  Bioinformatics  1  11092012 01:49 PM 
SNP base calling for multiple samples  shuang  Bioinformatics  2  09072011 03:06 PM 

Thread Tools 
09122013, 01:32 AM  #1 
Junior Member
Location: Denmarrk, Aalborg Join Date: Jan 2009
Posts: 3

How to calculate nucleotide diversity from SNP data from multiple samples
Hi all,
I have a collection of SNP data from a tetraploid population, and I would like to calculate the nucleotide diversity. The generel forumula is: where xi and xj are the respective frequencies of the ith and jth sequences, (PI)ij is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample. This is fairly simple if we have sequence data (and not SNP data). If we have these 3 sequences in a (very small) population, and we have sampled them each 1 time, we need to compare 1vs2, 1vs3, and 2vs3 >1 ATGCGTTTTT >2 ATGGGTTTTT >3 ATGCGTTTTA We now get: (PI)12 = 1/10 (they differ at pos. 4) (PI)13 = 1/10 (they differ at pos. 10) (PI)23 = 2/10 (they differ at pos. 4 and 10) n = 3 The respective frequencies are always 1/3 in this case this means that: PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) ) => PI = 2*(1/3^2)*(1/10+1/10+2/10) => PI = (2/9)*(2/5) => PI = 4/45 = 0.0888 I have a (VERY small popoluation of 2), and I have measured 2 SNPs: Individual 1: a G/C SNP at position 3. Individual 2: a T/A SNP at position 10. Lets say the reference sequence is: >ref ATGCGTTTTT Q1: Now I want to calculate the nucleotide diversity, would it then be correct to state that I have actually 4 sequences (4 alleles), meaning that I can calculate PI as follows: Based on the SNP information: >1 Individual 1  allele 1 ATGCGTTTTT >2 Individual 1  allele 2 ATCCGTTTTT >3 Individual 2  allele 1 ATGCGTTTTT >4 Individual 2  allele 2 ATGCGTTTTA We now get: (PI)12 = 1/10 (they differ at pos. 3) (PI)13 = 0 (they do not differ) (PI)14 = 1/10 (they differ at pos. 10) (PI)23 = 1/10 (they differ at pos. 3) (PI)24 = 2/10 (they differ at pos. 3 and 10) (PI)34 = 1/10 (they differ at pos. 10) n = 4 The respective frequencies are always 1/4, and can be moved out of the parenthesis and squared (see above). in this case PI is: PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) ) => PI = 2*(1/4^2)*(1/10+0+1/10+1/10+2/10+1/10) => PI = (2/16)*(5/10) => PI = 1/16 The question is whether this splitting of SNP information into alleles, and hence sequences is correct? Now, as I stated I have a tetraploid population, and I therefore plan to do the same, but expand each SNP into 4 alleles: I have a (VERY small population of 2), and I have measured 2 SNPs: Individual 1: a G/G/C/C SNP at position 3. Individual 2: a T/T/T/A SNP at position 10. Lets say the reference sequence is still: >ref ATGCGTTTTT Based on the SNP information: >1 Individual 1  allele 1 ATGCGTTTTT >2 Individual 1  allele 2 ATGCGTTTTT >3 Individual 1  allele 3 ATCCGTTTTT >4 Individual 1  allele 4 ATCCGTTTTT >5 Individual 2  allele 1 ATGCGTTTTT >6 Individual 2  allele 2 ATGCGTTTTT >7 Individual 2  allele 3 ATGCGTTTTT >8 Individual 2  allele 4 ATGCGTTTTA n = 8 (PI)12 = 0 (they do not differ) (PI)13 = 1/10 (they differ at pos. 3) (PI)14 = 1/10 (they differ at pos. 3) (PI)15 = 0 (they do not differ) (PI)16 = 0 (they do not differ) (PI)17 = 0 (they do not differ) (PI)18 = 1/10 (they differ at pos. 10) (PI)23 = 1/10 (they differ at pos. 3) (PI)24 = 1/10 (they differ at pos. 3) (PI)25 = 0 (they do not differ) (PI)26 = 0 (they do not differ) (PI)27 = 0 (they do not differ) (PI)28 = 1/10 (they differ at pos. 10) (PI)34 = 0 (they do not differ) (PI)35 = 1/10 (they differ at pos. 3) (PI)36 = 1/10 (they differ at pos. 3) (PI)37 = 1/10 (they differ at pos. 3) (PI)38 = 2/10 (they differ at pos. 3 and 10) (PI)45 = 1/10 (they differ at pos. 3) (PI)46 = 1/10 (they differ at pos. 3) (PI)47 = 1/10 (they differ at pos. 3) (PI)48 = 2/10 (they differ at pos. 3 and 10) (PI)56 = 0 (they do not differ) (PI)57 = 0 (they do not differ) (PI)58 = 1/10 (they differ at pos. 10) (PI)67 = 0 (they do not differ) (PI)68 = 1/10 (they differ at pos. 10) (PI)78 = 1/10 (they differ at pos. 10) Now, the respective frequencies are always 1/8, and can be moved out of the parenthesis and squared (see above). in this case PI is: SUM( PI(XJ) ) = 19/10 PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) ) => PI = 2*(1/8^2)*(19/10) => PI = (1/32)*(19/10) => PI = 19/320 = 0.059375 The question is (again) whether this splitting of SNP information into alleles, and hence sequences is correct? 
11122013, 02:50 AM  #2 
Junior Member
Location: Italy Join Date: Nov 2013
Posts: 1

I think there is a mistake in your formula for pi, you are missing a factor n/n1, which is negligible when n is large and often omitted, but this is not your case. The nucleotide diversity in your 1st example should be equal to 2/15 (not 4/45, total 4 differences out of 30 comparisons=4/30=2/15).

11192013, 11:34 AM  #3 
Member
Location: Austin, TX Join Date: Sep 2011
Posts: 20

I am doing a very similar analysis. However, my SNP data is based off RAD data and the number of individuals genotyped at any given base may change (i.e., the number of "chromosomes" considered per each site). Therefore, using the n/(n1) factor can't work for my data; instead, I am going through each base and calculating pairwise differences, then averaging over my arbitrary window size.
For example, if I have the following five sites, the first two may only be genotyped for four 'chromosomes' and the last three for six. site: 1 2 3 4 5 chr1 A A T G C chr2 A A T G C chr3 A A T C C chr4 A T T C C chr5 [ ][ ]T C G chr6 [ ][ ]A C G site 1: 0 pairwise differences (4 chr sampled) site 2: 3/6 differences (4 chr sampled) site 3: 5/15 (6 chr sampled) site 4: 8/15 (6 sampled) site 5: 8/15 (6 sampled) pi = 0.38 for this fivebase window. Another thing that I figured out while working with pi and RAD data is that you have to be really careful if you're using a vcf file! vcfs obviously (usually) only contain variants; however, pi must consider the number of GENOTYPED bases, INCLUDING invariants! Other tools such as vcftools only use the pi positional data and assume all bases not included in the vcf are invariants. This will give you a biased estimate. As far as your specific questions, I think that yes, you are on the right track. I don't think that pi cares if your data is diploid or tetraploid; it simply samples the number of pairwise differences among chromosomes considered, correct? The only real difference between my example here and yours above are that you are considering differences among stretches of bases and I am walking along the chromosome. I think each method should give you the same answer as long as you have the same 'n' along the stretch of bases (and, importantly, you use the n/(n1) factor as mentioned by bondsergy). I don't, therefore the basebybase works best for me. You might also find this blog post useful: http://binhe.org/2011/12/29/calculat...mationmethod/ I'm not sure this will help you, but I'd like to get it out on the forums in case it helps someone. 
02232014, 11:48 PM  #4 
Junior Member
Location: Davis, CA Join Date: Aug 2013
Posts: 4

I thought I'd offer an improvement.
sum( 2j(nj) / n(n1) ) only works for biallelic loci, but generalizing is easy. If you take a list of allele counts (Ji values) INCLUDING THE MAJOR ALLELE, sum( sum_i( Ji (nJi) ) / (n(n1)) ) ) / L where L is the number of sites/positions compared. (BTW: sum_i is read as "sum over i", and Ji should really be "J sub i") I haven't rigorously done the proof, but it intuitively makes sense and agrees with the value you get by exhaustively looking at all the pair values. So I'm pretty certain it is correct. Ex: 3 A, 2 C, 2 G = (12+10+10)/42 = 16/21 Some python testing code: Code:
#!/usr/bin/python from collections import Counter import itertools def calcPi(seq_list, transposed=True): """ seq_list is an iteratable of iterables containing alleles by [position][seq_number] if transposed=False, then seq_list is by [seq_number][position] this format is slightly less efficient """ if( transposed == False ): seq_list = itertools.izip(*seq_list) num_sites = 0 sum_pi = 0 for allele_list in seq_list: counts = Counter(allele_list) n = sum(counts.values()) num_sites += 1 sum_pi += sum(( j*(nj) for j in counts.values() )) / (n*(n1.0)) return sum_pi / float(num_sites) # sequence pi is just mean of perposition pi values # Test it! import random import time LEN = 1000 NUM_SEQ = 10 NUM_ITERATIONS = 10 NUCLEOTIDES = ['A','T','C','G'] NUCLEOTIDES.extend(['A']*8) # overrepresent 'A' to make sequences more similar #random.seed(0) CMP_EPSILON = 1e10 # how close do results have to be to call them equal tics = 0 exhaustive_tics = 0 for _ in range(NUM_ITERATIONS): # generate NUM_SEQ random sequences of length LEN (trasposed so it is by [position][seq_number] At = [ [ random.choice(NUCLEOTIDES) for _ in range(NUM_SEQ) ] for _ in range(LEN) ] tic = time.clock() seq_pi = calcPi(At, transposed=True) tics += time.clock()tic # exhaustive (allpairs) persequence method A = map(list, itertools.izip(*At)) # transpose At so it is [seq_num][pos] tic = time.clock() pair_pi = [] for i in range(len(A)): for j in range(i): t = 0 d = 0 for k in range(len(A[i])): t += 1 if( A[i][k] != A[j][k] ): d += 1 pair_pi.append( 2.0*((1.0/float(len(A)))**2)*(d/float(t)) ) #print i,j, A[i],A[j], d,t, pair_pi[1] seq_pi_exhaustive = (len(A)/(len(A)1.0))*sum(pair_pi) exhaustive_tics += time.clock()tic # output if( abs(seq_piseq_pi_exhaustive) > CMP_EPSILON ): print "MISMATCH", seq_pi,'!=', seq_pi_exhaustive print NUM_ITERATIONS, "cycles" print "calcuation took",tics print "exhaustive took",exhaustive_tics 
03302014, 05:22 AM  #5 
Junior Member
Location: Indonesia Join Date: Mar 2014
Posts: 1

Adenin base insertion
Hi All,
Iam Irma, new comer from Bogor Agriculture University, Indonesia. I have 180 DNA sequence samples from duck population. When i analyze the data, why insertion of Adenin base always uniform in all of DNA samples? Another case is, when there are C to A mutation, in another place we found A deletion. Its like an equilibrium to balancq A composition. Any one who can explain this phenomenon? Thank you All 
09092014, 03:32 AM  #6 
Junior Member
Location: Barcelona, Spain Join Date: Sep 2014
Posts: 5

unbiased nucleotide diversity estimation from SNP data
Thanks for all your useful discussion about the topic!
The only doubt I have regarding your formulas is a small discrepancy in the denominator of the base pair summation method. Both Travis Collier and Bin He propose the following formula: 2j(nj)/(n(n1)) here and in the blog entry mentioned above (http://binhe.org/2011/12/29/calculat...mationmethod/). However, Begun et al., 2007 uses a formula that can be transformed to 2j(nj)/(n(nj)) (if I didn't make any mistake in the process...). I don't understand where this discrepancy comes from and which formula is the correct one... Another point that is not clear for me is if j reffers to (i)the count of the minor allele, (ii) the count of the alternative allele (either major or minor) or (iii) the count of the derived allele (if ancestral state is available). I could not find any publication explicitly showing the applicability of the base pair summation method for estimating pi... Could you find any? or have you published your results so that I can cite you? The purpose of my analysis is to accurately estimate nucleotide diversity from a cohort of 43 genomes (in fact, two populations of 16 and 27 genomes). Thank you very much! Begun's paper: (http://www.plosbiology.org/article/i...bio.0050310#s3) 
09162014, 01:01 AM  #7  
Member
Location: Zurich Join Date: Mar 2010
Posts: 56

Hi Dmelnet,
I think there is a typo in the Begun paper. If you compare the first part of the formular with the second, I think it should be 2j(nj)/(n(n1)). Quote:


09162014, 02:43 AM  #8 
Junior Member
Location: Barcelona, Spain Join Date: Sep 2014
Posts: 5

typo found
Thank you dschika!
I wanted to use vcf tools for calculating pi but I have realized that, not only treates missing positions in the vcf file as invariant sites but it also treats missing genotypes as invariant sites (hom ref)! I think the correct way would be using the total number of genotyped alleles per position instead of just the total number of alleles per position (genotyped or missing data). Do you know if this is already implemented in any other tool? I will have a look at this: "PopGenome: An Efficient Swiss Army Knife for Population Genomic Analyses in R" (http://mbe.oxfordjournals.org/conten.../molbev.msu136) and I will let you know. If you have experience with it, share it also! 
09162014, 03:57 AM  #9 
Member
Location: Zurich Join Date: Mar 2010
Posts: 56

I also use vcftools a lot, but, because of the reasons you mentioned, it is not a good solution for pi calculation in this case.
Unfortunately, I have no experience with PopGenome. I also searched a bit, but could not find an existing tool for this task. I implemented now the Begun method (with c(c1)) in the denominator of the nominator of the pi formula) in a python script using a vcf file and a 'coverage' file generated with bedtools for each individual. Those files report the coverage of each position to a given reference. From the bedtools files I can calculate how many bases in total were covered by c individuals and from the vcf file I get the variants with coverage c and the number of j (equals the number of 1's) in it. With this approach, I do not use missing data or uncovered positions. 
09162014, 04:54 AM  #10 
Junior Member
Location: Barcelona, Spain Join Date: Sep 2014
Posts: 5

Me too!
I cannot belive it! I have done exactly the same in perl!!! Well... a slightly different implementation calculating a pi per site that is stored in the INFO field of the vcf. Then, it can be used to calculate a pi per window (summation of pi per site and then division by the number of positions analysed).
I was afraid I was wasting my time if this was already implemented somewhere else... but I have specific filtering criteria for the vcf calls and I also want to calculate other statistics such as Fst, watterson's theta, and all that stuff... and I have to finish all this in a couple of weeks! 
09162014, 05:19 AM  #11  
Member
Location: Zurich Join Date: Mar 2010
Posts: 56

Sounds good! What exactly is your divisor in Quote:
The number of variant bases or a window size of e.g. 1000 bp? If you divide by a certain window size: Do you adjust somewhere for different coverages in the nonvariant sites? This was one of my problems, as I had different coverages over my sample and could not assume that all nonvariant sites had same coverage. I mean, if I have a position that is covered by 5 of 20 individuals, I cannot assume that the remaining 15 individuals all show the reference nucleotide at that position. Therefore, I used the Begun method which takes coverages into account. 

09162014, 06:35 AM  #12 
Junior Member
Location: Barcelona, Spain Join Date: Sep 2014
Posts: 5

my calculation of pi per site is the following:
sub PixSite(){ my $c=$_[0]; my $j=$_[1]; my $pi=0; $pi=(2*$j*($c$j))/($c*($c1)); return $pi; } where c is the number of covered positions (e.g 15 out of 20 individuals would give 30 alleles at this site, it corresponds to AN field in GATK INFO column of your vcf file) and j is the number of alternate alleles (AC field in the GATK INFO column of your vcf file, e.g if 5 individuals are homozygous for the alternate allele, j would be 10). This comes from a small rearrangement of Begun's formula and from the definition itself. Once I have a pi per site adjusted for different coverages, I can calculate average pi per window. For doing this, I count the number of analysed positions in the window of 1000bp that pass certain filtering criteria taking into account both variant and invariant sites (e.g 990 out of 1000bp have good coverage in 5 or more genomes, and are not part of repetitive elements, and and and...). Then I divide the summatory of pi per site by the number of analysed positions in the window (990). It is not necessary to adjust for the heterogeneous coverage in the invariant sites because pi in those sites is 0 and it doesn't matter the value of the denominator c(c1). I hope it helps! By the way, are you working on drosophila? 
09162014, 07:39 AM  #13  
Member
Location: Zurich Join Date: Mar 2010
Posts: 56

Quote:
Quote:
Quote:
Have you published your complete script already somewhere? Or do you plan to do so? 

09162014, 07:47 AM  #14 
Junior Member
Location: Barcelona, Spain Join Date: Sep 2014
Posts: 5

If I interpreted correctly the formula, you can expand the summatories and reorganize them such that you get what I posted above, where the coverage in invariant sites does not influence the final result. I did the calculations by hand for a few examples and it worked...
When I finish editing the code I can upload it somewhere so that other people can use it (at their own risk). The paper still needs a lot of work to get published but hopefully it will be available in a midterm future ;) I will leave the lab where I am currently doing research on December and I will defend my thesis on July 2015, so I will try to have something more or less finished by then... 
01212015, 02:37 PM  #15 
Junior Member
Location: Minneapolis Join Date: Jan 2015
Posts: 1

I have used the formula suggested by Travis Collier to obtain Pi values on a persite basis from a vcf in Python. I now want to filter the sites and calculate mean and median pi values in windows, similar to what Dmelnet did. I was wondering if anybody has any suggestions about how to efficiently obtain these windowed results. I am new to programming and bioinformatics, so your help is greatly appreciated.
Also, I don't know if anyone here is analyzing multiple populations, or would like to calculate Dxy, but I think this formula works using the variables stored for the pi calculations and will give you persite Dxy between two populations. ((n1*n2)sum_i(J1_i*J2_i))/(n1*n2) for population 1 and population 2 where J_i is the allele count (# of 0,1,2, or 3's) at a given site in the vcf for either population. Code:
if n_seq_POP_1 > 0 and n_seq_POP_2 >0 : POP_1_POP_2_dxy=((n_seq_POP_1 * n_seq_POP_2) ((j0_POP_1 * j0_POP_2)+(j1_POP_1 * j1_POP_2)+(j2_POP_1 * j2_POP_2)+(j3_POP_1*j3_POP_2)))/(n_seq_POP_1*n_seq_POP_2) Jimmy 
06112015, 01:31 AM  #16 
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498

How do we calculate nucleotide diversity when there are pairwise insertions/deletions ?

06112015, 02:13 AM  #17  
Junior Member
Location: Davis, CA Join Date: Aug 2013
Posts: 4

Quote:
I'm not a all sure how one would properly incorporate them. I suppose it depends on what exactly you are trying to really measure. One simple approach would be to treat both SNPs and indels as generic markers/alleles. You wouldn't be calculating "nucleotide diversity" as such, more like "allelic diversity". One potential complexity is the different mutation rates, which could mess up estimating stuff like rates of divergence using the metric. BTW: No one really cares about nucleotide diversity... It is easy to measure and correlated with or a proxy for stuff we do care about though. 

Tags 
nucleotide diversity, snp, snp data, snv 
Thread Tools  

