SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to calculate nucleotide diversity (pi and theta) using NextGen Data? Genomics101 Bioinformatics 0 08-20-2013 12:06 PM
How to calculate the concordance of NGS calls to SNP array data yfwang Bioinformatics 1 04-30-2013 09:58 PM
The way to normalize RNA-SEQ coverage data for multiple samples? puggie RNA Sequencing 8 04-09-2013 04:23 AM
How do I extract partial sequence data (Fasta) from multiple hits in NCBI nucleotide? andtill Bioinformatics 1 11-09-2012 01:49 PM
SNP base calling for multiple samples shuang Bioinformatics 2 09-07-2011 03:06 PM

Reply
 
Thread Tools
Old 09-12-2013, 01:32 AM   #1
Sonderkar
Junior Member
 
Location: Denmarrk, Aalborg

Join Date: Jan 2009
Posts: 3
Default How to calculate nucleotide diversity from SNP data from multiple samples

Hi all,
I have a collection of SNP data from a tetraploid population, and I would like to calculate the nucleotide diversity. The generel forumula is:


where xi and xj are the respective frequencies of the ith and jth sequences, (PI)ij is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample.
This is fairly simple if we have sequence data (and not SNP data).
If we have these 3 sequences in a (very small) population, and we have sampled them each 1 time, we need to compare 1vs2, 1vs3, and 2vs3

>1
ATGCGTTTTT
>2
ATGGGTTTTT
>3
ATGCGTTTTA

We now get:
(PI)12 = 1/10 (they differ at pos. 4)
(PI)13 = 1/10 (they differ at pos. 10)
(PI)23 = 2/10 (they differ at pos. 4 and 10)
n = 3
The respective frequencies are always 1/3 in this case this means that:

PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) )
=>
PI = 2*(1/3^2)*(1/10+1/10+2/10)
=>
PI = (2/9)*(2/5)
=>
PI = 4/45 = 0.0888

I have a (VERY small popoluation of 2), and I have measured 2 SNPs:

Individual 1: a G/C SNP at position 3.
In
dividual 2: a T/A SNP at position 10.
Lets say the reference sequence is:
>ref
ATGCGTTTTT

Q1: Now I want to calculate the nucleotide diversity, would it then be correct to state that I have actually 4 sequences (4 alleles), meaning that I can calculate PI as follows:

Based on the SNP information:
>1 Individual 1 - allele 1
ATGCGTTTTT
>2
Individual 1 - allele 2
ATCCGTTTTT
>
3 Individual 2 - allele 1
ATGCGTTTTT
>4 Individual 2 - allele 2
ATGCGTTTTA
We now get:
(PI)12 = 1/10 (they differ at pos. 3)
(PI)13 = 0 (they do not differ)
(PI)14 = 1/10 (they differ at pos. 10)
(PI)23 = 1/10 (they differ at pos. 3)
(PI)24 =
2/10 (they differ at pos. 3 and 10)
(PI)34 =
1/10 (they differ at pos. 10)
n = 4
The respective frequencies are always 1/4, and can be moved out of the parenthesis and squared (see above). in this case PI is:

PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) )
=>
PI = 2*(1/4^2)*(1/10+0+1/10+1/10+2/10+1/10)
=>
PI = (2/16)*(5/10)
=>
PI = 1/16

The question is whether this splitting of SNP information into alleles, and hence sequences is correct?

Now, as I stated I have a tetraploid population, and I therefore plan to do the same, but expand each SNP into 4 alleles:

I have a (VERY small population of 2), and I have measured 2 SNPs:

Individual 1: a G/G/C/C SNP at position 3.
In
dividual 2: a T/T/T/A SNP at position 10.
Lets say the reference sequence is still:
>ref
ATGCGTTTTT

Based on the SNP information:
>1 Individual 1 - allele 1
ATGCGTTTTT
>2
Individual 1 - allele 2
ATGCGTTTTT
>
3 Individual 1 - allele 3
ATCCGTTTTT
>4 Individual 1 - allele 4
ATCCGTTTTT
>5 Individual 2 - allele 1
ATGCGTTTTT
>6
Individual 2 - allele 2
ATGCGTTTTT
>7
Individual 2 - allele 3
ATGCGTTTTT
>8 Individual 2 - allele 4
ATGCGTTTTA
n = 8
(PI)12 = 0 (they do not differ)
(PI)13 =
1/10 (they differ at pos. 3)
(PI)14 =
1/10 (they differ at pos. 3)
(PI)15 = 0 (they do not differ)
(PI)16 = 0 (they do not differ)
(PI)17 = 0 (they do not differ)
(PI)18 = 1/10 (they differ at pos. 10)

(PI)23 =
1/10 (they differ at pos. 3)
(PI)24 = 1/10 (they differ at pos. 3)
(PI)25 = 0 (they do not differ)
(PI)26 = 0 (they do not differ)
(PI)27 = 0 (they do not differ)
(PI)28 = 1/10 (they differ at pos. 10)

(PI)34 =
0 (they do not differ)
(PI)35 =
1/10 (they differ at pos. 3)
(PI)36 =
1/10 (they differ at pos. 3)
(PI)37 =
1/10 (they differ at pos. 3)
(PI)38 = 2/10 (they differ at pos. 3 and 10)

(PI)45 = 1/10 (they differ at pos. 3)
(PI)46 =
1/10 (they differ at pos. 3)
(PI)47 =
1/10 (they differ at pos. 3)
(PI)48 = 2/10 (they differ at pos. 3 and 10)

(PI)56 =
0 (they do not differ)
(PI)57 = 0 (they do not differ)
(PI)58 = 1/10 (they differ at pos. 10)

(PI)67 = 0 (they do not differ)
(PI)68 = 1/10 (they differ at pos. 10)


(PI)78 = 1/10 (they differ at pos. 10)

Now, the respective frequencies are always 1/8, and can be moved out of the parenthesis and squared (see above). in this case PI is:

SUM( PI(XJ) ) = 19/10

PI = 2* ( (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(1/10) + (1/3)*(1/3)*(2/10) )
=>
PI = 2*(1/8^2)*(19/10)
=>
PI = (1/32)*(19/10)
=>
PI = 19/320 = 0.059375

The question is (again) whether this splitting of SNP information into alleles, and hence sequences is correct?
Sonderkar is offline   Reply With Quote
Old 11-12-2013, 02:50 AM   #2
bondsergy
Junior Member
 
Location: Italy

Join Date: Nov 2013
Posts: 1
Default

I think there is a mistake in your formula for pi, you are missing a factor n/n-1, which is negligible when n is large and often omitted, but this is not your case. The nucleotide diversity in your 1st example should be equal to 2/15 (not 4/45, total 4 differences out of 30 comparisons=4/30=2/15).
bondsergy is offline   Reply With Quote
Old 11-19-2013, 11:34 AM   #3
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20
Default

I am doing a very similar analysis. However, my SNP data is based off RAD data and the number of individuals genotyped at any given base may change (i.e., the number of "chromosomes" considered per each site). Therefore, using the n/(n-1) factor can't work for my data; instead, I am going through each base and calculating pairwise differences, then averaging over my arbitrary window size.

For example, if I have the following five sites, the first two may only be genotyped for four 'chromosomes' and the last three for six.

site: 1 2 3 4 5
chr1 A A T G C
chr2 A A T G C
chr3 A A T C C
chr4 A T T C C
chr5 [ ][ ]T C G
chr6 [ ][ ]A C G

site 1: 0 pairwise differences (4 chr sampled)
site 2: 3/6 differences (4 chr sampled)
site 3: 5/15 (6 chr sampled)
site 4: 8/15 (6 sampled)
site 5: 8/15 (6 sampled)

pi = 0.38 for this five-base window.

Another thing that I figured out while working with pi and RAD data is that you have to be really careful if you're using a vcf file! vcfs obviously (usually) only contain variants; however, pi must consider the number of GENOTYPED bases, INCLUDING invariants! Other tools such as vcftools only use the pi positional data and assume all bases not included in the vcf are invariants. This will give you a biased estimate.

As far as your specific questions, I think that yes, you are on the right track. I don't think that pi cares if your data is diploid or tetraploid; it simply samples the number of pairwise differences among chromosomes considered, correct?

The only real difference between my example here and yours above are that you are considering differences among stretches of bases and I am walking along the chromosome. I think each method should give you the same answer as long as you have the same 'n' along the stretch of bases (and, importantly, you use the n/(n-1) factor as mentioned by bondsergy). I don't, therefore the base-by-base works best for me.

You might also find this blog post useful:
http://binhe.org/2011/12/29/calculat...mation-method/

I'm not sure this will help you, but I'd like to get it out on the forums in case it helps someone.
rcapper is offline   Reply With Quote
Old 02-23-2014, 11:48 PM   #4
travc
Junior Member
 
Location: Davis, CA

Join Date: Aug 2013
Posts: 4
Default

I thought I'd offer an improvement.
sum( 2j(n-j) / n(n-1) ) only works for bi-allelic loci, but generalizing is easy.

If you take a list of allele counts (Ji values) INCLUDING THE MAJOR ALLELE,
sum( sum_i( Ji (n-Ji) ) / (n(n-1)) ) ) / L
where L is the number of sites/positions compared.
(BTW: sum_i is read as "sum over i", and Ji should really be "J sub i")

I haven't rigorously done the proof, but it intuitively makes sense and agrees with the value you get by exhaustively looking at all the pair values. So I'm pretty certain it is correct.

Ex:
3 A, 2 C, 2 G
= (12+10+10)/42 = 16/21

Some python testing code:
Code:
#!/usr/bin/python
from collections import Counter
import itertools

def calcPi(seq_list, transposed=True):
    """
    seq_list is an iteratable of iterables containing alleles by [position][seq_number]
    if transposed=False,
        then seq_list is by [seq_number][position]
        this format is slightly less efficient
    """
    if( transposed == False ):
        seq_list = itertools.izip(*seq_list)
    num_sites = 0
    sum_pi = 0
    for allele_list in seq_list:
        counts = Counter(allele_list)
        n = sum(counts.values())
        num_sites += 1
        sum_pi += sum(( j*(n-j) for j in counts.values() )) / (n*(n-1.0))
    return sum_pi / float(num_sites) # sequence pi is just mean of per-position pi values

# Test it!
import random
import time

LEN = 1000
NUM_SEQ = 10
NUM_ITERATIONS = 10
NUCLEOTIDES = ['A','T','C','G']
NUCLEOTIDES.extend(['A']*8) # over-represent 'A' to make sequences more similar
#random.seed(0)
CMP_EPSILON = 1e-10 # how close do results have to be to call them equal

tics = 0
exhaustive_tics = 0

for _ in range(NUM_ITERATIONS):
    # generate NUM_SEQ random sequences of length LEN (trasposed so it is by [position][seq_number]
    At = [ [ random.choice(NUCLEOTIDES) for _ in range(NUM_SEQ) ] for _ in range(LEN) ]

    tic = time.clock()
    seq_pi = calcPi(At, transposed=True)
    tics += time.clock()-tic

    # exhaustive (all-pairs) per-sequence method
    A = map(list, itertools.izip(*At)) # transpose At so it is [seq_num][pos]
    tic = time.clock()
    pair_pi = []
    for i in range(len(A)):
        for j in range(i):
            t = 0
            d = 0
            for k in range(len(A[i])):
                t += 1
                if( A[i][k] != A[j][k] ):
                    d += 1
            pair_pi.append( 2.0*((1.0/float(len(A)))**2)*(d/float(t)) )
            #print i,j, A[i],A[j], d,t, pair_pi[-1]
    seq_pi_exhaustive = (len(A)/(len(A)-1.0))*sum(pair_pi)
    exhaustive_tics += time.clock()-tic

    # output
    if( abs(seq_pi-seq_pi_exhaustive) > CMP_EPSILON ):
        print "MISMATCH", seq_pi,'!=', seq_pi_exhaustive

print NUM_ITERATIONS, "cycles"
print "calcuation took",tics
print "exhaustive took",exhaustive_tics
travc is offline   Reply With Quote
Old 03-30-2014, 05:22 AM   #5
irma42
Junior Member
 
Location: Indonesia

Join Date: Mar 2014
Posts: 1
Smile Adenin base insertion

Hi All,

Iam Irma, new comer from Bogor Agriculture University, Indonesia. I have 180 DNA sequence samples from duck population. When i analyze the data, why insertion of Adenin base always uniform in all of DNA samples? Another case is, when there are C to A mutation, in another place we found A deletion. Its like an equilibrium to balancq A composition. Any one who can explain this phenomenon? Thank you All
irma42 is offline   Reply With Quote
Old 09-09-2014, 03:32 AM   #6
Dmelnet
Junior Member
 
Location: Barcelona, Spain

Join Date: Sep 2014
Posts: 5
Default unbiased nucleotide diversity estimation from SNP data

Thanks for all your useful discussion about the topic!
The only doubt I have regarding your formulas is a small discrepancy in the denominator of the base pair summation method.
Both Travis Collier and Bin He propose the following formula: 2j(n-j)/(n(n-1)) here and in the blog entry mentioned above (http://binhe.org/2011/12/29/calculat...mation-method/).
However, Begun et al., 2007 uses a formula that can be transformed to 2j(n-j)/(n(n-j)) (if I didn't make any mistake in the process...). I don't understand where this discrepancy comes from and which formula is the correct one...
Another point that is not clear for me is if j reffers to (i)the count of the minor allele, (ii) the count of the alternative allele (either major or minor) or (iii) the count of the derived allele (if ancestral state is available).

I could not find any publication explicitly showing the applicability of the base pair summation method for estimating pi... Could you find any? or have you published your results so that I can cite you?
The purpose of my analysis is to accurately estimate nucleotide diversity from a cohort of 43 genomes (in fact, two populations of 16 and 27 genomes).

Thank you very much!

Begun's paper: (http://www.plosbiology.org/article/i...bio.0050310#s3)
Dmelnet is offline   Reply With Quote
Old 09-16-2014, 01:01 AM   #7
dschika
Member
 
Location: Zurich

Join Date: Mar 2010
Posts: 56
Default

Hi Dmelnet,

I think there is a typo in the Begun paper. If you compare the first part of the formular with the second, I think it should be 2j(n-j)/(n(n-1)).

Quote:
I could not find any publication explicitly showing the applicability of the base pair summation method for estimating pi... Could you find any? or have you published your results so that I can cite you?
This is something I'm also interested in!
dschika is offline   Reply With Quote
Old 09-16-2014, 02:43 AM   #8
Dmelnet
Junior Member
 
Location: Barcelona, Spain

Join Date: Sep 2014
Posts: 5
Default typo found

Thank you dschika!

I wanted to use vcf tools for calculating pi but I have realized that, not only treates missing positions in the vcf file as invariant sites but it also treats missing genotypes as invariant sites (hom ref)! I think the correct way would be using the total number of genotyped alleles per position instead of just the total number of alleles per position (genotyped or missing data).
Do you know if this is already implemented in any other tool?

I will have a look at this: "PopGenome: An Efficient Swiss Army Knife for Population Genomic Analyses in R" (http://mbe.oxfordjournals.org/conten.../molbev.msu136) and I will let you know. If you have experience with it, share it also!
Dmelnet is offline   Reply With Quote
Old 09-16-2014, 03:57 AM   #9
dschika
Member
 
Location: Zurich

Join Date: Mar 2010
Posts: 56
Default

I also use vcftools a lot, but, because of the reasons you mentioned, it is not a good solution for pi calculation in this case.
Unfortunately, I have no experience with PopGenome.
I also searched a bit, but could not find an existing tool for this task.

I implemented now the Begun method (with c(c-1)) in the denominator of the nominator of the pi formula) in a python script using a vcf file and a 'coverage' file generated with bedtools for each individual. Those files report the coverage of each position to a given reference. From the bedtools files I can calculate how many bases in total were covered by c individuals and from the vcf file I get the variants with coverage c and the number of j (equals the number of 1's) in it.

With this approach, I do not use missing data or uncovered positions.
dschika is offline   Reply With Quote
Old 09-16-2014, 04:54 AM   #10
Dmelnet
Junior Member
 
Location: Barcelona, Spain

Join Date: Sep 2014
Posts: 5
Red face Me too!

I cannot belive it! I have done exactly the same in perl!!! Well... a slightly different implementation calculating a pi per site that is stored in the INFO field of the vcf. Then, it can be used to calculate a pi per window (summation of pi per site and then division by the number of positions analysed).
I was afraid I was wasting my time if this was already implemented somewhere else... but I have specific filtering criteria for the vcf calls and I also want to calculate other statistics such as Fst, watterson's theta, and all that stuff... and I have to finish all this in a couple of weeks!
Dmelnet is offline   Reply With Quote
Old 09-16-2014, 05:19 AM   #11
dschika
Member
 
Location: Zurich

Join Date: Mar 2010
Posts: 56
Default


Sounds good!

What exactly is your divisor in
Quote:
(summation of pi per site and then division by the number of positions analysed)
?

The number of variant bases or a window size of e.g. 1000 bp?
If you divide by a certain window size: Do you adjust somewhere for different coverages in the non-variant sites?

This was one of my problems, as I had different coverages over my sample and could not assume that all non-variant sites had same coverage. I mean, if I have a position that is covered by 5 of 20 individuals, I cannot assume that the remaining 15 individuals all show the reference nucleotide at that position. Therefore, I used the Begun method which takes coverages into account.
dschika is offline   Reply With Quote
Old 09-16-2014, 06:35 AM   #12
Dmelnet
Junior Member
 
Location: Barcelona, Spain

Join Date: Sep 2014
Posts: 5
Default

my calculation of pi per site is the following:

sub PixSite(){
my $c=$_[0];
my $j=$_[1];
my $pi=0;
$pi=(2*$j*($c-$j))/($c*($c-1));
return $pi;
}

where c is the number of covered positions (e.g 15 out of 20 individuals would give 30 alleles at this site, it corresponds to AN field in GATK INFO column of your vcf file) and j is the number of alternate alleles (AC field in the GATK INFO column of your vcf file, e.g if 5 individuals are homozygous for the alternate allele, j would be 10).

This comes from a small rearrangement of Begun's formula and from the definition itself.

Once I have a pi per site adjusted for different coverages, I can calculate average pi per window.

For doing this, I count the number of analysed positions in the window of 1000bp that pass certain filtering criteria taking into account both variant and invariant sites (e.g 990 out of 1000bp have good coverage in 5 or more genomes, and are not part of repetitive elements, and and and...). Then I divide the summatory of pi per site by the number of analysed positions in the window (990).
It is not necessary to adjust for the heterogeneous coverage in the invariant sites because pi in those sites is 0 and it doesn't matter the value of the denominator c(c-1).

I hope it helps!

By the way, are you working on drosophila?
Dmelnet is offline   Reply With Quote
Old 09-16-2014, 07:39 AM   #13
dschika
Member
 
Location: Zurich

Join Date: Mar 2010
Posts: 56
Default

Quote:
It is not necessary to adjust for the heterogeneous coverage in the invariant sites because pi in those sites is 0 and it doesn't matter the value of the denominator c(c-1).
Yes, pi is zero in those regions. But the coverages of invariant sites are important (as far as I understood it), as the coverage of the invariant sites influences the total amount of possible pairwise comparisions. Looking again at the denominator of the Begun method, they explain that:
Quote:
n_c is the number of aligned base pairs in the genomic region
which I interpreted to be coverage depended. Maybe I miss something here?

Quote:
By the way, are you working on drosophila?
I work with plants, mainly a conifer species and crops.

Have you published your complete script already somewhere? Or do you plan to do so?
dschika is offline   Reply With Quote
Old 09-16-2014, 07:47 AM   #14
Dmelnet
Junior Member
 
Location: Barcelona, Spain

Join Date: Sep 2014
Posts: 5
Default

If I interpreted correctly the formula, you can expand the summatories and reorganize them such that you get what I posted above, where the coverage in invariant sites does not influence the final result. I did the calculations by hand for a few examples and it worked...
When I finish editing the code I can upload it somewhere so that other people can use it (at their own risk). The paper still needs a lot of work to get published but hopefully it will be available in a mid-term future ;-)

I will leave the lab where I am currently doing research on December and I will defend my thesis on July 2015, so I will try to have something more or less finished by then...
Dmelnet is offline   Reply With Quote
Old 01-21-2015, 02:37 PM   #15
jsweagley
Junior Member
 
Location: Minneapolis

Join Date: Jan 2015
Posts: 1
Default

I have used the formula suggested by Travis Collier to obtain Pi values on a per-site basis from a vcf in Python. I now want to filter the sites and calculate mean and median pi values in windows, similar to what Dmelnet did. I was wondering if anybody has any suggestions about how to efficiently obtain these windowed results. I am new to programming and bioinformatics, so your help is greatly appreciated.

Also, I don't know if anyone here is analyzing multiple populations, or would like to calculate Dxy, but I think this formula works using the variables stored for the pi calculations and will give you per-site Dxy between two populations.
((n1*n2)-sum_i(J1_i*J2_i))/(n1*n2)
for population 1 and population 2 where J_i is the allele count (# of 0,1,2, or 3's) at a given site in the vcf for either population.
Code:
   if n_seq_POP_1 > 0 and n_seq_POP_2 >0 :
	POP_1_POP_2_dxy=((n_seq_POP_1 * n_seq_POP_2)- ((j0_POP_1 * j0_POP_2)+(j1_POP_1 * j1_POP_2)+(j2_POP_1 * j2_POP_2)+(j3_POP_1*j3_POP_2)))/(n_seq_POP_1*n_seq_POP_2)
Thank you,
Jimmy
jsweagley is offline   Reply With Quote
Old 06-11-2015, 01:31 AM   #16
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 497
Default

How do we calculate nucleotide diversity when there are pairwise insertions/deletions ?
ymc is offline   Reply With Quote
Old 06-11-2015, 02:13 AM   #17
travc
Junior Member
 
Location: Davis, CA

Join Date: Aug 2013
Posts: 4
Default

Quote:
Originally Posted by ymc View Post
How do we calculate nucleotide diversity when there are pairwise insertions/deletions ?
I think the most standard approach is to just ignore them (only use biallelic SNPs). This seem good enough for most purposes.

I'm not a all sure how one would properly incorporate them. I suppose it depends on what exactly you are trying to really measure.
One simple approach would be to treat both SNPs and indels as generic markers/alleles. You wouldn't be calculating "nucleotide diversity" as such, more like "allelic diversity". One potential complexity is the different mutation rates, which could mess up estimating stuff like rates of divergence using the metric.

BTW: No one really cares about nucleotide diversity... It is easy to measure and correlated with or a proxy for stuff we do care about though.
travc is offline   Reply With Quote
Reply

Tags
nucleotide diversity, snp, snp data, snv

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:35 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO