SEQanswers

Old 10-15-2013, 12:20 PM   #1
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130
From VCF to Fst

Hello,

Has anyone here tackled the problem of calculating Fst for different populations when the variants for each population are in a VCF file?

That's sort of the stage I am at. I have a reference genome, I mapped reads from different populations to it and called variants with FreeBayes, and now I am not sure how to construct a phylogeny or calculate Fst.

Anyone?

Thank you.
Old 11-25-2013, 12:58 PM   #2
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

If you are looking for "outlier" SNPs, I like to use BayeScan. Otherwise, vcftools calculates Fst in windows or globally and works nicely.
Old 11-25-2013, 01:11 PM   #3
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

For Fst, you need both intra-population and inter-population variation. If I have a VCF file showing intrapop variation, how do I feed it the other VCF files for the interpop comparison?

The tool looks promising and can even calculate Tajima's D, but when I look at the output, every D value is -nan, so I am not sure what that means.
Old 11-25-2013, 01:24 PM   #4
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

So, it's been a minute since I've used vcftools fst, but I'll try to remember what to do. I'm not sure what you mean by a VCF "showing intrapop" -- does this mean that you have one VCF per pop? For vcftools, you can feed it a VCF with ALL individuals from ALL pops, and then you tell it which pop is which. For example, pop1_members.pop is simply a list of the individuals in the VCF that belong to pop1, separated by newlines.

For example:
Code:
vcftools --vcf two_pops.vcf --weir-fst-pop pop1_members.pop --weir-fst-pop pop2_members.pop
vcftools is great for many things and it does a lot of stuff. One note of caution about Tajima's D is that vcftools assumes you are doing full-genome resequencing, not subsequencing like RAD. Not sure what you're using, but that's something I recently discovered the hard way.
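A minimal Python sketch of how the population files for the command above could be prepared. The sample IDs and the sample-to-population mapping are hypothetical placeholders; the IDs are assumed to match the sample column names in the VCF header. Only the two flags already shown above (--vcf and --weir-fst-pop) are used, and the helper names are made up for illustration.

```python
def pop_file_contents(sample_to_pop):
    """Return {filename: text}, one file per population, one sample ID per line."""
    by_pop = {}
    for sample, pop in sample_to_pop.items():
        by_pop.setdefault(pop, []).append(sample)
    return {
        f"{pop}_members.pop": "\n".join(samples) + "\n"
        for pop, samples in by_pop.items()
    }


def fst_command(vcf, pop_files):
    """Build the vcftools argument list, adding one --weir-fst-pop per population file."""
    cmd = ["vcftools", "--vcf", str(vcf)]
    for pop_file in pop_files:
        cmd += ["--weir-fst-pop", str(pop_file)]
    return cmd


# Hypothetical sample IDs; replace with the names from your own VCF header.
mapping = {"ind01": "pop1", "ind02": "pop1", "ind03": "pop2", "ind04": "pop2"}
files = pop_file_contents(mapping)
print(" ".join(fst_command("two_pops.vcf", sorted(files))))
```

Writing each entry of `files` to disk (for example with `pathlib.Path(name).write_text(text)`) and running the printed command should reproduce the call shown above.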
Old 11-25-2013, 01:29 PM   #5
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

Well, I am doing full-genome resequencing. The problem is that my population is a population of spores... 800,000 spores... so I can't list them all. My VCF file contains the variants of those 800,000 spores, and I have a few other VCF files containing variants of additional, distinct populations of spores.

Is there any way to do stats on this? LD, Tajima's D, Fst? I have been looking for an answer to this for months now... everything is based on sequence alignment...
Old 11-25-2013, 01:41 PM   #6
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

Oh my. Did you sequence the spores as a pool, then? Or do you really have 800k individual sequences? Do you have allele frequencies? I'm pretty sure there is an answer to your dilemma, because Fst, LD and TajD/pi are based off frequencies, correct?
Old 11-25-2013, 01:44 PM   #7
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

Yes, these are all about allele frequencies, and my sample is pooled; there is no other way to do it since they are unicellular. The VCFs have all alleles and all the allele frequencies.

Can you think of any way to feed this into Fst? How are your individuals tagged in your VCF files?
Old 11-25-2013, 01:57 PM   #8
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

I'm not familiar with any prepackaged calculators that use frequencies directly, though I'm sure they exist. However, it shouldn't be terribly difficult to calculate a first-pass value on your own.

Maybe check out this website -- http://johnhawks.net/explainer/laboratory/measuring-fst . It has step-by-step directions on how to do it based on the frequencies that you already have.
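A first-pass calculation from frequencies alone could look like the Python sketch below. It uses one common textbook formulation, Fst = (H_T - H_S) / H_T, which may differ in detail from the linked page. The assumptions are mine: a biallelic locus, equal weight for every population, 2pq as the expected heterozygosity, and no correction for sample size or pooled-sequencing noise. The frequencies in the example call are made up.

```python
import math


def expected_het(p):
    """Expected heterozygosity 2pq for a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_from_freqs(freqs):
    """Fst = (H_T - H_S) / H_T from one allele frequency per population."""
    p_bar = sum(freqs) / len(freqs)       # frequency in the combined population
    h_t = expected_het(p_bar)             # total expected heterozygosity
    if h_t == 0.0:
        return math.nan                   # locus is invariant, Fst is undefined
    h_s = sum(expected_het(p) for p in freqs) / len(freqs)  # mean within-pop value
    return (h_t - h_s) / h_t


# Hypothetical frequencies of one allele in two populations.
print(fst_from_freqs([0.2, 0.8]))
```

Identical frequencies in every population give 0, and populations fixed for different alleles give 1, which is a quick sanity check before running it over real loci.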

However, what's your goal? Are you looking for a single Fst value per population, or per some-sized window?
Old 11-25-2013, 01:58 PM   #9
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

Finally a normal example I can understand.

That would be per population, since I have resequencing data.
Old 11-25-2013, 02:03 PM   #10
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

Okay, that's much easier. By walking along the chromosome, calculating Fst and averaging over total bases I think you will escape the problems I am discovering in my locus-by-locus RAD-based pop genetics project. I'm struggling with developing a null distribution to which to compare my statistics, multiple test corrections and sliding windows. Ugh.

I'll think some more about your other stats, but I'm pretty sure it's doable.
Good luck!
Old 11-25-2013, 02:09 PM   #11
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

Yeah... my species is likely tetraploid, so heterozygosity is not 2pq, but a more complicated version of that.

Also, is it just me, or does Fst assume HWE? For me, most allele frequencies are equal to each other: when I have 2 alleles they are 50%/50%, suggesting that all individuals are heterozygous, rather than only half of them as 2pq would predict.
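To make the "more complicated version of 2pq" concrete, here is a small Python sketch of expected genotype frequencies under random mating for an arbitrary ploidy. It assumes an autotetraploid with tetrasomic inheritance and no double reduction, which is an assumption on my part and not something established in this thread. Under that model the genotype classes follow the binomial expansion of (p + q)^4.

```python
from math import comb


def genotype_freqs(p, ploidy=4):
    """Expected frequency of genotypes carrying k copies of allele A, for k = 0..ploidy."""
    q = 1.0 - p
    return [comb(ploidy, k) * p**k * q**(ploidy - k) for k in range(ploidy + 1)]


def heterozygous_fraction(p, ploidy=4):
    """Fraction of individuals expected to carry both alleles."""
    freqs = genotype_freqs(p, ploidy)
    return 1.0 - freqs[0] - freqs[-1]  # everything except the two homozygous classes


print(heterozygous_fraction(0.5, ploidy=2))  # diploid case, equals 2pq
print(heterozygous_fraction(0.5, ploidy=4))  # tetraploid case under the stated model
```

With ploidy=2 the function reduces to the familiar 2pq, so the diploid and tetraploid expectations can be compared directly at the same allele frequency.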
Old 11-25-2013, 02:13 PM   #12
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

Actually, 50%/50% says nothing about the actual *genotypes*: you could have 100 Aa dudes (which would be weird), or you could have 50 AA and 50 aa individuals, or 25 AA, 25 aa and 50 Aa... etc etc

Luckily for you most statistics (excluding, obviously, heterozygosity) only care about the allelic frequencies and not the genotypes themselves.

...Does that help, or did I misread your question?
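The point about genotypes can be checked with a few lines of Python. This sketch takes the three diploid genotype mixes mentioned above and shows that they share the same allele frequency while their observed heterozygosity differs. The function name is just for illustration.

```python
def allele_freq_and_het(n_AA, n_Aa, n_aa):
    """Return (frequency of allele A, observed heterozygosity) from diploid genotype counts."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # each AA carries two A copies, each Aa carries one
    h_obs = n_Aa / n                 # fraction of individuals that are heterozygous
    return p, h_obs


# The three scenarios from the post: all Aa, half AA and half aa, and 25/50/25.
for counts in [(0, 100, 0), (50, 0, 50), (25, 50, 25)]:
    print(counts, allele_freq_and_het(*counts))
```

All three give an allele frequency of 0.5, so a pooled sample that only reports frequencies cannot tell them apart.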
Old 11-25-2013, 02:15 PM   #13
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521

rcapper, this paper calculates Fst using RAD-Seq. Are there issues with the approach used or is your situation different?
http://www.plosgenetics.org/article/...l.pgen.1000862


Quote:
Originally Posted by rcapper
Okay, that's much easier. By walking along the chromosome, calculating Fst and averaging over total bases I think you will escape the problems I am discovering in my locus-by-locus RAD-based pop genetics project. I'm struggling with developing a null distribution to which to compare my statistics, multiple test corrections and sliding windows. Ugh.

I'll think some more about your other stats, but I'm pretty sure it's doable.
Good luck!
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 11-25-2013, 02:24 PM   #14
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

Oh, yes, I'm quite familiar with that paper. I've already used BayeScan for my Fst calculations because I am interested in the outlier loci mainly, but I am planning to script my own Fst calculator based on the Hohenlohe et al. 2010 weighted formula as well to see if I "missed" any. But, because that's a little redundant at the moment, I'm working on other stats first.

The issues I'm having are not so much calculating the stats in the first place, but in scripting the generation of the null distribution such that I can compare those stats to the expectation under neutrality. I think I'll write up a new thread about that, actually, because I haven't seen much discussion about the pros and cons of different strategies.

Last edited by rcapper; 11-25-2013 at 02:50 PM. Reason: added link
Old 11-25-2013, 03:26 PM   #15
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

Okay, so you are right: in principle, we can have AA, Aa and aa in any ratios. However, by comparing these allele frequencies between samples, I know that most alleles are present in all of my samples at 50/50, suggesting that this is simple heterozygosity and that most individuals have the same genotype, with very little variation within my population.
Old 11-25-2013, 03:29 PM   #16
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

That could be, but I'm pretty sure you can't say that with any confidence. You can go from allele freqs to HWE-expected genotypes, but that's about it... And you can't test departure from HWE without your actual genotypes (like, not the pooled samples).
Old 11-25-2013, 03:37 PM   #17
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

60% of loci are shared between all samples, meaning that out of all loci that are variable in any of my populations, 60% will be found in any other population. When you look at how the allele frequency varies between the different populations, it varies only slightly: if it's 49/51 in one population, it might be 45/55 in another, 41/59 in the 3rd, 48/52 in the 4th, and so on and so forth. I plotted it; it doesn't vary much. This means intra individual variation, not intra population variation.

So you are right, I can't do anything with HWE. Which makes me wonder whether I can actually calculate Fst for this... What if I treat the entire spore population as 1 individual? I can calculate the variation within the population as if it was within that 1 individual, and then between the 6 populations?
Old 11-25-2013, 04:47 PM   #18
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20

Not sure that I understand what you mean. So 60% of all of your loci are invariants? Or they're only variant in a single population? If that's true, as in, 4 pops are 100% 'A' and the fifth pop is 75% 'A', then that's actually okay and depending on the magnitude of the difference, that might be what you're even looking for.

It totally depends on your experiment, but you might have to demonstrate that 49/51 is statistically different from 48/52, etc etc. Those sound like the same frequencies to me...
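One rough way to ask whether two pooled frequencies differ is a two-proportion z-test on read counts, sketched below in Python. The read depths are hypothetical, and the test treats every read as an independent draw. It ignores the extra variance from sampling spores into the pool and from sequencing error, so it will tend to overstate significance. Take it as a first look under those assumptions, not a proper pooled-sequencing test.

```python
import math


def two_proportion_p_value(alt1, depth1, alt2, depth2):
    """Two-sided p-value for a difference in allele frequency between two read samples."""
    p1 = alt1 / depth1
    p2 = alt2 / depth2
    pooled = (alt1 + alt2) / (depth1 + depth2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / depth1 + 1.0 / depth2))
    if se == 0.0:
        return 1.0  # both samples fixed for the same allele, no evidence of a difference
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability


# Hypothetical example: 49 of 100 reads vs 48 of 100 reads carry the allele.
print(two_proportion_p_value(49, 100, 48, 100))
```

At depths like these, 49/51 and 48/52 are nowhere near distinguishable, which matches the intuition that they "sound like the same frequencies".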

I'm also not sure what you mean by "This means intra individual variation, not intra population variation." Sure, the within-pop freqs could be totally different (ex., 100 AAaa tetraploids in one pop, but 50 AAAA and 50 aaaa in the other) but you really can't tell those two situations apart from the data you have.

Yes, you can still calc Fst -- but if your freqs within each pop are the same as the freqs among pops then Fst will be around 0 (= no differentiation between pops.)

I don't think treating the pop as one individual makes sense, but if it helps you conceptualize it then I guess go for it...
Old 11-25-2013, 04:50 PM   #19
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130

For 60%, the allele frequencies are the same.

However, there are population-specific alleles. These should raise the Fst, right?
Old 03-08-2016, 08:59 AM   #20
clarissaboschi
Member
 
Location: US

Join Date: Apr 2010
Posts: 63

I have a question about Fst from VCFtools.

I used VCFtools to obtain Fst, but VCFtools generates 2 different Fst estimates: mean and weighted.

Which one is best to consider? In my case these 2 Fst values are very different. I asked the same question in the VCFtools forum and got no reply. I think the weighted one is better, but I got very high values with it.

thanks
Clarissa
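A toy Python sketch of why the two numbers can diverge. It assumes, and this is worth checking against the VCFtools documentation, that the "mean" estimate is the average of the per-site Fst ratios while the "weighted" estimate is the sum of per-site numerators divided by the sum of per-site denominators. The per-site components below are made up purely for illustration.

```python
def mean_and_weighted(components):
    """components: one (numerator, denominator) pair per site; returns (mean, weighted)."""
    usable = [(num, den) for num, den in components if den > 0]
    ratios = [num / den for num, den in usable]
    mean_fst = sum(ratios) / len(ratios)                                   # average of ratios
    weighted_fst = sum(n for n, _ in usable) / sum(d for _, d in usable)   # ratio of sums
    return mean_fst, weighted_fst


# Nine low-diversity sites with no differentiation plus one variable, differentiated site.
sites = [(0.0, 0.02)] * 9 + [(0.4, 0.5)]
print(mean_and_weighted(sites))
```

In this made-up case the average of ratios is about 0.08 while the ratio of sums is about 0.59, because the one highly variable site dominates the sums. A large gap between the two estimates may therefore reflect a few high-diversity sites carrying most of the signal.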

All times are GMT -8.