  • From VCF to Fst

    Hello,

    Has anyone here tackled the problem of calculating Fst for different populations when all you have is the variants of each population in a VCF file?

    That's sorta the stage I am at. I have a reference genome, I mapped reads from different populations to it, called variants with FreeBayes, and now I'm not sure how to construct a phylogeny or calculate Fst.

    Anyone?

    Thank you.

  • #2
    If you are looking for "outlier" SNPs, I like to use BayeScan. Otherwise, vcftools calculates Fst in windows or globally and works nicely.

    • #3
      For Fst, you need intra-pop variation and inter-pop variation. If I have a VCF file showing intrapop, how do I feed it other VCF files for interpop?

      The tool looks promising, it can even calculate Tajima's D, but when I look at the output, every D seems to equal -nan, so I'm not sure what that means.

      • #4
        So, it's been a minute since I've used vcftools fst, but I'll try to remember what to do. I'm not sure what you mean by a vcf "showing intrapop" -- does this mean that you have one vcf per pop? For vcftools, you can feed it a vcf with ALL individuals from ALL pops, and then you tell it which pop is which. Each pop gets its own file: pop1_members.pop is simply a list of the pop1 individuals, named as they are in the vcf and separated by newlines.

        For example:
        Code:
        vcftools --vcf two_pops.vcf --weir-fst-pop pop1_members.pop --weir-fst-pop pop2_members.pop
        vcftools is great for many things and it does a lot of stuff. One note of caution about Tajima's D is that vcftools assumes you are doing full-genome resequencing, not subsequencing like RAD. Not sure what you're using, but that's something I recently discovered the hard way.
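
A minimal sketch of how the per-population list files from the command above could be built. It assumes a hypothetical tab-separated map, fst_demo/sample_map.tsv (sample name, population label), which is not part of the thread; the sample names are made up and would have to match the VCF header exactly. The vcftools call only runs when vcftools and two_pops.vcf are actually present.

```shell
#!/bin/sh
# Hypothetical sample-to-population map: sample name <TAB> population.
# File name and sample names are illustrative assumptions.
mkdir -p fst_demo
printf 'ind01\tpop1\nind02\tpop1\nind03\tpop2\nind04\tpop2\n' > fst_demo/sample_map.tsv

# Write one list file per population, one sample name per line.
awk -F'\t' '{ print $1 > ("fst_demo/" $2 "_members.pop") }' fst_demo/sample_map.tsv

cat fst_demo/pop1_members.pop

# Same command as in the post, run only when the tool and the VCF exist.
if command -v vcftools >/dev/null 2>&1 && [ -f two_pops.vcf ]; then
    vcftools --vcf two_pops.vcf \
        --weir-fst-pop fst_demo/pop1_members.pop \
        --weir-fst-pop fst_demo/pop2_members.pop \
        --out fst_demo/pop1_vs_pop2
fi
```

Keeping the population assignment in one map file and deriving the .pop files from it avoids the lists drifting apart when samples are added or dropped.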

        • #5
          Well, I am doing full-genome resequencing. The problem is that my population is a population of spores... 800,000 spores... so I can't list them all. My VCF file contains the variants of those 800,000 spores, and then I have a few other VCF files containing the variants of additional, distinct populations of spores.

          Is there any way to do stats on this? LD, Tajima's D, Fst? I have been looking for an answer to this for months now... everything is based on sequence alignment...

          • #6
            Oh my. Did you sequence the spores as a pool, then? Or do you really have 800k individual sequences? Do you have allele frequencies? I'm pretty sure there is an answer to your dilemma, because Fst, LD and TajD/pi are based off frequencies, correct?

            • #7
              Yes, these are all about allele frequencies, and my sample is pooled; there is no other way to do it since they are unicellular. The VCFs have all the alleles and all the allele frequencies.

              Can you think of any way to feed this into Fst? How are your individuals tagged in your VCF files?

              • #8
                I'm not familiar with any prepackaged calculators that use frequencies directly, though I'm sure they exist. However, it shouldn't be terribly difficult to calculate a first-pass value on your own.

                Maybe check out this website -- http://johnhawks.net/explainer/laboratory/measuring-fst . It has step-by-step directions on how to do it based on the frequencies that you already have.

                However, what's your goal? Are you looking for a single Fst value per population, or per some-sized window?
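
As a rough illustration of such a first-pass calculation (the notation here is an assumption, not taken from the linked page): for one biallelic site with allele frequencies p_1 and p_2 in two populations, weighting both populations equally and ignoring sample-size or pool-size corrections, a heterozygosity-based Fst can be written as

```latex
\begin{aligned}
H_S &= \tfrac{1}{2}\left[\,2p_1(1-p_1) + 2p_2(1-p_2)\,\right], \\
\bar{p} &= \tfrac{1}{2}\,(p_1 + p_2), \qquad H_T = 2\bar{p}\,(1-\bar{p}), \\
F_{ST} &= \frac{H_T - H_S}{H_T}.
\end{aligned}
```

For example, with p_1 = 0.8 and p_2 = 0.2: H_S = 0.32, \bar{p} = 0.5, H_T = 0.5, so F_ST = (0.5 - 0.32) / 0.5 = 0.36. This is only the uncorrected textbook form; estimators such as Weir and Cockerham's (the one vcftools reports) add sample-size terms.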

                • #9
                  Finally a normal example I can understand.

                  That would be per population, since I have resequencing data.

                  • #10
                    Okay, that's much easier. By walking along the chromosome, calculating Fst and averaging over total bases I think you will escape the problems I am discovering in my locus-by-locus RAD-based pop genetics project. I'm struggling with developing a null distribution to which to compare my statistics, multiple test corrections and sliding windows. Ugh.

                    I'll think some more about your other stats, but I'm pretty sure it's doable.
                    Good luck!
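
One way the "walk along the chromosome and average" idea could be sketched, assuming a hypothetical per-site table (chrom, pos, allele frequency in pool 1, allele frequency in pool 2) that has already been pulled out of the VCFs. The file name and values are made up. The sketch sums the within-population and total heterozygosities over sites and takes the ratio of the sums, which is one common way to combine sites and not necessarily the averaging meant in the post; it applies no pool-size or coverage correction.

```shell
#!/bin/sh
# Hypothetical per-site table: chrom, pos, freq in pool 1, freq in pool 2.
# All values are invented for illustration.
mkdir -p fst_demo
printf 'chr1\t100\t0.8\t0.2\nchr1\t250\t0.5\t0.5\nchr1\t400\t0.9\t0.7\n' > fst_demo/freqs.tsv

LC_ALL=C awk -F'\t' '
{
    p1 = $3; p2 = $4
    hs = (2*p1*(1-p1) + 2*p2*(1-p2)) / 2   # mean within-population heterozygosity
    pbar = (p1 + p2) / 2
    ht = 2*pbar*(1-pbar)                   # heterozygosity of the combined population
    sum_hs += hs; sum_ht += ht
}
END {
    # Ratio of sums: sites with no variation contribute zero to both sums.
    if (sum_ht > 0) printf "%.4f\n", (sum_ht - sum_hs) / sum_ht
    else print "NA"
}' fst_demo/freqs.tsv > fst_demo/genome_fst.txt

echo "genome-wide Fst (ratio of sums): $(cat fst_demo/genome_fst.txt)"
```

Summing numerator and denominator separately keeps low-diversity sites from dominating, which a plain mean of per-site Fst values would allow.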

                    • #11
                      Yeah... my species is likely tetraploid, so heterozygosity is not 2pq, but a more complicated version of that.

                      Also, is it me, or does Fst assume HWE? For me, most allele frequencies are equal to each other: when I have 2 alleles they are 50%/50%, suggesting that all individuals are heterozygous, rather than only half of them as 2pq would predict.
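
For reference, a sketch of what the "more complicated version" could look like, assuming an autotetraploid with random mating and simple chromosomal segregation (no double reduction). None of these assumptions is established in the thread.

```latex
\begin{aligned}
\text{diploid:}\quad & (p+q)^2 = p^2 + 2pq + q^2, & P(\text{heterozygote}) &= 2pq, \\
\text{autotetraploid:}\quad & (p+q)^4 = p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4, & P(\text{heterozygote}) &= 1 - p^4 - q^4 .
\end{aligned}
```

With p = q = 0.5 the expected share of heterozygous individuals is 0.5 for a diploid and 1 - 2(0.5)^4 = 0.875 for the tetraploid case. The gene diversity 1 - p^2 - q^2 = 2pq, the chance that two randomly drawn allele copies differ, depends only on allele frequencies and is the same at either ploidy.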

                      • #12
                        Actually, 50%/50% says nothing about the actual *genotypes*: you could have 100 Aa dudes (which would be weird), or 50 AA and 50 aa individuals, or 25 AA, 25 aa and 50 Aa... etc etc

                        Luckily for you most statistics (excluding, obviously, heterozygosity) only care about the allelic frequencies and not the genotypes themselves.
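
To make this concrete, a small sketch using the three genotype mixes mentioned in this post (100 diploid individuals each; the file name is an assumption). It prints the frequency of allele A and the observed heterozygosity for each mix.

```shell
#!/bin/sh
# Three samples of 100 diploid individuals: counts of AA, Aa, aa.
mkdir -p fst_demo
printf '0\t100\t0\n50\t0\t50\n25\t50\t25\n' > fst_demo/genotypes.tsv

# For each sample: frequency of allele A and observed heterozygosity.
LC_ALL=C awk -F'\t' '{
    n = $1 + $2 + $3
    p = (2*$1 + $2) / (2*n)     # each AA carries two copies of A, each Aa one
    printf "AA=%d Aa=%d aa=%d  freq(A)=%.2f  Hobs=%.2f\n", $1, $2, $3, p, $2/n
}' fst_demo/genotypes.tsv > fst_demo/genotype_summary.txt

cat fst_demo/genotype_summary.txt
```

All three rows give freq(A)=0.50 while the observed heterozygosity runs from 0.00 to 1.00, which is why pooled allele frequencies alone cannot tell these cases apart.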

                        ...Does that help, or did I misread your question?

                        • #13
                          rcapper, this paper calculates Fst using RAD-Seq. Are there issues with the approach used or is your situation different?



                          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                          • #14
                            Oh, yes, I'm quite familiar with that paper. I've already used BayeScan for my Fst calculations because I am interested in the outlier loci mainly, but I am planning to script my own Fst calculator based on the Hohenlohe et al. 2010 weighted formula as well to see if I "missed" any. But, because that's a little redundant at the moment, I'm working on other stats first.

                            The issues I'm having are not so much calculating the stats in the first place, but in scripting the generation of the null distribution such that I can compare those stats to the expectation under neutrality. I think I'll write up a new thread about that, actually, because I haven't seen much discussion about the pros and cons of different strategies.
                            Last edited by rcapper; 11-25-2013, 03:50 PM. Reason: added link
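
One possible strategy for such a null distribution, sketched under stated assumptions: shuffle the population labels among individuals and recompute Fst on each shuffled assignment. This is a label-permutation null (no differentiation between the groups), which is narrower than a full neutral model and not necessarily the approach intended in the post. The sample names, file names and the tiny number of replicates are made up, and vcftools is only called if it and two_pops.vcf are present.

```shell
#!/bin/sh
# Hypothetical sample lists; in practice these would be the real *.pop files.
mkdir -p fst_demo/perm
printf 'ind01\nind02\nind03\n' > fst_demo/perm/pop1.pop
printf 'ind04\nind05\n' > fst_demo/perm/pop2.pop

n1=$(wc -l < fst_demo/perm/pop1.pop | tr -d ' ')
n_perm=5    # tiny for illustration; a real null needs many more replicates

i=1
while [ "$i" -le "$n_perm" ]; do
    # Shuffle all sample names (seeded per replicate so runs are repeatable),
    # then deal the first n1 to pseudo-pop1 and the rest to pseudo-pop2.
    cat fst_demo/perm/pop1.pop fst_demo/perm/pop2.pop |
        LC_ALL=C awk -v seed="$i" 'BEGIN { srand(seed) } { print rand() "\t" $0 }' |
        LC_ALL=C sort -k1,1n | cut -f2 > fst_demo/perm/shuffled.txt
    head -n "$n1" fst_demo/perm/shuffled.txt > "fst_demo/perm/perm${i}_pop1.pop"
    tail -n +"$((n1 + 1))" fst_demo/perm/shuffled.txt > "fst_demo/perm/perm${i}_pop2.pop"

    # Recompute Fst on the permuted labels only if vcftools and the VCF exist.
    if command -v vcftools >/dev/null 2>&1 && [ -f two_pops.vcf ]; then
        vcftools --vcf two_pops.vcf \
            --weir-fst-pop "fst_demo/perm/perm${i}_pop1.pop" \
            --weir-fst-pop "fst_demo/perm/perm${i}_pop2.pop" \
            --out "fst_demo/perm/perm${i}"
    fi
    i=$((i + 1))
done
```

The observed statistic would then be compared against the spread of the values from the permuted runs.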

                            • #15
                              Okay, so you are right: in principle, we can have AA, Aa and aa in any ratios. However, comparing these allele frequencies between samples, I see that most alleles are present in all of my samples and are 50/50 in all of them, which suggests that this is simple heterozygosity and that most individuals have the same genotype, with very little variation within my population.
