  • Haplotype frequencies

    I am interested in looking at some very short regions (~200 bp) in the human genome which contain ~15 SNPs. So, taking for instance Phase 3 phased haplotype data (vcf) from 1000 genomes (~2500 individuals), I would like to identify all the different haplotypes and count them, thus allowing me to obtain the haplotype frequencies in this sample of individuals.

    I have tried using 'vcfgeno2haplo' and 'vcf2tsv', which are part of vcflib on Galaxy, but I cannot get the first to accept my data. Can someone suggest how I might do this or where I should look?

    There's an interesting tool, inPHAP, for visualizing this information, but it does not enable one to pull out the statistical data that I need. The linked description reads:

    Background: To understand individual genomes it is necessary to look at the variations that lead to changes in phenotype and possibly to disease. However, genotype information alone is often not sufficient and additional knowledge regarding the phase of the variation is needed to make correct interpretations. Interactive visualizations, that allow the user to explore the data in various ways, can be of great assistance in the process of making well informed decisions. But, currently there is a lack for visualizations that are able to deal with phased haplotype data.

    Results: We present inPHAP, an interactive visualization tool for genotype and phased haplotype data. inPHAP features a variety of interaction possibilities such as zooming, sorting, filtering and aggregation of rows in order to explore patterns hidden in large genetic data sets. As a proof of concept, we apply inPHAP to the phased haplotype data set of Phase 1 of the 1000 Genomes Project. Thereby, inPHAP’s ability to show genetic variations on the population as well as on the individuals level is demonstrated for several disease related loci.

    Conclusions: As of today, inPHAP is the only visual analytical tool that allows the user to explore unphased and phased haplotype data interactively. Due to its highly scalable design, inPHAP can be applied to large datasets with up to 100 GB of data, enabling users to visualize even large scale input data. inPHAP closes the gap between common visualization tools for unphased genotype data and introduces several new features, such as the visualization of phased data. inPHAP is available for download at http://bit.ly/1iJgKmX .

    Thanks.
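
For readers with the same question, here is a minimal sketch of the counting step described above, not the approach eventually shared in this thread. It assumes a small, already sliced VCF with phased diploid genotypes (`0|1` style) and uses only the Python standard library. The function names `open_vcf`, `count_haplotypes` and `haplotype_frequencies` are hypothetical helpers made up for this illustration.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Yield text lines from a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle


def count_haplotypes(lines, chrom, start, end):
    """Count distinct phased haplotypes over variants in chrom:start-end.

    Coordinates are 1-based and inclusive. Assumes every genotype is
    phased and diploid; anything else raises ValueError.
    """
    columns = []  # one list per variant, holding two alleles per sample
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom or not (start <= int(fields[1]) <= end):
            continue
        alleles = [fields[3]] + fields[4].split(",")  # REF, then ALT alleles
        gt_index = fields[8].split(":").index("GT")
        column = []
        for sample in fields[9:]:
            gt = sample.split(":")[gt_index]
            if "|" not in gt:
                raise ValueError(f"expected phased diploid genotype, got {gt!r}")
            for idx in gt.split("|"):
                # Keep missing calls as "." instead of guessing an allele.
                column.append("." if idx == "." else alleles[int(idx)])
        columns.append(column)
    # Transpose: each chromosome copy becomes one string of alleles.
    return Counter("-".join(hap) for hap in zip(*columns))


def haplotype_frequencies(counts):
    """Convert haplotype counts to relative frequencies."""
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}
```

Usage would look like `haplotype_frequencies(count_haplotypes(open_vcf("slice.vcf.gz"), "1", 100, 300))`, where the file name and coordinates are placeholders. Haploid calls (for example on chrX) would need separate handling.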

  • #2
    I am trying to answer the same question but on a genome scale: what are the haplotype frequencies at each locus across all loci?
    Plink --blocks will generate lists of haplotype blocks, but there is only one block per locus, which implies a maximum of two haplotypes per locus in a population, and that cannot always be true.
    Impute and Shapeit will phase genotype data, but I have not found any sets of haplotype blocks associated with them, so even if their models contemplate more than two haplotypes at a locus, it is not possible to extract this information from the phased data.
    Any ideas about how I should find the number and frequencies of haplotypes at each locus in the 1000 genomes data?

    • #3
      A colleague of mine should supply the code. But I don't understand what you mean by "at each locus across all loci"; what is your "locus" or what sort of "marker" are you considering?

      • #4
        Hi Mrw3288, thanks for your reply.
        I mean SNP loci. I will be working with 1000 genomes data. I want to know the number of haplotypes at each SNP locus in the genome in each population in the data set.
        However, locus can be a flexible concept, and I could work with windows of, say, 10 kb.
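
As a rough illustration of the windowed, per-population count described in this post, the sketch below groups variants into fixed-size windows and counts the distinct haplotypes seen in each population. It assumes the phased alleles have already been loaded into memory, one list per variant with one entry per chromosome copy. The function name, the argument layout and the population labels in the usage note are hypothetical choices for this example.

```python
from collections import defaultdict


def distinct_haplotypes_per_window(positions, allele_columns,
                                   hap_populations, window_size=10_000):
    """Return {(window_start, population): number of distinct haplotypes}.

    positions        1-based variant positions on one chromosome
    allele_columns   allele_columns[i][h] is the allele carried by
                     chromosome copy h at variant i
    hap_populations  population label for each chromosome copy h
    """
    # Assign each variant to the fixed window containing its position.
    windows = defaultdict(list)
    for i, pos in enumerate(positions):
        window_start = (pos - 1) // window_size * window_size + 1
        windows[window_start].append(i)

    result = {}
    for window_start, variant_idx in sorted(windows.items()):
        seen = defaultdict(set)
        for h, pop in enumerate(hap_populations):
            # The haplotype is the tuple of alleles across the window.
            hap = tuple(allele_columns[i][h] for i in variant_idx)
            seen[pop].add(hap)
        for pop, haps in seen.items():
            result[(window_start, pop)] = len(haps)
    return result
```

For example, with positions `[100, 150, 10500]` and four chromosome copies labelled `["GBR", "GBR", "YRI", "YRI"]`, the result has one entry per population for the window starting at 1 and one per population for the window starting at 10001. Fixed windows are only one possible definition of a locus; windows that contain no variants simply do not appear in the output.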

        • #5
          So next what do you mean by "at each SNP locus"? You need at least two (adjacent) SNPs (along a chromosome) to start making a haplotype.
          Anyway, if I solve my question (200 bp window) I should be able to solve yours (10 kb window), and I'll let you know if/when that happens.

          • #6
            True, but haplotypes have boundaries at SNPs, so the number of haplotypes associated with one SNP may differ from the number associated with the adjacent SNP. The resolution that I work at will partly depend on speed and practicality.
            I will look forward to hearing of your solution.

            • #7
              Here (attached) is a partial solution; it's an R script written by my colleague Jaqueline Wang. Ignore the comments in Portuguese. Put the script and your *.vcf.gz (eg sliced from 1000genomes) file in the same directory and run the script. The output is a *.tsv file. You'll have to slice the vcf according to your needs.
              • #8
                Thanks a lot for that. I will let you know how I get on.

                • #9
                  Originally posted by mrw3288:
                  Here (attached) is a partial solution; it's an R script written by my colleague Jaqueline Wang. Ignore the comments in Portuguese. Put the script and your *.vcf.gz (eg sliced from 1000genomes) file in the same directory and run the script. The output is a *.tsv file. You'll have to slice the vcf according to your needs.
                  Hi! I would be very interested in this script, but it is not available for download now. Would you be able to share it again? Thank you!

                  • #10
                    Originally posted by Sofia T:
                    Hi! I would be very interested in this script, but it is not available for download now. Would you be able to share it again? Thank you!
                    Hi. The project is finished. Try contacting Jaqueline Wang at [email protected]; she may be able to provide you with the script.
