Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • 1000 genomes ancestral alleles?

    Hello all,

    I've downloaded a subset of SNPs using tabix and then did an allele count using vcftools. I used the -get-INFO AA to get the identity of the ancestral allele for my SNPs but it returned all question marks.

    My question is, what is the next best way to get ancestral alleles from the 1000 genomes SNPs dataset. I've tried using UCSCs liftover to chimp and macaque, but the alleles don't always match up (i.e. sometimes the ancestral allele is neither one of the human variants).

    Any help would be greatly appreciated.

  • #2
    The current 1000 genomes release does not contain ancestral alleles unfortunately

    You can get ancestral alleles using the vcftools fill-aa script



    and use the Human Ancestral sequence from

    ftp://ftp.1000genomes.ebi.ac.uk/vol1...37_e59.tar.bz2

    Which is based on a 6way EPO alignment from ensembl 59

    Comment


    • #3
      Thanks for the help. I think I've almost got this working now. But I'm still getting this error message when I run the fill-aa script:

      FIXME: the sequence names not in '>(chr)?\S+' format [?BCuman_ancestor_2.fa.gz.fai
      at //FaSlice.pm line 56
      FaSlice::throw('FaSlice=HASH(0x10096cc40)', 'FIXME: the sequence names not in \'>(chr)?\S+\' format [\x{1f}\x{8b}\x{8}\x{4}\x{0}...') called at //FaSlice.pm line 92

      I'm not sure what this means. The human ancestral sequences are paired with a .bed index file, but the fill-aa script looks for a .fa.gz.fai, so I made one with samtools by doing:

      samtools faidx human_ancestor_2.fa | bgzip -c >human_ancestor_2.fa.gz.fai

      Is the problem with the format of my index file?

      Comment


      • #4
        Hi,

        I've just extended the documentation of fill-aa to answer this frequently asked question:

        About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
        ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
        and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm requires names as 'chr1' or '1'.

        Example:
        # Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
        # following command for each file manually
        bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
        samtools faidx human_ancestor_1.fa.gz

        # .. or use this loop (tested in bash shell)
        ls human_ancestor_*.fa.bz2 | while read IN; do
        OUT=`echo $IN | sed 's,bz2$,gz,'`
        CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
        bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
        samtools faidx $OUT
        done

        # After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
        samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

        # Now the files are ready to use with fill-aa. Note that the VCF file
        # should be sorted (see vcf-sort), otherwise the performance would be seriously
        # affected.
        cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

        Hope this helps,
        Petr

        Comment


        • #5
          That worked great, thanks for all your help!

          Comment


          • #6
            Ok, so I've run into a new but related problem with this dataset. The human ancestral alignments above are based on the NCBI36 build of the human genome, while the May 2011 SNP calls for the 1000 genomes are based on GRCh37. So when I run fill-aa it fills in the wrong alleles.

            Is there a newer human ancestral alignment that would work? Or some other known workaround for this discrepancy?

            Comment


            • #7
              Which ones are you using

              The ones in here
              ftp://ftp.1000genomes.ebi.ac.uk/vol1...ical/reference

              Are from ensembl 59 which was based on grch37

              Comment


              • #8
                Ah, I see what I've done. I was using the ancestral files referred to in Petr's response, which were based on the NCBI36 build.

                I've downloaded the correct files now, and then bzipped them so I could used Petr's shell script above to index them with samtools. Now everything seems to be working correctly; the ancestral alleles line up with the snp variants. Thanks!

                Comment


                • #9
                  Originally posted by jlc_1020 View Post
                  Ah, I see what I've done. I was using the ancestral files referred to in Petr's response, which were based on the NCBI36 build.

                  I've downloaded the correct files now, and then bzipped them so I could used Petr's shell script above to index them with samtools. Now everything seems to be working correctly; the ancestral alleles line up with the snp variants. Thanks!
                  Is it possible for you to share the final solution, and where the ancestral allele sequence can be obtained from?

                  I know Petr provided most of the snippets, but if you can summarize the working version you got, would be tremendously helpful.

                  Also could you share how you are using the ancestral allele information. I was thinking of using it as in the LOF paper by macArthur et al.

                  Thanks!!
                  --
                  bioinfosm

                  Comment


                  • #10
                    Anyone know what the "-" and "." mean in fill-aa?

                    I used the fill-aa feature in vcf tools to pull the ancestral alleles, according to thousand genomes, for variants I have typed in another population. However, I did not get an ancestral allele call for all of the positions. Most of the positions are biallelic SNPs, so the ancestral allele is listed A, T, G, or C as the case may be.

                    But some of the SNPs' ancestral alleles are filled in by a "-" or a "." or sometimes "-GT" of "C-A", etc. for variants bigger than one nucleotide. In some cases, if the variant is many base pairs, for example TGCAAT, the ancestral allele is filled in with "------" or "......".

                    I am not sure what "-" and "." means. If "." means reference, why isn't the reference allele just listed instead? Does "-" mean not present, or does it mean variable gap? I apologize if this is a silly question, but I couldn't find the details on the vcf tools page. Please advise if you know! Thanks!

                    Comment


                    • #11
                      Nevermind, I've just answered my own question, I think. Clicking on Petr's link above (ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments), there is a readme.ancestralalignments txt file in the folder which says that "." means no ancestral information, while "-" means lineage-specific (not present in ancestor, only in human). Is this accurate?

                      If so, what does one do about all positions for which the ancestral is "."? It seems a waste to ignore these. Could some alternative approach be used to call an ancestral allele at these positions?
                      Last edited by bpb9; 10-02-2012, 11:14 AM. Reason: typo

                      Comment


                      • #12
                        When no genome is complete and many of the genomes used in the alignments are of mixed quality this is unavoidable

                        Comment


                        • #13
                          Hi,
                          this README describes what dashes and dots mean
                          ftp://ftp.1000genomes.ebi.ac.uk/vol1...gnments/README

                          Comment


                          • #14
                            Error in Last step

                            Originally posted by Petr View Post
                            Hi,

                            I've just extended the documentation of fill-aa to answer this frequently asked question:

                            About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
                            ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
                            and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm requires names as 'chr1' or '1'.

                            Example:
                            # Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
                            # following command for each file manually
                            bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
                            samtools faidx human_ancestor_1.fa.gz

                            # .. or use this loop (tested in bash shell)
                            ls human_ancestor_*.fa.bz2 | while read IN; do
                            OUT=`echo $IN | sed 's,bz2$,gz,'`
                            CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
                            bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
                            samtools faidx $OUT
                            done

                            # After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
                            samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

                            # Now the files are ready to use with fill-aa. Note that the VCF file
                            # should be sorted (see vcf-sort), otherwise the performance would be seriously
                            # affected.
                            cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

                            Hope this helps,
                            Petr

                            I am trying to run:
                            Code:
                            cat chr3.vcf | fill-aa -a human_ancestor_3.fa.bz | gzip -c >out.sort.ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.geno                                                                                      types.vcf


                            But I am getting the following error:

                            [W::fai_fetch] Reference 3:60069-160069 not found in FASTA file, returning empty sequence
                            Failed to fetch sequence in 3:60069-160069
                            The command "samtools faidx human_ancestor_3\.fa\.bz 3\:60069\-160069" returned non-zero status 256.
                            >3:60069-160069
                            at /usr/local/share/perl5/FaSlice.pm line 56.
                            FaSlice::throw('FaSlice=HASH(0x2227f08)', 'The command "samtools faidx h uman_ancestor_3\.fa\.bz 3\:60069...', '.\x{a}', '>3:60069-160069\x{a}') called a t /usr/local/share/perl5/FaSlice.pm line 79
                            FaSlice::cmd('FaSlice=HASH(0x2227f08)', 'samtools faidx human_ancestor_3 \.fa\.bz 3\:60069\-160069') called at /usr/local/share/perl5/FaSlice.pm line 125
                            FaSlice::read_chunk('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/ local/share/perl5/FaSlice.pm line 153
                            FaSlice::get_base('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/lo cal/bin/fill-aa line 148
                            main::fill_aa('HASH(0x1d52a68)', 'human_ancestor_3.fa.bz') called at /us r/local/bin/fill-aa line 18

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM
                            • seqadmin
                              The Impact of AI in Genomic Medicine
                              by seqadmin



                              Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                              02-26-2024, 02:07 PM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 03-14-2024, 06:13 AM
                            0 responses
                            32 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-08-2024, 08:03 AM
                            0 responses
                            71 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-07-2024, 08:13 AM
                            0 responses
                            80 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-06-2024, 09:51 AM
                            0 responses
                            68 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X