**laura** · 10-26-2011, 02:40 AM

The current 1000 Genomes release does not contain ancestral alleles, unfortunately.

You can get ancestral alleles using the VCFtools fill-aa script (VCFtools: Perl tools and API):

http://vcftools.sourceforge.net/perl_module.html#fill-aa

and use the human ancestral sequence from

ftp://ftp.1000genomes.ebi.ac.uk/vol1...37_e59.tar.bz2

which is based on a 6-way EPO alignment from Ensembl 59.

**jlc_1020** · 10-26-2011, 07:34 PM

Thanks for the help. I think I've almost got this working now. But I'm still getting this error message when I run the fill-aa script:

FIXME: the sequence names not in '>(chr)?\S+' format [?BCuman_ancestor_2.fa.gz.fai
at //FaSlice.pm line 56
FaSlice::throw('FaSlice=HASH(0x10096cc40)', 'FIXME: the sequence names not in \'>(chr)?\S+\' format [\x{1f}\x{8b}\x{8}\x{4}\x{0}...') called at //FaSlice.pm line 92

I'm not sure what this means. The human ancestral sequences are paired with a .bed index file, but the fill-aa script looks for a .fa.gz.fai, so I made one with samtools by doing:

samtools faidx human_ancestor_2.fa | bgzip -c >human_ancestor_2.fa.gz.fai

Is the problem with the format of my index file?
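
A note on reading the trace above: the bytes \x{1f}\x{8b} at the start of the quoted string are the gzip magic number, which suggests that a compressed file was being read where plain text was expected. A .fai index written by samtools faidx is a plain-text table, so one quick check is whether the index file starts with those two bytes. The sketch below is only an illustration; the helper name is_gzip and the throwaway demo files are assumptions, not part of VCFtools or samtools.

```shell
# Hypothetical helper: succeeds if the file starts with the gzip magic bytes 1f 8b.
is_gzip() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Demo on two throwaway files (made-up content, not real index data).
demo_dir=$(mktemp -d)
printf '2\t1000\t3\t60\t61\n' > "$demo_dir/plain.fai"
gzip -c "$demo_dir/plain.fai" > "$demo_dir/compressed.fai"

for f in "$demo_dir/plain.fai" "$demo_dir/compressed.fai"; do
    if is_gzip "$f"; then
        echo "$(basename "$f"): compressed, not usable as a .fai index"
    else
        echo "$(basename "$f"): plain text"
    fi
done
```

Run against the real human_ancestor_2.fa.gz.fai, a "compressed" result would point at the index file itself rather than at the sequence names.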

**Petr** · 10-27-2011, 12:29 AM

Hi,

I've just extended the documentation of fill-aa to answer this frequently asked question:

About: This script fills ancestral alleles into the INFO column of VCF files. It depends on samtools, therefore the FASTA sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
and processed as shown in the example below. The processing is needed because the sequences in the original files are named like 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSlice.pm requires names such as 'chr1' or '1'.

Example:
# Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
# following command for each file manually
bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
samtools faidx human_ancestor_1.fa.gz

# .. or use this loop (bash shell)
for IN in human_ancestor_*.fa.bz2; do
    OUT=$(echo "$IN" | sed 's,bz2$,gz,')
    CHR=$(echo "$IN" | sed 's,human_ancestor_,, ; s,\.fa\.bz2$,,')
    bzcat "$IN" | sed "s,^>.*,>$CHR," | gzip -c > "$OUT"
    samtools faidx "$OUT"
done

# After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

# Now the files are ready to use with fill-aa. Note that the VCF file
# should be sorted (see vcf-sort), otherwise the performance would be seriously
# affected.
cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

Hope this helps,
Petr
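
To make the renaming step concrete, here is a minimal sketch of what the sed expression in the example does to a header line. It uses a toy record with a made-up sequence; the helper name rename_fasta_header is hypothetical and exists only for this illustration.

```shell
# Hypothetical helper: replace every FASTA header line on stdin with ">$1".
# Uses the same sed expression as the example above.
rename_fasta_header() {
    sed "s,^>.*,>$1,"
}

# Toy record with the header format quoted above (the sequence is made up).
printf '>ANCESTOR_for_chromosome:NCBI36:1:1:247249719\nACGTacgt..--\n' \
    | rename_fasta_header 1
```

Only lines starting with '>' are rewritten, so the sequence lines pass through unchanged.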

**jlc_1020** · 10-27-2011, 01:42 PM

That worked great, thanks for all your help!

**jlc_1020** · 11-01-2011, 12:02 PM

Ok, so I've run into a new but related problem with this dataset. The human ancestral alignments above are based on the NCBI36 build of the human genome, while the May 2011 SNP calls for the 1000 genomes are based on GRCh37. So when I run fill-aa it fills in the wrong alleles.

Is there a newer human ancestral alignment that would work? Or some other known workaround for this discrepancy?

**laura** · 11-01-2011, 12:15 PM

Which ones are you using?

The ones in here
ftp://ftp.1000genomes.ebi.ac.uk/vol1...ical/reference

are from Ensembl 59, which was based on GRCh37.

**jlc_1020** · 11-01-2011, 07:38 PM

Ah, I see what I've done. I was using the ancestral files referred to in Petr's response, which were based on the NCBI36 build.

I've downloaded the correct files now, and then bzipped them so I could use Petr's shell script above to index them with samtools. Now everything seems to be working correctly; the ancestral alleles line up with the SNP variants. Thanks!
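
The header quoted earlier in the thread, 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', carries the assembly name as its second colon-separated field, so the build can be checked before the headers are rewritten. A minimal sketch, assuming the GRCh37 files follow the same header layout (this is not confirmed in the thread); the helper name fasta_build is made up for illustration.

```shell
# Hypothetical helper: print the second ':'-separated field of the first
# FASTA header line read from stdin.
fasta_build() {
    head -n 1 | cut -d: -f2
}

# Toy record using the header quoted in the thread (the sequence is made up).
printf '>ANCESTOR_for_chromosome:NCBI36:1:1:247249719\nACGT\n' | fasta_build
```

On a real file this would be something like bzcat human_ancestor_1.fa.bz2 | fasta_build, run before the sed step that renames the header.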

**bioinfosm** · 05-31-2012, 11:04 AM

Is it possible for you to share the final solution, and where the ancestral allele sequence can be obtained from?

I know Petr provided most of the snippets, but if you can summarize the working version you got, it would be tremendously helpful.

Also, could you share how you are using the ancestral allele information? I was thinking of using it as in the LOF paper by MacArthur et al.

Thanks!!

**bpb9** · 10-02-2012, 10:21 AM

Anyone know what the "-" and "." mean in fill-aa?

I used the fill-aa feature in VCFtools to pull the ancestral alleles, according to 1000 Genomes, for variants I have typed in another population. However, I did not get an ancestral allele call for all of the positions. Most of the positions are biallelic SNPs, so the ancestral allele is listed as A, T, G, or C as the case may be.

But some of the SNPs' ancestral alleles are filled in with a "-" or a ".", or sometimes "-GT" or "C-A", etc. for variants longer than one nucleotide. In some cases, if the variant is many base pairs, for example TGCAAT, the ancestral allele is filled in with "------" or "......".

I am not sure what "-" and "." mean. If "." means reference, why isn't the reference allele just listed instead? Does "-" mean not present, or does it mean variable gap? I apologize if this is a silly question, but I couldn't find the details on the VCFtools page. Please advise if you know! Thanks!

**bpb9** · 10-02-2012, 11:03 AM

Never mind, I think I've just answered my own question. Following Petr's link above (ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments), there is a readme.ancestralalignments txt file in the folder which says that "." means no ancestral information, while "-" means lineage-specific (not present in the ancestor, only in human). Is this accurate?

If so, what does one do about all positions for which the ancestral is "."? It seems a waste to ignore these. Could some alternative approach be used to call an ancestral allele at these positions?

**laura** · 10-04-2012, 11:56 AM

When no genome is complete and many of the genomes used in the alignments are of mixed quality, this is unavoidable.

**Petr** · 10-15-2012, 10:39 PM

Hi,
This README describes what dashes and dots mean:
ftp://ftp.1000genomes.ebi.ac.uk/vol1...gnments/README
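
To see how many sites fall into each case after running fill-aa, the AA values can be tallied. The sketch below is a rough illustration, not part of VCFtools: it assumes the allele sits in the INFO column as an AA=... entry, uses the meanings quoted in this thread ("." for no ancestral information, "-" for lineage-specific), and runs on a made-up three-record VCF. The function name classify_aa and the category labels are hypothetical.

```shell
# Hypothetical helper: tally AA values from the INFO column (field 8) of a VCF on stdin.
classify_aa() {
    awk -F'\t' '!/^#/ {
        has_aa = 0
        n = split($8, info, ";")
        for (i = 1; i <= n; i++)
            if (info[i] ~ /^AA=/) { has_aa = 1; aa = substr(info[i], 4) }
        if (!has_aa)                    kind = "no_AA_tag"
        else if (aa ~ /^[ACGTacgt]+$/)  kind = "called"
        else if (aa ~ /^\.+$/)          kind = "no_information"
        else if (aa ~ /^-+$/)           kind = "lineage_specific"
        else                            kind = "other"
        count[kind]++
    }
    END { for (k in count) print k, count[k] }' | sort
}

# Made-up records: one called allele, one ".", one "--".
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t.\t.\tAA=A\n1\t200\t.\tC\tT\t.\t.\tDP=5;AA=.\n1\t300\t.\tTG\tT\t.\t.\tAA=--\n' \
    | classify_aa
```

Values that mix bases with dashes, such as "-GT", land in the "other" bucket here; how to treat them is a judgment call for the analysis at hand.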

**shrutijha** · 10-13-2017, 12:23 AM

Error in Last step

Originally posted by Petr

# Now the files are ready to use with fill-aa. Note that the VCF file
# should be sorted (see vcf-sort), otherwise the performance would be seriously
# affected.
cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

Hope this helps,
Petr

I am trying to run:

Code:

cat chr3.vcf | fill-aa -a human_ancestor_3.fa.bz | gzip -c >out.sort.ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf

But I am getting the following error:

[W::fai_fetch] Reference 3:60069-160069 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 3:60069-160069
The command "samtools faidx human_ancestor_3\.fa\.bz 3\:60069\-160069" returned non-zero status 256.
>3:60069-160069
at /usr/local/share/perl5/FaSlice.pm line 56.
FaSlice::throw('FaSlice=HASH(0x2227f08)', 'The command "samtools faidx human_ancestor_3\.fa\.bz 3\:60069...', '.\x{a}', '>3:60069-160069\x{a}') called at /usr/local/share/perl5/FaSlice.pm line 79
FaSlice::cmd('FaSlice=HASH(0x2227f08)', 'samtools faidx human_ancestor_3\.fa\.bz 3\:60069\-160069') called at /usr/local/share/perl5/FaSlice.pm line 125
FaSlice::read_chunk('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/local/share/perl5/FaSlice.pm line 153
FaSlice::get_base('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/local/bin/fill-aa line 148
main::fill_aa('HASH(0x1d52a68)', 'human_ancestor_3.fa.bz') called at /usr/local/bin/fill-aa line 18
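
The warning in this trace says that samtools found no sequence named 3 in the file it was given. The first column of a .fai index lists the sequence names, so one way to narrow this down is to check that column for the expected name. A minimal sketch; the helper name fai_has_sequence and the toy index below are hypothetical.

```shell
# Hypothetical helper: succeeds if the .fai file ($1) lists a sequence named $2
# in its first tab-separated column.
fai_has_sequence() {
    awk -F'\t' -v name="$2" '$1 == name { found = 1 } END { exit !found }' "$1"
}

# Toy index with a single sequence called "3" (all numbers are made up).
toy_fai=$(mktemp)
printf '3\t1000\t3\t60\t61\n' > "$toy_fai"

for name in 3 chr3; do
    if fai_has_sequence "$toy_fai" "$name"; then
        echo "$name: present"
    else
        echo "$name: missing"
    fi
done
```

If the expected name is missing from the real index, the header renaming step from Petr's example is the first thing to recheck.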

1000 genomes ancestral alleles?
