View Single Post
Old 10-27-2011, 01:29 AM   #4
Junior Member
Location: Europe

Join Date: Oct 2011
Posts: 6


I've just extended the documentation of fill-aa to answer this frequently asked question:

About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying requires names as 'chr1' or '1'.

# Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
# following command for each file manually
bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
samtools faidx human_ancestor_1.fa.gz

# .. or use this loop (tested in bash shell)
ls human_ancestor_*.fa.bz2 | while read IN; do
OUT=`echo $IN | sed 's,bz2$,gz,'`
CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
samtools faidx $OUT

# After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

# Now the files are ready to use with fill-aa. Note that the VCF file
# should be sorted (see vcf-sort), otherwise the performance would be seriously
# affected.
cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

Hope this helps,
Petr is offline   Reply With Quote