View Single Post
Old 10-27-2011, 12:29 AM   #4
Petr
Junior Member
 
Location: Europe

Join Date: Oct 2011
Posts: 6
Default

Hi,

I've just extended the documentation of fill-aa to answer this frequently asked question:

About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm requires names as 'chr1' or '1'.

Example:
# Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
# following command for each file manually
bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
samtools faidx human_ancestor_1.fa.gz

# .. or use this loop (tested in bash shell)
ls human_ancestor_*.fa.bz2 | while read IN; do
OUT=`echo $IN | sed 's,bz2$,gz,'`
CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
samtools faidx $OUT
done

# After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

# Now the files are ready to use with fill-aa. Note that the VCF file
# should be sorted (see vcf-sort), otherwise the performance would be seriously
# affected.
cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

Hope this helps,
Petr
Petr is offline   Reply With Quote