SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
New Resources for 1000 Genomes laura Bioinformatics 36 04-15-2017 09:48 AM
SPANNER (1000 genomes) Margarida Bioinformatics 11 12-04-2013 10:09 AM
1000 genomes - can anyone join in? henry.wood General 0 06-24-2011 05:21 AM
1000 Genomes Data RichardRocca General 1 03-16-2011 01:11 PM
1000 genomes Nataiki Bioinformatics 4 02-04-2011 05:42 AM

Reply
 
Thread Tools
Old 10-25-2011, 10:59 AM   #1
jlc_1020
Member
 
Location: Massachusetts

Join Date: Mar 2010
Posts: 11
Default 1000 genomes ancestral alleles?

Hello all,

I've downloaded a subset of SNPs using tabix and then did an allele count using vcftools. I used the -get-INFO AA to get the identity of the ancestral allele for my SNPs but it returned all question marks.

My question is, what is the next best way to get ancestral alleles from the 1000 genomes SNPs dataset. I've tried using UCSCs liftover to chimp and macaque, but the alleles don't always match up (i.e. sometimes the ancestral allele is neither one of the human variants).

Any help would be greatly appreciated.
jlc_1020 is offline   Reply With Quote
Old 10-26-2011, 03:40 AM   #2
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

The current 1000 genomes release does not contain ancestral alleles unfortunately

You can get ancestral alleles using the vcftools fill-aa script

http://vcftools.sourceforge.net/perl...e.html#fill-aa

and use the Human Ancestral sequence from

ftp://ftp.1000genomes.ebi.ac.uk/vol1...37_e59.tar.bz2

Which is based on a 6way EPO alignment from ensembl 59
laura is offline   Reply With Quote
Old 10-26-2011, 08:34 PM   #3
jlc_1020
Member
 
Location: Massachusetts

Join Date: Mar 2010
Posts: 11
Default

Thanks for the help. I think I've almost got this working now. But I'm still getting this error message when I run the fill-aa script:

FIXME: the sequence names not in '>(chr)?\S+' format [?BCuman_ancestor_2.fa.gz.fai
at //FaSlice.pm line 56
FaSlice::throw('FaSlice=HASH(0x10096cc40)', 'FIXME: the sequence names not in \'>(chr)?\S+\' format [\x{1f}\x{8b}\x{8}\x{4}\x{0}...') called at //FaSlice.pm line 92

I'm not sure what this means. The human ancestral sequences are paired with a .bed index file, but the fill-aa script looks for a .fa.gz.fai, so I made one with samtools by doing:

samtools faidx human_ancestor_2.fa | bgzip -c >human_ancestor_2.fa.gz.fai

Is the problem with the format of my index file?
jlc_1020 is offline   Reply With Quote
Old 10-27-2011, 01:29 AM   #4
Petr
Junior Member
 
Location: Europe

Join Date: Oct 2011
Posts: 6
Default

Hi,

I've just extended the documentation of fill-aa to answer this frequently asked question:

About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm requires names as 'chr1' or '1'.

Example:
# Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
# following command for each file manually
bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
samtools faidx human_ancestor_1.fa.gz

# .. or use this loop (tested in bash shell)
ls human_ancestor_*.fa.bz2 | while read IN; do
OUT=`echo $IN | sed 's,bz2$,gz,'`
CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
samtools faidx $OUT
done

# After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

# Now the files are ready to use with fill-aa. Note that the VCF file
# should be sorted (see vcf-sort), otherwise the performance would be seriously
# affected.
cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

Hope this helps,
Petr
Petr is offline   Reply With Quote
Old 10-27-2011, 02:42 PM   #5
jlc_1020
Member
 
Location: Massachusetts

Join Date: Mar 2010
Posts: 11
Default

That worked great, thanks for all your help!
jlc_1020 is offline   Reply With Quote
Old 11-01-2011, 01:02 PM   #6
jlc_1020
Member
 
Location: Massachusetts

Join Date: Mar 2010
Posts: 11
Default

Ok, so I've run into a new but related problem with this dataset. The human ancestral alignments above are based on the NCBI36 build of the human genome, while the May 2011 SNP calls for the 1000 genomes are based on GRCh37. So when I run fill-aa it fills in the wrong alleles.

Is there a newer human ancestral alignment that would work? Or some other known workaround for this discrepancy?
jlc_1020 is offline   Reply With Quote
Old 11-01-2011, 01:15 PM   #7
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

Which ones are you using

The ones in here
ftp://ftp.1000genomes.ebi.ac.uk/vol1...ical/reference

Are from ensembl 59 which was based on grch37
laura is offline   Reply With Quote
Old 11-01-2011, 08:38 PM   #8
jlc_1020
Member
 
Location: Massachusetts

Join Date: Mar 2010
Posts: 11
Default

Ah, I see what I've done. I was using the ancestral files referred to in Petr's response, which were based on the NCBI36 build.

I've downloaded the correct files now, and then bzipped them so I could used Petr's shell script above to index them with samtools. Now everything seems to be working correctly; the ancestral alleles line up with the snp variants. Thanks!
jlc_1020 is offline   Reply With Quote
Old 05-31-2012, 12:04 PM   #9
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Quote:
Originally Posted by jlc_1020 View Post
Ah, I see what I've done. I was using the ancestral files referred to in Petr's response, which were based on the NCBI36 build.

I've downloaded the correct files now, and then bzipped them so I could used Petr's shell script above to index them with samtools. Now everything seems to be working correctly; the ancestral alleles line up with the snp variants. Thanks!
Is it possible for you to share the final solution, and where the ancestral allele sequence can be obtained from?

I know Petr provided most of the snippets, but if you can summarize the working version you got, would be tremendously helpful.

Also could you share how you are using the ancestral allele information. I was thinking of using it as in the LOF paper by macArthur et al.

Thanks!!
__________________
--
bioinfosm
bioinfosm is offline   Reply With Quote
Old 10-02-2012, 11:21 AM   #10
bpb9
Member
 
Location: NYC

Join Date: Aug 2012
Posts: 24
Default Anyone know what the "-" and "." mean in fill-aa?

I used the fill-aa feature in vcf tools to pull the ancestral alleles, according to thousand genomes, for variants I have typed in another population. However, I did not get an ancestral allele call for all of the positions. Most of the positions are biallelic SNPs, so the ancestral allele is listed A, T, G, or C as the case may be.

But some of the SNPs' ancestral alleles are filled in by a "-" or a "." or sometimes "-GT" of "C-A", etc. for variants bigger than one nucleotide. In some cases, if the variant is many base pairs, for example TGCAAT, the ancestral allele is filled in with "------" or "......".

I am not sure what "-" and "." means. If "." means reference, why isn't the reference allele just listed instead? Does "-" mean not present, or does it mean variable gap? I apologize if this is a silly question, but I couldn't find the details on the vcf tools page. Please advise if you know! Thanks!
bpb9 is offline   Reply With Quote
Old 10-02-2012, 12:03 PM   #11
bpb9
Member
 
Location: NYC

Join Date: Aug 2012
Posts: 24
Default

Nevermind, I've just answered my own question, I think. Clicking on Petr's link above (ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments), there is a readme.ancestralalignments txt file in the folder which says that "." means no ancestral information, while "-" means lineage-specific (not present in ancestor, only in human). Is this accurate?

If so, what does one do about all positions for which the ancestral is "."? It seems a waste to ignore these. Could some alternative approach be used to call an ancestral allele at these positions?

Last edited by bpb9; 10-02-2012 at 12:14 PM. Reason: typo
bpb9 is offline   Reply With Quote
Old 10-04-2012, 12:56 PM   #12
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

When no genome is complete and many of the genomes used in the alignments are of mixed quality this is unavoidable
laura is offline   Reply With Quote
Old 10-15-2012, 11:39 PM   #13
Petr
Junior Member
 
Location: Europe

Join Date: Oct 2011
Posts: 6
Default

Hi,
this README describes what dashes and dots mean
ftp://ftp.1000genomes.ebi.ac.uk/vol1...gnments/README
Petr is offline   Reply With Quote
Old 10-13-2017, 01:23 AM   #14
shrutijha
Junior Member
 
Location: India

Join Date: Oct 2017
Posts: 1
Default Error in Last step

Quote:
Originally Posted by Petr View Post
Hi,

I've just extended the documentation of fill-aa to answer this frequently asked question:

About: This script fills ancestral alleles into INFO column of VCF files. It depends on samtools, therefore the fasta sequence must be gzipped (not bgzipped!) and indexed by samtools faidx. The AA files can be downloaded from
ftp://ftp.1000genomes.ebi.ac.uk/vol1...ral_alignments
and processed as shown in the example below. This is because the sequences in the original files are named as 'ANCESTOR_for_chromosome:NCBI36:1:1:247249719', but the underlying FaSplice.pm requires names as 'chr1' or '1'.

Example:
# Get the files ready: compress by gzip and index by samtools faidx. Either repeat the
# following command for each file manually
bzcat human_ancestor_1.fa.bz2 | sed 's,^>.*,>1,' | gzip -c > human_ancestor_1.fa.gz
samtools faidx human_ancestor_1.fa.gz

# .. or use this loop (tested in bash shell)
ls human_ancestor_*.fa.bz2 | while read IN; do
OUT=`echo $IN | sed 's,bz2$,gz,'`
CHR=`echo $IN | sed 's,human_ancestor_,, ; s,.fa.bz2,,'`
bzcat $IN | sed "s,^>.*,>$CHR," | gzip -c > $OUT
samtools faidx $OUT
done

# After this has been done, the following command should return 'TACGTGGcTGCTCTCACACAT'
samtools faidx human_ancestor_1.fa.gz 1:1000000-1000020

# Now the files are ready to use with fill-aa. Note that the VCF file
# should be sorted (see vcf-sort), otherwise the performance would be seriously
# affected.
cat file.vcf | fill-aa -a human_ancestor_ 2>test.err | gzip -c >out.vcf.gz

Hope this helps,
Petr

I am trying to run:
Code:
cat chr3.vcf | fill-aa -a human_ancestor_3.fa.bz | gzip -c >out.sort.ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.geno                                                                                      types.vcf


But I am getting the following error:

[W::fai_fetch] Reference 3:60069-160069 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 3:60069-160069
The command "samtools faidx human_ancestor_3\.fa\.bz 3\:60069\-160069" returned non-zero status 256.
>3:60069-160069
at /usr/local/share/perl5/FaSlice.pm line 56.
FaSlice::throw('FaSlice=HASH(0x2227f08)', 'The command "samtools faidx h uman_ancestor_3\.fa\.bz 3\:60069...', '.\x{a}', '>3:60069-160069\x{a}') called a t /usr/local/share/perl5/FaSlice.pm line 79
FaSlice::cmd('FaSlice=HASH(0x2227f08)', 'samtools faidx human_ancestor_3 \.fa\.bz 3\:60069\-160069') called at /usr/local/share/perl5/FaSlice.pm line 125
FaSlice::read_chunk('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/ local/share/perl5/FaSlice.pm line 153
FaSlice::get_base('FaSlice=HASH(0x2227f08)', 3, 60069) called at /usr/lo cal/bin/fill-aa line 148
main::fill_aa('HASH(0x1d52a68)', 'human_ancestor_3.fa.bz') called at /us r/local/bin/fill-aa line 18
shrutijha is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:10 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO