SEQanswers

Old 12-21-2010, 09:34 AM   #1
maricu
Junior Member
 
Location: denmark

Join Date: Apr 2010
Posts: 8
Extracting genome specific SNPs from 1000 genomes

Hello!

I've been trying to get SNP data from the 1000 Genomes Project. I've been looking at the VCF files, but I can't tell whether they report population-level, individual, or total variation. I would like to download genotypes for specific genomes. I would appreciate any information.

Cheers
Old 12-22-2010, 03:07 AM   #2
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151

Documentation of the format can be found here

http://vcftools.sourceforge.net/specs.html

The files provided by the 1000 genomes project generally represent all the variant sites discovered in the samples analysed. The most recent release contains a list of the samples analysed ftp://ftp.1000genomes.ebi.ac.uk/vol1...0804.ALL.panel

vcftools can extract subsets of the data from a VCF file.
The files are also indexed with tabix, which means you can stream variants from a specific part of the genome.
Old 12-22-2010, 03:09 AM   #3
maricu
Junior Member
 
Location: denmark

Join Date: Apr 2010
Posts: 8

Thanks Laura!!!
Yes indeed, I found vcftools. The sys admin will install it soon and I will try it, but in the meantime I found a way to do it with awk, and it works quite well!
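
The awk command itself was not posted. As an illustration only, one way such an approach might look is sketched below: it keeps the nine fixed VCF columns plus the columns of the requested samples. The function name `select_samples` and the sample IDs in the usage comment are placeholders, and the sketch assumes a tab-delimited genotypes file with a `#CHROM` header row.

```shell
# Sketch: keep the 9 fixed VCF columns plus the columns of chosen samples.
# Sample IDs are passed as one comma-separated argument (hypothetical interface).
select_samples() {
  awk -F'\t' -v OFS='\t' -v want="$1" '
    BEGIN { n = split(want, ids, ","); for (i = 1; i <= n; i++) keep[ids[i]] = 1 }
    /^##/ { print; next }                      # meta-information lines pass through
    /^#CHROM/ {                                # header row names the sample columns
      for (i = 10; i <= NF; i++) if ($i in keep) cols[++m] = i
    }
    {                                          # header row and every data row
      line = $1
      for (i = 2; i <= 9; i++) line = line OFS $i
      for (j = 1; j <= m; j++) line = line OFS $(cols[j])
      print line
    }'
}

# Usage (placeholder file name and sample IDs):
#   gzip -dc genotypes.vcf.gz | select_samples NA12878,NA12891 > subset.vcf
```

Because the column numbers are looked up from the header row, the same function works whatever order the samples appear in.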

M
Old 12-22-2010, 03:45 AM   #4
Todd Johnson
Junior Member
 
Location: Yokohama, Japan

Join Date: Dec 2010
Posts: 3

Laura-
Sorry to jump into someone else's thread, but you seem like an expert I could ask about this. Have you tried running vcftools on the main November release genotype file ALL.2of4intersection.20100804.genotypes.vcf.gz?

If I uncompress the file and run:
vcftools --vcf ALL.2of4intersection.20100804.genotypes.vcf --chr 21 --out chr22 --recode

then VCFtools quits with "Error:Expected Number entry in INFO description..." The three INFO fields for EUR_R2, ASN_R2, and AFR_R2 are missing the "Number" entry. It seems that "Number=1" should be inserted between the field ID and the "Type=Float" tag, or else vcftools quits. I have a hard time believing that no one else has run into this problem, so I wonder if I'm doing something unusual. Anyway, I've modified my local copy and it works, but I thought that someone closer to the 1000 Genomes Project would want to know.
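
The exact local modification was not shown. A minimal sketch of the fix described above, assuming the affected header lines go straight from the ID to the Type entry, could use sed as a stream filter. The function name `add_info_number` is made up for this example.

```shell
# Sketch: add Number=1 to INFO header lines that go straight from ID to Type.
# Lines that already carry a Number entry, and all non-header lines, are untouched.
add_info_number() {
  sed -E 's/^(##INFO=<ID=[^,]+),Type=/\1,Number=1,Type=/'
}

# Usage (streams the file, so no uncompressed copy of the original is needed):
#   gzip -dc ALL.2of4intersection.20100804.genotypes.vcf.gz | add_info_number | gzip -c > fixed.vcf.gz
```

The pattern is anchored to `##INFO=<ID=` at the start of the line, so data rows that happen to contain `Type=` are not altered.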

Best wishes,

Todd
Old 12-22-2010, 03:51 AM   #5
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151

That does look like an error in the headers.

If you find problems like this, it is best to email info@1000genomes.org so the right people can investigate.

Thanks for letting us know.
Old 12-22-2010, 07:15 AM   #6
Todd Johnson
Junior Member
 
Location: Yokohama, Japan

Join Date: Dec 2010
Posts: 3

Laura-
Sorry, I actually neglected to look under the "Project Contacts" link on the website. However, I did e-mail Goncalo, since his e-mail is at the bottom of the README file for the latest release. Having not heard anything back from him, I thought that I should take the opportunity when I saw your message above.

Another thing I noticed, but don't know if it's expected, is that a number of rows have no genotypes in any of the samples. I would expect many rows to be missing genotypes in one population or another, but not across all samples. I suppose that those are variant sites that were found at BC and NCBI but did not get genotypes, since those groups did not perform LD-aware genotype analysis. It seems to me that those should be in the "sites" file but filtered out of the "genotypes" file. I'll put together an e-mail and forward my thoughts to the info@1000genomes.org address.

Thanks!

Todd
Old 12-22-2010, 07:21 AM   #7
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151

It was decided that it was better for all the sites to be in both files, with variants that don't have genotypes given the ./. notation. The sites file is always meant to contain the same variants as the genotypes file, but it is provided to give those who don't need individual genotypes a smaller download (300MB versus 60GB).
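
For anyone who wants to see how many rows carry only the ./. notation, a rough awk sketch follows. It assumes a tab-delimited VCF whose sample columns start at column 10 and whose genotype is the first colon-separated subfield; the function name `count_uncalled_rows` is invented for this example.

```shell
# Sketch: split data rows by whether any sample carries a genotype call.
# The genotype is taken as the first colon-separated subfield of each sample column.
count_uncalled_rows() {
  awk -F'\t' '
    /^#/ { next }                              # skip header lines
    {
      called = 0
      for (i = 10; i <= NF; i++) {
        split($i, parts, ":")
        if (parts[1] != "./." && parts[1] != ".") { called = 1; break }
      }
      if (called) n_called++; else n_missing++
    }
    END { printf "with_genotypes=%d without_genotypes=%d\n", n_called, n_missing }'
}

# Usage (placeholder file name):
#   gzip -dc genotypes.vcf.gz | count_uncalled_rows
```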

The only genotypes which should be used for imputation are those which include a prediction by BI, as these are the only sites which have genotypes assigned in an LD-aware manner. The UMich genotyper isn't LD-aware, and imputation accuracy suffers if its genotypes are used for this purpose.
Old 12-22-2010, 07:02 PM   #8
genesquared
Junior Member
 
Location: SF, CA

Join Date: Dec 2010
Posts: 6

all individual genotypes = 60 GB data?!

Are you kidding me?

60 x 10^9 / 1000 = 60 x 10^6 = 60 MB per person, sounds reasonable.
Old 12-22-2010, 07:50 PM   #9
Todd Johnson
Junior Member
 
Location: Yokohama, Japan

Join Date: Dec 2010
Posts: 3

Tell me about it!

The VCF file has so much other information besides just the genotype calls that it seems a bit excessive for a release to the public. It's sort of like XML embedded in a table format: a header at the top, and key-value pairs embedded within columns.
The call data for a single variant position in one sample looks like this:
0|0:3,0:3:.:-0.00,-0.90,-13.33:22.58:./.
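
As a small illustration of pulling the genotype out of a field like the one above: the VCF spec puts GT first whenever it is present, so taking the first colon-separated subfield is enough. What the remaining subfields mean depends on the row's FORMAT column, which is not shown here. The helper name `genotype_of` is made up for this sketch.

```shell
# Sketch: pull the genotype out of one per-sample VCF field.
# GT is the first colon-separated subfield when it is present.
genotype_of() {
  printf '%s\n' "$1" | cut -d: -f1
}

genotype_of '0|0:3,0:3:.:-0.00,-0.90,-13.33:22.58:./.'   # prints 0|0
```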

To understand the format a bit better, take a look at http://www.1000genomes.org/wiki/Anal...mat-version-40

If you want just genotype calls, you can download files formatted for Beagle, MACH, and Impute, which are much smaller, but it seems to me that each of those formats leaves out some of the info that would be useful for checking allele orientation (i.e., between existing Build 36 Illumina 610k data and the release's Build 37 coordinates):

Beagle:
http://faculty.washington.edu/browni...le/beagle.html

MACH:
http://www.sph.umich.edu/csg/abecasis/MaCH

Impute:
https://mathgen.stats.ox.ac.uk/impute/impute_v2.html
Old 12-23-2010, 12:30 AM   #10
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151

Quote:
Originally Posted by genesquared View Post
all individual genotypes = 60 GB data?!

Are you kidding me?

60 x 10^9 / 1000 = 60 x 10^6 = 60 MB per person, sounds reasonable.
Well, it's only 629 individuals in this instance, and it's 60GB compressed, 380GB uncompressed, but you should generally be able to stream the file using a combination of tabix and/or zcat, so you never need to actually uncompress it to disk.
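
A minimal sketch of the streaming idea, assuming a gzip-compatible file (bgzip output, as used for tabix, can be read this way): the decompressed data goes straight into awk through a pipe and is never written out. The function name `rows_per_chrom` and the per-chromosome count are just an example of something to compute on the stream.

```shell
# Sketch: count variant rows per chromosome straight from the compressed file.
# gzip -dc writes the decompressed stream to stdout, so nothing is stored on disk.
rows_per_chrom() {
  gzip -dc < "$1" |
    awk -F'\t' '!/^#/ { n[$1]++ } END { for (c in n) print c "\t" n[c] }' |
    sort
}

# Usage (placeholder file name):
#   rows_per_chrom genotypes.vcf.gz
```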
Old 12-23-2010, 02:39 PM   #11
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151

Quote:
Originally Posted by Todd Johnson View Post
Laura-
Sorry to jump into someone else's thread, but you seem like an expert I could ask about this. Have you tried running vcftools on the main November release genotype file ALL.2of4intersection.20100804.genotypes.vcf.gz?

If I uncompress the file and run:
vcftools --vcf ALL.2of4intersection.20100804.genotypes.vcf --chr 21 --out chr22 --recode

then VCFtools quits with "Error:Expected Number entry in INFO description..." The three INFO fields for EUR_R2, ASN_R2, and AFR_R2 are missing the "Number" entry. It seems that "Number=1" should be inserted between the field ID and the "Type=Float" tag, or else vcftools quits. I have a hard time believing that no one else has run into this problem, so I wonder if I'm doing something unusual. Anyway, I've modified my local copy and it works, but I thought that someone closer to the 1000 Genomes Project would want to know.

Best wishes,

Todd
This error should now have been fixed.

Thanks for pointing it out.
Old 01-21-2011, 03:16 AM   #12
genesquared
Junior Member
 
Location: SF, CA

Join Date: Dec 2010
Posts: 6

I would like to inspect 17 individuals and about 300 SNPs in a 500 kb locus.

Is there any "short cut"?

I know their hg18 position (but no rs#).

Thanks in advance
Old 01-21-2011, 03:46 AM   #13
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151

Your best bet for this is to use tabix to extract the data from the released vcf files.

The vcf format is described here
http://vcftools.sourceforge.net/specs.html

The files themselves can be found here
ftp://ftp.1000genomes.ebi.ac.uk/vol1...man_variation/

You can use tabix http://sourceforge.net/projects/samtools/files/tabix/ to extract subsections of these files

e.g.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz 1:10000-20000
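
For a list of individual SNP positions rather than one interval, a rough sketch is to turn each position into a single-base tabix region. This assumes the positions are already in the coordinate build of the VCF being queried (the release is described earlier in the thread as using Build 37 coordinates, so hg18 positions would need converting first). The function name `positions_to_regions`, the file `positions.tsv`, and the variable `VCF_URL` are all placeholders.

```shell
# Sketch: turn "chrom<TAB>pos" lines into tabix region strings (chrom:start-end).
# Positions must already be in the coordinate build of the VCF being queried.
positions_to_regions() {
  awk -F'\t' 'NF >= 2 { printf "%s:%d-%d\n", $1, $2, $2 }'
}

# Usage (not run here; VCF_URL stands in for the truncated path above):
#   tabix -h "$VCF_URL" $(positions_to_regions < positions.tsv)
```

For SNPs packed into one 500 kb locus, a single region covering the whole span may be simpler and faster than hundreds of one-base queries, with the wanted positions filtered afterwards.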

Tags
1000 genomes, genotyping

All times are GMT -8.