Firebird 06-16-2010 06:44 AM

1000 Genomes Data / Exon targeted

I have a question concerning the 1000 Genomes data. On the FTP site they have low-coverage and exon-targeted data.
I assume that the exon-targeted files only contain sequence information for the exons, but at a higher coverage. Is that correct?
But why do the file sizes differ so much between individuals (exon targeted, same chromosome)?


rbagnall 08-03-2010 03:31 PM


I think the 1000 Genomes Project enriched and sequenced only 1000 genes in the pilot data. I am trying to find out which 1000 genes they enriched, but this simple piece of information is frustratingly hard to find.

Can anyone else help?

adamdeluca 08-03-2010 05:47 PM

There is a BED file of the targeted regions and a gene list, both labeled P3.

adamdeluca 08-03-2010 05:51 PM

Also, there are three pilot projects.

P1 is low-coverage whole-genome sequencing
P2 is sequencing of parent/child trios
P3 is sequence capture of the coding exons of 1000 genes

rbagnall 08-03-2010 06:20 PM


Wonderful. That's just what I wanted.

Thanks Adamdeluca

BetterPrimate 08-03-2010 11:19 PM

OK summarising...

Pilot1       = 2-4X coverage     180 samples           Whole-genome sequencing
Pilot2       = 20-60X coverage   6 samples (2 trios)   Whole-genome sequencing
Pilot3       = 50X coverage      900 samples           1000 genes sequenced
Main project = 4X coverage       2000 samples          Whole-genome sequencing

But the FTP data is most unwieldy, with separate VCF files per population listing every genotype for every individual. Which raises a question:

Is there somewhere that summarises the allele frequencies for SNPs across all the 1KG pilots and combines the populations?
E.g. in the Pilot3 data for the CEU population we can find that SNP rs61733845 has 122 alleles called, but if you look up that SNP in dbSNP there is no frequency data.

Firebird 08-04-2010 11:55 AM


I was also looking for an overall VCF file, but I could only find genotypes per population per pilot study.

An overall file for the whole project would be fine.

laura 08-09-2010 02:10 AM

At the moment the project FTP site doesn't provide overall files for all the variant calls.

You can get the vcf files for each sub population used in each pilot from

low coverage represents 180 individuals sequenced to 2-4x
trios represents 2 family trios sequenced to 30x+
exon represents ~700 individuals sequenced for 1000 genes

You could use the vcftools package (on SourceForge) to get your frequencies for the whole set.

The Perl code that is part of this package will merge VCF files for you,

and the C++ code will provide frequency reports.
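
As a rough illustration of what combining the populations amounts to (this is not how vcftools is implemented): if each per-population file gives you an alternate allele count (AC) and a total number of called alleles (AN) for a site, the pooled frequency is the sum of the ACs divided by the sum of the ANs. The counts below are made-up placeholders, and `pooled_af` is a hypothetical helper name.

```python
def pooled_af(counts):
    """Pool allele counts across populations.

    counts maps a population label to an (AC, AN) pair for one site.
    Returns the pooled alternate allele frequency, or None if no alleles were called.
    """
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    if total_an == 0:
        return None
    return total_ac / total_an


# Made-up example counts for a single site (not real 1000 Genomes data).
example_counts = {"CEU": (3, 120), "YRI": (10, 100), "CHB": (0, 60)}
print(pooled_af(example_counts))
```

Note that this weights each population by how many alleles were called in it, which is generally not the same as averaging the per-population frequencies.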

Firebird 08-13-2010 01:37 AM


How can I access the data from the 2000 individuals sequenced at 4x coverage?


laura 08-13-2010 01:52 AM

Not all 2500 individuals have been sequenced yet.

So far we have sequence data for 653 samples; 552 have more than 10GB of sequence data available in fastq format.

We have alignments for 539 individuals in bam format

You can get all this data from our ftp site

Our website explains how our ftp site is structured

Firebird 08-13-2010 03:22 AM

Did you also call variants from these 653 samples?

Btw., I have a question about how you called variants in the Pilot 1 study. Did I understand it right that you pooled all the low-coverage sequence data and called the variants from this new data set? Don't you lose very rare variants by doing this?

laura 08-13-2010 04:00 AM

There aren't any variants released yet for the main project data.

We had a release of variants on the pilot data in July which you can find here

As far as the variant calling goes, most of the low-coverage individuals only have between 2 and 4x coverage, so there is insufficient data to call most variants from one individual alone; pooling the data gains us power. The low-coverage approach is less powerful for rare variants.
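
A back-of-the-envelope sketch of why a single 2-4x genome is thin evidence, under assumptions that are purely illustrative and not the project's calling method: read depth at a site is Poisson distributed, each read at a heterozygous site carries the alternate allele with probability 0.5, and a caller wants to see at least two alternate reads. The function name and the threshold of two reads are hypothetical.

```python
import math


def prob_at_least_k_alt_reads(mean_depth, k=2, alt_fraction=0.5):
    """Probability of seeing at least k alternate-allele reads at a het site.

    Assumes total depth ~ Poisson(mean_depth); thinning by alt_fraction means
    the alternate read count is Poisson(mean_depth * alt_fraction).
    """
    lam = mean_depth * alt_fraction
    p_fewer_than_k = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k)
    )
    return 1.0 - p_fewer_than_k


# Compare low-coverage depths with a deeper, trio-like depth.
for depth in (2, 4, 30):
    print(depth, round(prob_at_least_k_alt_reads(depth), 3))
```

Under these assumptions a single low-coverage individual often shows too few alternate reads at a true heterozygous site, which is the intuition behind combining evidence across many individuals.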

Firebird 08-13-2010 05:39 AM

Can you please tell me how many individuals are included in the last release?

So with this approach you are only able to call common variants? But isn't it a goal of the project to detect variants with a frequency of less than 1%?

laura 08-13-2010 05:46 AM

If you look at the alignment index and sequence index files on the ftp site you can see how many individuals are in each release.

With 2500 individuals we can get 95% of 1% MAF alleles in the accessible genome. We will find some variants with lower MAF but we won't find all of them.

This project is designed to find all shared variation within the population rather than very rare variants.

Another phase of the project is going to do exome sequencing of the 2500 individuals, and this will hopefully get variants down to 0.1% in these regions, as we will have higher coverage of those regions.

dilly.desilva 01-13-2011 04:58 AM

Allele frequencies in subpopulations 628 individuals

I am aware that there is a vcf file "ALL.2of4intersection.20100804.sites.vcf.gz" on the ftp site where you can retrieve allele frequency for SNPs from the low coverage data of 628 individuals. This is pooled across all subpopulations.

Is there a way I can get the allele frequencies for the same SNPs in subpopulations?

laura 01-13-2011 05:01 AM

There are no precalculated AFs per subpopulation. You can calculate the AN and AC numbers for each subpopulation and use them to work out an AF (AF = AC / AN) though.

dilly.desilva 01-13-2011 06:31 AM

Thank you Laura,

Where would I get the AC and AN for the separate subpopulations of the 628 individuals from? It's not in the merged SNP set, is it?

andrehorta 01-13-2011 06:38 AM


I'm new to genomics and I need your help. I downloaded a sequence_read from the 1000 Genomes Project, and I saw two folders, alignment and sequence_read. Which of these folders has the genome? And what's the difference between fastq, fasta, sra and ers? Which of these is the genome?

laura 01-13-2011 06:56 AM

The sequence_read dir contains the raw sequence reads that have been produced for a particular individual; these are in fastq format.
The alignment dir contains alignment files in bam format, which align the raw reads to a reference genome (in this case GRCh37).

There is more information about this data
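
To make the fastq format a bit more concrete, here is a minimal sketch of reading fastq records in Python, assuming the common four-line layout (a header line starting with `@`, the bases, a `+` separator line, and a quality string). The function name and the sample record are made up for illustration; real files are usually gzipped and far larger.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


# A made-up single-read example held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIHHHH\n")
records = list(read_fastq(example))
print(records)
```

A fasta file, by contrast, holds sequences without per-base quality values, which is why reference genomes are distributed as fasta and raw reads as fastq.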


laura 01-13-2011 06:59 AM


I am afraid you will have to calculate that yourself. The population for each sample is described in
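
A minimal sketch of that calculation for a single site, assuming you already have a sample-to-population mapping and the genotype (GT) strings from the VCF. The sample IDs, population assignments and genotypes below are made-up placeholders, and `per_population_af` is a hypothetical helper, not part of any 1000 Genomes tool.

```python
from collections import defaultdict


def per_population_af(genotypes, sample_to_pop, alt_allele="1"):
    """Compute (AC, AN, AF) per population for one site.

    genotypes maps sample ID -> GT string such as '0|1', '1/1' or './.'.
    sample_to_pop maps sample ID -> population label.
    Populations with no called alleles are left out of the result.
    """
    ac = defaultdict(int)
    an = defaultdict(int)
    for sample, gt in genotypes.items():
        pop = sample_to_pop.get(sample)
        if pop is None:
            continue  # sample not in the mapping, skip it
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing call does not count towards AN
            an[pop] += 1
            if allele == alt_allele:
                ac[pop] += 1
    return {pop: (ac[pop], an[pop], ac[pop] / an[pop]) for pop in an}


# Made-up samples and genotypes (not real 1000 Genomes data).
sample_to_pop = {"sampleA": "CEU", "sampleB": "CEU", "sampleC": "YRI", "sampleD": "YRI"}
genotypes = {"sampleA": "0|1", "sampleB": "0|0", "sampleC": "1|1", "sampleD": "./."}
result = per_population_af(genotypes, sample_to_pop)
print(result)
```

Running this over every line of the genotype VCF would give the per-subpopulation AC, AN and AF for each SNP.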


