SEQanswers

Old 06-16-2010, 07:44 AM   #1
Firebird
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 18
Default 1000 Genomes Data / Exon targeted

Hi,

I have a question concerning the 1000 Genomes data. On the FTP site they have low-coverage and exon-targeted data.
I assume that the exon-targeted files only contain sequence information for the exons, but with higher coverage. Is that correct?
But why do the file sizes differ so much between individuals (exon-targeted, same chromosome)?

Thanks
Old 08-03-2010, 04:31 PM   #2
rbagnall
Member
 
Location: Sydney, Australia

Join Date: Jun 2010
Posts: 34
Default

Hi,

I think the 1000 Genomes Project has enriched and sequenced only 1000 genes in the pilot data. I am trying to find out which 1000 genes they enriched, but this simple piece of data is frustratingly hard to find.

Can anyone else help?
Old 08-03-2010, 06:47 PM   #3
adamdeluca
Member
 
Location: Iowa City, IA

Join Date: Jul 2010
Posts: 95
Default

ftp://ftp.1000genomes.ebi.ac.uk/vol1...cal/reference/

There is a BED file of the targeted regions and a gene list, both labeled P3.
Old 08-03-2010, 06:51 PM   #4
adamdeluca
Member
 
Location: Iowa City, IA

Join Date: Jul 2010
Posts: 95
Default There are three pilots

Also, there are three pilot projects.

P1 is low-coverage whole-genome sequencing
P2 is sequencing of parent/child trios
P3 is sequence capture of the coding exons of 1000 genes

http://www.1000genomes.org/page.php?...#ProjectDesign
Old 08-03-2010, 07:20 PM   #5
rbagnall
Member
 
Location: Sydney, Australia

Join Date: Jun 2010
Posts: 34
Default

OOhh,

Wonderful. That's just what I wanted.

Thanks Adamdeluca
Old 08-04-2010, 12:19 AM   #6
BetterPrimate
Member
 
Location: NSW

Join Date: May 2010
Posts: 15
Default

OK, summarising...

Pilot 1 = 2-4X coverage, 180 samples, whole-genome sequencing
Pilot 2 = 20-60X coverage, 6 samples (2 trios), whole-genome sequencing
Pilot 3 = 50X coverage, 900 samples, 1000 genes sequenced
Main project = 4X coverage, 2000 samples, whole-genome sequencing

But the FTP data is most unwieldy, with separate VCF files per population listing every genotype for every individual. Which raises a question:

Is there somewhere that summarises the allele frequencies for SNPs across all the 1KG pilots and combines the populations?
E.g. in the pilot 3 data for the CEU population we can find that SNP rs61733845 has 122 alleles called, but if you look up that SNP in dbSNP there is no frequency data.

Last edited by BetterPrimate; 08-04-2010 at 12:41 AM.
Old 08-04-2010, 12:55 PM   #7
Firebird
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 18
Default

@BetterPrimate

I was also looking for an overall VCF file, but I could only find genotypes per population per pilot study.

An overall file for the whole project would be good.
Old 08-09-2010, 03:10 AM   #8
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

At the moment the project FTP site doesn't provide overall files for all the variant calls.

You can get the VCF files for each subpopulation used in each pilot from ftp://ftp.1000genomes.ebi.ac.uk/vol1...lease/2010_07/

low coverage represents 180 individuals sequenced to 2-4x
trios represents 2 family trios sequenced to 30x+
exon represents ~700 individuals sequenced for 1000 genes

You could use the vcftools SourceForge package to get your frequencies for the whole set.

The Perl code that is part of this package will merge VCF files for you

http://vcftools.sourceforge.net/perl...html#merge-vcf

and the C++ code will provide frequency reports

http://vcftools.sourceforge.net/options.html
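
To make the "frequency for the whole set" idea concrete, here is a minimal Python sketch of the arithmetic only. It assumes you already have an alternate allele count (AC) and a total called allele count (AN) for one SNP in each population file. The function name and every number below are invented for illustration; this is not how vcftools itself is implemented.

```python
def pooled_frequency(counts):
    """Combine per-population (AC, AN) pairs into one allele frequency.

    Summing counts weights each population by the number of alleles
    actually called, unlike averaging the per-population frequencies.
    """
    counts = list(counts)  # allow generators; we iterate twice
    total_ac = sum(ac for ac, _ in counts)
    total_an = sum(an for _, an in counts)
    if total_an == 0:
        return None  # no called alleles anywhere, frequency undefined
    return total_ac / total_an


# Illustrative numbers only (not real 1000 Genomes values).
per_population = {
    "CEU": (12, 122),
    "YRI": (30, 200),
    "CHB+JPT": (0, 180),
}
print(pooled_frequency(per_population.values()))
```

Summing AC and AN before dividing matters when the populations have very different numbers of called alleles.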
Old 08-13-2010, 02:37 AM   #9
Firebird
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 18
Default

Hello,

How can I access the data from the 2000 individuals sequenced at 4x coverage?

Thanks
Old 08-13-2010, 02:52 AM   #10
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

Not all 2500 individuals have been sequenced yet.

So far we have sequence data for 653 samples; 552 of them have more than 10GB of sequence data available in fastq format.

We have alignments for 539 individuals in bam format.

You can get all this data from our FTP site.

Our website explains how our FTP site is structured:

http://1000genomes.org/page.php?page=data#DataAccess
Old 08-13-2010, 04:22 AM   #11
Firebird
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 18
Default

Did you also call variants from these 653 samples?

By the way, I have a question about how you called variants in the pilot 1 study. Did I understand correctly that you pooled all the low-coverage sequence data and called the variants from this new data set? Don't you lose very rare variants by doing this?
Old 08-13-2010, 05:00 AM   #12
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

There aren't any variants released yet on the main project data.

We had a release of variants on the pilot data in July, which you can find here:

ftp://ftp.1000genomes.ebi.ac.uk/vol1...lease/2010_07/

As far as the variant calling goes: most of the low-coverage individuals only have between 2x and 4x coverage, so there is insufficient data to call most variants from just one individual, and pooling the data gains us power. The low-coverage approach is less powerful for rare variants.
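
A rough back-of-the-envelope sketch of why one individual at 2-4x is not enough. This is a toy model and not the project's calling method. It assumes read depth at a site is Poisson-distributed, that half the reads at a heterozygous site carry the alternate allele, and that a caller would want at least 2 alternate reads (that threshold is an arbitrary assumption).

```python
import math


def prob_at_least_k_alt_reads(mean_coverage, k=2):
    """P(at least k reads show the alt allele) at a heterozygous site.

    Toy model: alt-allele reads ~ Poisson(mean_coverage / 2).
    """
    lam = mean_coverage / 2.0
    p_fewer_than_k = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k)
    )
    return 1.0 - p_fewer_than_k


for coverage in (2, 4, 30):
    print(coverage, round(prob_at_least_k_alt_reads(coverage), 3))
```

Under these assumptions a single sample at low coverage often shows too few alternate reads to support a call on its own, while at trio-like depth it almost always shows enough. That is the gap that pooling evidence across many individuals is meant to close.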
Old 08-13-2010, 06:39 AM   #13
Firebird
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 18
Default

Can you please tell me how many individuals are included in the last release?

So with this approach you are only able to call common variants? But isn't it a goal of the project to detect variants with a frequency of less than 1%?
Old 08-13-2010, 06:46 AM   #14
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

If you look at the alignment index and sequence index files on the ftp site you can see how many individuals are in each release.

ftp://ftp.1000genomes.ebi.ac.uk/vol1....sequence_data
ftp://ftp.1000genomes.ebi.ac.uk/vol1...alignment_data

With 2500 individuals we can get 95% of 1% MAF alleles in the accessible genome. We will find some variants with lower MAF but we won't find all of them.

This project is designed to find all shared variation within the population rather than very rare variants.

Another phase of the project is going to do exome sequencing of the 2500 individuals, and this will hopefully get variants down to 0.1% in these regions, as we will have higher coverage of those regions.
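
For a sense of scale, here is a small Python sketch of how many copies of a minor allele you would expect among the sampled chromosomes. It assumes diploid individuals and random sampling. It only describes whether the allele is present in the sample; it says nothing about whether low-coverage sequencing would actually detect it, so it does not reproduce the 95% figure above. The function names are made up for this example.

```python
def expected_minor_allele_copies(n_individuals, maf, ploidy=2):
    """Expected number of minor-allele copies among all sampled chromosomes."""
    return n_individuals * ploidy * maf


def prob_allele_in_sample(n_individuals, maf, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    return 1.0 - (1.0 - maf) ** (n_individuals * ploidy)


for maf in (0.01, 0.001):
    print(
        maf,
        expected_minor_allele_copies(2500, maf),
        prob_allele_in_sample(2500, maf),
    )
```

Under these assumptions a 1% allele is expected in dozens of sampled chromosomes, while a 0.1% allele is expected in only a handful, which fits the point that rarer variants need deeper coverage per individual to be called reliably.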
Old 01-13-2011, 05:58 AM   #15
dilly.desilva
Junior Member
 
Location: London

Join Date: Jan 2011
Posts: 2
Default Allele frequencies in subpopulations 628 individuals

Hi,

I am aware that there is a vcf file "ALL.2of4intersection.20100804.sites.vcf.gz" on the ftp site where you can retrieve allele frequency for SNPs from the low coverage data of 628 individuals. This is pooled across all subpopulations.

Is there a way I can get the allele frequencies for the same SNPs in subpopulations?
Old 01-13-2011, 06:01 AM   #16
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

There are no precalculated AFs per subpopulation. You can calculate the AN and AC numbers for each subpopulation and use them to work out an AF, though.
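
A minimal Python sketch of that calculation for a single site, assuming you have already pulled each sample's genotype string out of a VCF that contains genotypes and have a sample-to-population lookup. The sample names, populations, and function name are hypothetical. Any non-reference allele is counted towards AC, which is only right for biallelic sites.

```python
from collections import defaultdict


def population_counts(genotypes, sample_to_pop):
    """Return {population: (AC, AN, AF)} for one site.

    genotypes: dict of sample -> GT string such as '0|1', '1/1' or './.'
    sample_to_pop: dict of sample -> population label
    """
    counts = defaultdict(lambda: [0, 0])  # population -> [AC, AN]
    for sample, gt in genotypes.items():
        pop = sample_to_pop.get(sample)
        if pop is None:
            continue  # sample missing from the lookup, skip it
        entry = counts[pop]  # make sure the population is listed
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing call, counts towards neither AC nor AN
            entry[1] += 1
            if allele != "0":
                entry[0] += 1
    return {
        pop: (ac, an, ac / an if an else None)
        for pop, (ac, an) in counts.items()
    }


# Hypothetical samples and genotypes, for illustration only.
sample_to_pop = {"S1": "CEU", "S2": "CEU", "S3": "YRI"}
genotypes = {"S1": "0|1", "S2": "1/1", "S3": "./."}
print(population_counts(genotypes, sample_to_pop))
```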
Old 01-13-2011, 07:31 AM   #17
dilly.desilva
Junior Member
 
Location: London

Join Date: Jan 2011
Posts: 2
Default

Thank you Laura,

Where would I get the AC and AN for the separate subpopulations of the 628 individuals from? It's not in the merged SNP set, is it?
Old 01-13-2011, 07:38 AM   #18
andrehorta
Member
 
Location: Brazil - Belo Horizonte - UFMG

Join Date: Jan 2011
Posts: 14
Default

Hi.

I'm new to genomics and need your help. I downloaded sequence reads from the 1000 Genomes Project (ftp://ftp-trace.ncbi.nih.gov/1000gen.../data/HG00096/), and I saw two folders, alignment and sequence_read. Which of these folders has the genome? And what is the difference between fastq, fasta, sra and ers? Which of these is the genome?
Old 01-13-2011, 07:56 AM   #19
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

The sequence_read dir contains the raw sequence reads that have been produced for a particular individual; these are in fastq format.
The alignment dir contains alignment files in bam format, which align the raw reads to a reference genome (in this case GRCh37).
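
To show what "raw reads in fastq format" looks like in practice, here is a small Python sketch of a reader. It assumes the common four-line record layout (header starting with '@', sequence, '+' separator, quality string) with no line wrapping. The read in the example is made up.

```python
import io


def read_fastq(handle):
    """Yield (read_name, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record near %r" % header)
        yield header[1:], sequence, quality


# Invented single-read example.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, qual)
```

Each record is one short read with a quality score per base. The bam files hold the same kind of reads, with the position where each one aligns on the reference added.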

There is more information about this data here:

http://www.1000genomes.org/data

thanks

Last edited by laura; 01-13-2011 at 07:59 AM.
Old 01-13-2011, 07:59 AM   #20
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

Quote:
Originally Posted by dilly.desilva View Post
Thank you Laura,

Where would I get the AC and AN for the separate subpopulations of the 628 individuals from? It's not in the merged SNP set, is it?

I am afraid you will have to calculate that yourself. The population for each sample is described in ftp://ftp.1000genomes.ebi.ac.uk/vol1...0804.ALL.panel
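
A minimal Python sketch of turning such a panel file into a sample-to-population lookup. The layout is an assumption: tab-delimited, with the sample ID in the first column and the population code in the second. Check it against the real file before relying on it. The sample IDs in the example are invented.

```python
import io
from collections import defaultdict


def read_panel(handle):
    """Map sample ID -> population from a tab-delimited panel file.

    Assumes column 1 is the sample and column 2 the population;
    any further columns are ignored.
    """
    sample_to_pop = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or incomplete lines
        sample_to_pop[fields[0]] = fields[1]
    return sample_to_pop


def samples_by_population(sample_to_pop):
    """Invert the lookup: population -> sorted list of sample IDs."""
    groups = defaultdict(list)
    for sample, pop in sample_to_pop.items():
        groups[pop].append(sample)
    return {pop: sorted(samples) for pop, samples in groups.items()}


# Invented panel content, for illustration only.
panel_text = "SAMPLE_A\tCEU\tILLUMINA\nSAMPLE_B\tYRI\tILLUMINA\nSAMPLE_C\tCEU\tILLUMINA\n"
lookup = read_panel(io.StringIO(panel_text))
print(samples_by_population(lookup))
```

With the samples grouped this way, AC and AN can be counted per population from the genotype columns of a VCF.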

thanks

All times are GMT -8.

