SEQanswers

Old 12-11-2012, 08:53 AM   #1
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default 1000 genomes data format

Paper (10 pages, .pdf):
http://www.nature.com/nature/journal...ature11632.pdf
Supplementary material (113 pages, .pdf):
http://www.nature.com/nature/journal...re11632-s1.pdf
==============================================================


After starting with the HapMap data (which looked easier) in this thread:
http://seqanswers.com/forums/showthread.php?t=25554
I arrived (again) at the 1000 Genomes data, which is apparently more extensive.


1000 Genomes 2012 paper, supplementary material, page 60:
10.5 Haplotype estimation from OMNI data
2123 individuals:
327 trios, 42 duos, 1058 singles (327*3 + 42*2 + 1058 = 2123)
2177885 SNPs (--> ~4.4 times more data than HapMap with 250 individuals at ~4M SNPs)
http://ftp.1000genomes.ebi.ac.uk/vol...ni_haplotypes/
no X, Y or M chromosomes

4 bytes per (individual, position) when expanded: "1|1", "1|0", "0|1" or "0|0", plus ASCII 009 (tab).
So each entry carries only 2 bits of information (32 bits of text for 2 bits, a factor of 16), and that corresponds to the compression rate.
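
To make the 2-bits-per-entry point concrete, here is a minimal Python sketch that packs such entries 4 to a byte. It assumes each line is just tab-separated phased genotypes with no other columns, which may not match the real files exactly, and the function names are made up for illustration.

```python
# Sketch only: pack phased genotype strings ("0|0", "0|1", "1|0", "1|1")
# into 2 bits each. Function names are hypothetical, not from any 1000 Genomes tool.
# Assumption: a line holds nothing but tab-separated genotypes.

def encode_genotype(gt):
    """Map a phased genotype such as "1|0" to an integer 0..3."""
    left, right = gt.split("|")
    return (int(left) << 1) | int(right)

def pack_row(tab_separated_line):
    """Pack one line of genotypes into bytes, 4 genotypes per byte."""
    fields = tab_separated_line.rstrip("\n").split("\t")
    codes = [encode_genotype(gt) for gt in fields]
    out = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        # first genotype goes into the two highest bits of its byte
        out[i // 4] |= code << (2 * (3 - i % 4))
    return bytes(out)

line = "1|1\t1|0\t0|1\t0|0\n"
packed = pack_row(line)
ratio = len(line) / len(packed)  # bytes of text per packed byte
```

For this 16-character line the packed form is a single byte, so the factor of 16 falls out directly.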

11, 10, 01, 00 could correspond to the 2-letter entries in the HapMap data, where the letters are given in
columns 2,3 and 1,0 indicate which one of them is chosen on which of the two chromosome copies.
However, HapMap would have either 01 or 10, never both (the two copies are indistinguishable).

Anyway, you choose one of the two values at random and get that binary "diversity-matrix",
here of size 2123x2177885 bits, 578MB,
as compared to 250x4170000 bits, 130MB, for HapMap.
Well, it's not really binary, since there are empty positions (~10%).
I think we should somehow fill them in a random, unbiased way so as to preserve the structure
and statistical content.
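
One possible reading of "random, unbiased" is sketched below in Python: pick one of the two alleles at random for each entry, then fill each empty position with a random bit drawn at the frequency observed among the non-empty entries of the same SNP column. This is only an assumption about what would preserve the statistics (it ignores correlation between neighboring SNPs), and the function names and the use of "" for an empty entry are made up for the example.

```python
import random

# Sketch only. Assumptions: an empty entry is represented by "" and
# missing bits are drawn independently per SNP column.

def pick_allele(gt, rng):
    """Return 0 or 1 by picking one of the two alleles at random, None if empty."""
    if gt == "":
        return None
    return int(rng.choice(gt.split("|")))

def fill_missing(column, rng):
    """Replace None with random bits drawn at the column's observed frequency of 1s."""
    known = [b for b in column if b is not None]
    freq = sum(known) / len(known) if known else 0.0
    return [b if b is not None else int(rng.random() < freq) for b in column]

rng = random.Random(0)  # fixed seed so the example is repeatable
genotypes = ["1|1", "", "0|0", "1|0", ""]  # one SNP column, 5 individuals
column = [pick_allele(gt, rng) for gt in genotypes]
filled = fill_missing(column, rng)
```

Drawing at the observed frequency keeps the expected fraction of 1s in each column unchanged, which is the only sense in which this fill is unbiased.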

http://www.1000genomes.org/wiki/Anal...mat-version-41
page 4, Genotype fields (GT):
> If genotype information is present, then the same types of data must be present for all samples.
> First a FORMAT field is given specifying the data types and order (colon-separated alphanumeric
> String). This is followed by one field per sample, with the colon-separated data in this field
> corresponding to the types specified in the format. The first sub-field must always be the
> genotype (GT) if it is present. There are no required sub-fields.
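
As a small illustration of the quoted rule, this Python sketch pairs the colon-separated FORMAT keys with one sample's colon-separated values. The function name and the example values are made up and do not come from a 1000 Genomes file.

```python
# Sketch only: split a VCF FORMAT string and one sample column into a dict.
# The helper name and the example values are hypothetical.

def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the value at the same position in the sample field."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so a sample with fewer sub-fields still parses
    return dict(zip(keys, values))

sample = parse_sample("GT:GQ:DP", "0|1:48:8")
genotype = sample["GT"]
```

Since GT must come first when present, the genotype is also simply the text before the first colon of each sample field.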

Last edited by gsgs; 12-19-2012 at 04:39 PM.
Old 12-11-2012, 09:01 AM   #2
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default

Anyway, it seems that you always arrive at such a binary "diversity-matrix",
individuals over SNP positions (all chromosomes),
of size 2123x2177885 bits, 578MB, for 1000 Genomes
of size 250x4170000 bits, 130MB, for HapMap
The bit at position (x,y) is set iff sequence x differs at position y from the consensus
(or average) in one of the 2 diploid alleles, chosen at random.
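
A toy Python version of that definition, plus the size arithmetic, is sketched below. It assumes one allele letter per (individual, SNP) has already been chosen, takes the most frequent letter in each column as the consensus, and counts 1 MB as 10^6 bytes; the function names are invented for the example.

```python
# Sketch only. Assumptions: each row is a string with one already-chosen
# allele letter per SNP, and "consensus" means the most frequent letter.

def consensus_bits(allele_rows):
    """Return a 0/1 matrix: 1 where a row differs from the column consensus."""
    n_cols = len(allele_rows[0])
    consensus = []
    for y in range(n_cols):
        column = [row[y] for row in allele_rows]
        # most frequent letter; ties are broken alphabetically
        consensus.append(max(sorted(set(column)), key=column.count))
    return [[int(row[y] != consensus[y]) for y in range(n_cols)]
            for row in allele_rows]

def bitmatrix_megabytes(rows, cols):
    """Size of a rows x cols bit matrix in MB (10**6 bytes)."""
    return rows * cols / 8 / 1e6

bits = consensus_bits(["ACG", "ACT", "GCT"])
mb_1000g = bitmatrix_megabytes(2123, 2177885)
mb_hapmap = bitmatrix_megabytes(250, 4170000)
```

The two sizes come out at roughly 578 MB and 130 MB, a ratio of about 4.4.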


We just need that giant binary matrix. What is it called? Is there already a math theory
about its properties, manipulation, relation to other objects, ...?
Shouldn't they offer that matrix for download directly? (--> easier)

Last edited by gsgs; 12-11-2012 at 09:05 AM.
Old 12-17-2012, 01:27 AM   #3
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default

There are 445.6 TB (358954 files, 20992 directories) listed in current_tree on the ftp site
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
Most of it (99.6% of the TB) is in the 3 data subdirs and the 3 technical subdirs:

Code:
subdir       files   terabytes
data:
main        153136       114.6
phas         13311        56.7
pilo         35889        13.1
------------------------------
data        202336       184.4

technical:
main        144568       218.9
phas          7899        40.1
pilo           930         0.3
------------------------------
technical   153397       259.3

total       358954       445.6
ftp:/data has 2456 genome names and thus 2456 subdirectories HG00096...NA21144,
each of which has 3 subdirectories: alignment, exome_alignment, sequence_read.
Each genome has between 2 and 3021 files in total in those 3 subdirectories.
82 have more than 288 files
192 have more than 253 files
322 have more than 188 files
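
The subtotals and the 99.6% figure in the table above can be re-checked with a few lines of Python. The numbers are copied from the table; the variable and function names are made up.

```python
# Sketch only: re-add the per-subdir figures, as (files, terabytes).
data = {"main": (153136, 114.6), "phas": (13311, 56.7), "pilo": (35889, 13.1)}
technical = {"main": (144568, 218.9), "phas": (7899, 40.1), "pilo": (930, 0.3)}

def totals(subdirs):
    """Return (total files, total TB rounded to 0.1) for one group of subdirs."""
    files = sum(f for f, _ in subdirs.values())
    terabytes = round(sum(tb for _, tb in subdirs.values()), 1)
    return files, terabytes

data_files, data_tb = totals(data)
tech_files, tech_tb = totals(technical)
share_percent = round(100 * (data_tb + tech_tb) / 445.6, 1)
```

The six subdirs hold 355733 of the 358954 files, so the remaining 3221 files account for only about 0.4% of the volume.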

Last edited by gsgs; 12-17-2012 at 01:31 AM.
Old 12-24-2012, 12:26 PM   #4
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default

I finished downloading and converting the SNP-data
from the 22 chromosomes at
/omni/haplotypes
2123 individuals, 1.1GB compressed, 18.5 GB expanded

SNPs in the 22 available chromosomes:
+ 174234 + 182881 + 153373 + 142614 + 136754 + 136611
+ 121998 + 118385 + 097989 + 112704 + 109763 + 105824
+ 077675 + 072264 + 068091 + 073797 + 064155 + 064176
+ 047536 + 054241 + 030040 + 032780 = 2177885
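
A quick Python check that the 22 per-chromosome counts add up to the stated total (numbers copied from the list above, in the same order; the variable names are made up):

```python
# Sketch only: SNP counts in the order listed above.
snps_per_chromosome = [
    174234, 182881, 153373, 142614, 136754, 136611,
    121998, 118385, 97989, 112704, 109763, 105824,
    77675, 72264, 68091, 73797, 64155, 64176,
    47536, 54241, 30040, 32780,
]
total_snps = sum(snps_per_chromosome)
largest = max(snps_per_chromosome)
```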

2*4.6 GB for the 2*23 fasta files uncompressed, ~0.9GB compressed

(~450000 GB of 1000 Genomes data in total)
Old 01-07-2013, 07:35 AM   #5
rama
Member
 
Location: Boston, USA

Join Date: Jan 2011
Posts: 20
Default

Does anyone have an explanation for why the number of variants differs so much between release/20100804 and 20110521 for a given sample? I looked at the variants listed for sample NA10851 in ftp://ftp.1000genomes.ebi.ac.uk/vol1...lease/20110521 and
ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz, and found 39.7 and 14.6 million variants respectively. I expected to see some differences in genome locations, but was surprised to see such a big difference in the number of calls. Does anyone know what factors, other than the difference in genome build, could account for such a huge difference? Thanks in advance for your kind help.

Last edited by rama; 01-07-2013 at 07:38 AM.
Old 08-28-2017, 11:51 PM   #6
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default

Quote:
Originally Posted by gsgs View Post
Anyway, it seems that you always arrive at such a binary "diversity-matrix",
individuals over SNP positions (all chromosomes),
of size 2123x2177885 bits, 578MB, for 1000 Genomes
of size 250x4170000 bits, 130MB, for HapMap
The bit at position (x,y) is set iff sequence x differs at position y from the consensus
(or average) in one of the 2 diploid alleles, chosen at random.


We just need that giant binary matrix. What is it called? Is there already a math theory
about its properties, manipulation, relation to other objects, ...?
Shouldn't they offer that matrix for download directly? (--> easier)

I found it here:

https://www.researchgate.net/figure/...s-are-compared

"HapZipper", a paper from 2012, but no follow-up :-(

Their purpose is (was) to compress the data, rather than to analyse, manipulate, characterize, search, order or share
them efficiently.


Now you can reorder the rows [individuals] and columns [SNP locations] of that matrix
so as to achieve the smallest sum of differences between neighboring rows and columns,
using some "traveling salesman" algorithm, and then offer that matrix as a giant
zoomable .pdf picture, as a picture of human genetic diversity, to be compared with
other species.
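
A rough Python sketch of the reordering idea follows, using the simple nearest-neighbor heuristic rather than an exact traveling-salesman solver: start with the first row, then keep appending the unused row with the smallest Hamming distance to the last one. It is quadratic in the number of rows, so it is a toy for small inputs only, and the function names are made up.

```python
# Sketch only: greedy nearest-neighbour ordering, a heuristic and not an
# optimal "traveling salesman" solution. Rows are equal-length bit strings.

def hamming(a, b):
    """Number of positions where two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_order(rows):
    """Start at row 0 and repeatedly append the closest unused row."""
    remaining = list(range(1, len(rows)))
    order = [0]
    while remaining:
        last = rows[order[-1]]
        nearest = min(remaining, key=lambda i: hamming(last, rows[i]))
        remaining.remove(nearest)
        order.append(nearest)
    return order

def path_cost(rows, order):
    """Sum of differences between neighboring rows in the given order."""
    return sum(hamming(rows[a], rows[b]) for a, b in zip(order, order[1:]))

rows = ["0000", "1111", "0001", "1110"]
order = greedy_order(rows)
before = path_cost(rows, [0, 1, 2, 3])
after = path_cost(rows, order)
```

On this toy input the sum of neighbor differences drops from 11 to 5. The columns could be ordered the same way by running it on the transposed matrix.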

Who will do it (first) ?


[ I was trying to download genbank's dbSNP data, but couldn't figure out
the format, what to download, how to convert it.
https://en.wikipedia.org/wiki/DbSNP ]

I'd like to have the HapZipper matrix of these 2280 human public domain genomes:
http://www.biorxiv.org/content/early/2017/04/19/127241

Last edited by gsgs; 08-29-2017 at 12:28 AM.