**gsgs** · 12-06-2012, 05:00 PM

I assume now that the 2 values are for the 2 DNA copies (the two chromosomes of a pair), and I take one of them at random
(50%) as an estimate of what a random child of that person might have
at that position, i.e. a random selection of one of the 2 copies in a random body cell

----------edit---------------
yes, confirmed by the hapmap helpdesk
see: http://en.wikipedia.org/wiki/Zygosity
the 2 letters are sorted
-------------------------------------------------------------------------------
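
The random pick described above can be sketched in a few lines of Python. This is only an illustration: the function name `random_allele` and the fixed seed are my own choices, and the genotype is assumed to be a two-letter string such as "AG" as in the hapmap files.

```python
import random

def random_allele(genotype, rng=random):
    """Pick one of the two alleles of an unphased genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    # Each of the two letters is chosen with probability 50%.
    return rng.choice(genotype)

rng = random.Random(0)  # fixed seed so that runs are repeatable
picks = [random_allele("AG", rng) for _ in range(1000)]
print(picks.count("A"), picks.count("G"))
```

For a homozygous genotype like "CC" the pick is of course always "C".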

http://gatkforums.broadinstitute.org/discussion/1607/hapmap-references

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_hapmap_RawHapMapCodec.html

Col12 and on: observed genotypes of samples, one per column, sample identifiers in column
headers (Coriell catalog numbers, example: NA10847). Duplicate samples have .dup suffix.
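
A minimal sketch of reading such a file, assuming whitespace-separated columns with 11 fixed columns before the genotypes, the rs id in column 1, the chromosome in column 3 and the position in column 4. The function name `parse_hapmap` and the example row (including its rs id) are made up for illustration.

```python
FIXED_COLUMNS = 11  # genotypes start at column 12, as described above

def parse_hapmap(lines):
    """Return (sample_ids, records) from the lines of a hapmap genotype file.

    Each record is (rs_id, chrom, pos, genotypes); the genotypes are in the
    same order as sample_ids.
    """
    lines = iter(lines)
    header = next(lines).split()
    sample_ids = header[FIXED_COLUMNS:]
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        rs_id, chrom, pos = fields[0], fields[2], int(fields[3])
        records.append((rs_id, chrom, pos, fields[FIXED_COLUMNS:]))
    return sample_ids, records

# Made-up two-sample example, not real hapmap data.
example = [
    "rs# alleles chrom pos strand assembly# center protLSID assayLSID "
    "panelLSID QCcode NA10847 NA12878",
    "rs0000001 A/G chr8 100123 + ncbi_b36 demo p a s QC+ AG NN",
]
sample_ids, records = parse_hapmap(example)
print(sample_ids, records)
```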

http://gatkforums.broadinstitute.org/discussion/1690/was-a-different-reference-used-in-the-hapmap-samples-vcf-file-in-bundle-1-5

http://hapmap.ncbi.nlm.nih.gov/abouthapmap.html.en

Each person has two copies of all chromosomes except the sex chromosomes.
The set of alleles that a person has is called a genotype. For this SNP a person
could have the genotype AA, AG, or GG. (See http://www.dnaftb.org/dnaftb/ for
basic genetics information.) The term genotype can refer to the SNP alleles that
a person has at a particular SNP, or for many SNPs across the genome.
A method that discovers what genotype a person has is called genotyping.

About 10 million SNPs exist in human populations, where the rarer SNP allele has a
frequency of at least 1%

------------------------edit------------------------------
now I found a short description here:

hapmap [master's thesis, Heidi Lischer]

http://heidi.chnebu.ch/doku.php?id=hapmap

**gsgs** · 12-06-2012, 06:23 PM

once I have succeeded in downloading and converting the data into a better computer-readable
format, so that it can be used more easily, I'll make it freely available to other researchers
on memory chips (micro-SD), to save them the trouble of downloading and converting,
which took me ~ 20 hours

**gsgs** · 12-07-2012, 01:13 AM

ftp://ftp.ncbi.nlm.nih.gov/hapmap/ge...I+III/forward/
( 14 google hits for that URL on Dec.07.2012 )

275=25(chromosomes)*11(populations) files

1417 samples = people.
Per population :
ASW,87,CEU,173,CHB,139,CHD,109,GIH,101,JPT,116,LWK,110,MEX,86,MKK,184,TSI,102,YRI,209

4170261 different positions in total
Per chromosome : (files f:\hapmap\gt\lp01,...,lp_m)
01,326089,02,337596,03,268036,04,256273,05,257919
06,282282,07,224360,08,224816,09,190885,10,218893
11,213020,12,204336,13,162980,14,128559,15,111853
16,115419,17,095494,18,124015,19,062364,20,124241
21,054065,22,059327,_x,126193,_y,001032,_m,000214
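
As a quick arithmetic check, the per-chromosome counts can be added up to see that they match the stated total. The numbers are copied from the list above; the variable names are mine.

```python
# Positions per chromosome, as listed above (files lp01 ... lp_m).
per_chromosome = {
    "01": 326089, "02": 337596, "03": 268036, "04": 256273, "05": 257919,
    "06": 282282, "07": 224360, "08": 224816, "09": 190885, "10": 218893,
    "11": 213020, "12": 204336, "13": 162980, "14": 128559, "15": 111853,
    "16": 115419, "17": 95494, "18": 124015, "19": 62364, "20": 124241,
    "21": 54065, "22": 59327, "_x": 126193, "_y": 1032, "_m": 214,
}

total_positions = sum(per_chromosome.values())
n_files = len(per_chromosome) * 11  # 25 chromosomes times 11 populations

print(total_positions)  # 4170261
print(n_files)          # 275
```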

-----------edit----------
these numbers are now also found at

http://www.uppmax.uu.se/aggregator/sources/2?order=value&sort=asc&page=4

which seems to be just copying stuff from seqanswers

**gsgs** · 12-08-2012, 09:06 PM

I finished downloading the 25*11 files from that directory above
and converted them to fasta files, one for each chromosome,
with 1417 sequences each.
Only the listed positions are included, i.e. those where SNPs occur.
I took the first nucleotide of the 2 in each pair.
Positions not listed for one of the 11 populations,
or "N" nucleotides, are "-" in the fastas.
5.9 GB for the 25 files, they typically compress at a rate of ~30%,
so no size reduction yet as compared to the download.
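
The conversion rule above (first nucleotide of each pair, "-" for "N" or for positions that are not listed) could look roughly like this. It is a sketch, not the converter that was actually used; the function name `to_fasta`, the nested-dict input layout and the sample names are assumptions.

```python
def to_fasta(sample_ids, positions, genotypes_by_position):
    """Build one FASTA record per sample.

    genotypes_by_position maps position -> {sample_id: genotype}, where a
    genotype is a two-letter string such as 'AG' or 'NN'.
    """
    records = []
    for sample_id in sample_ids:
        letters = []
        for pos in positions:
            # A position missing for this sample is treated like 'NN'.
            genotype = genotypes_by_position.get(pos, {}).get(sample_id, "NN")
            first = genotype[0] if genotype else "N"
            letters.append("-" if first == "N" else first)
        records.append(f">{sample_id}\n{''.join(letters)}")
    return "\n".join(records) + "\n"

# Made-up toy data: two samples, two positions.
toy = {100: {"NA1": "AG", "NA2": "NN"}, 200: {"NA1": "CC"}}
fasta = to_fasta(["NA1", "NA2"], [100, 200], toy)
print(fasta)
```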

Most of the 1417 samples only have sparse data, only ~250 have
more data, so it makes sense to only consider those.

I made a mutation picture of one randomly chosen region, chromosome 8, positions
100000-101200. Only the 251 nonsparse sequences were used,
those that have more than 999 nucleotides available out of the 1200.
(--> 81 CEU,43 CHB,42 JPT,84 YRI)
Each difference from the average gives a black pixel,
rows are sequences, columns are positions.
That gives the first picture of the 3 below.
Then the sequences (rows) are sorted to improve the clustering
- picture 2
Then the positions (columns) are sorted to improve the clustering
- picture 3
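
The three pictures can be sketched as follows. Two things here are assumptions and not necessarily what was done for the GIFs: the "average" is taken to be the most common nucleotide at each position, and the sorting is a plain lexicographic sort of rows and then columns, as a simple stand-in for whatever ordering improves the clustering best. Missing data ("-") never counts as a difference.

```python
from collections import Counter

def nonsparse(seqs, min_available):
    """Keep sequences with more than min_available non-missing letters."""
    return [s for s in seqs if sum(c != "-" for c in s) > min_available]

def consensus(seqs):
    """Most common nucleotide per column, ignoring missing data."""
    result = []
    for column in zip(*seqs):
        counts = Counter(c for c in column if c != "-")
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return result

def difference_matrix(seqs):
    """1 (black pixel) where a letter differs from the consensus, else 0."""
    cons = consensus(seqs)
    return [[int(c != "-" and c != k) for c, k in zip(s, cons)] for s in seqs]

def sort_rows_then_columns(matrix):
    """Picture 2 is the row sort, picture 3 adds the column sort."""
    rows = sorted(matrix)
    columns = sorted(zip(*rows))
    return [list(row) for row in zip(*columns)]

# Made-up toy alignment: 4 sequences, 4 positions.
seqs = nonsparse(["AAGT", "AAGT", "CAGA", "A-GT", "----"], 2)
picture1 = difference_matrix(seqs)
picture3 = sort_rows_then_columns(picture1)
print(picture1)
print(picture3)
```

Writing the 0/1 matrix to an image file is left out; any bitmap writer would do.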

usually without recombination/crossover you would only see a few
thick black bars and rectangles as in this picture from human mtDNA:

http://magictour.free.fr/seq/mitogi3.GIF

Attached Files

chr08a.gif (73.3 KB, 13 views)

**gsgs** · 12-09-2012, 10:17 PM

my keyword search led me
to the concept of "linkage disequilibrium (LD)" to calculate mutation-age in
> Recent acceleration of human adaptive evolution (2007)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2410101/

they also talk about the hapmap-data
> The age of a mutation can be estimated from the decay of linkage
> disequilibrium with flanking or intragenic polymorphisms because of
> recombination and mutation and from the frequency of the mutation
> itself as a consequence of genetic drift. Several methods have been
> proposed, and the results from their applications can be combined
> with population data to provide a critical view of the origin and natural
> history of the mutation.
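
To make the idea of "decay of linkage disequilibrium" concrete: in the simplest textbook model, LD between two loci shrinks by a factor (1 - r) per generation, where r is the recombination fraction, so D_t = D_0 * (1 - r)^t. Solving for t gives a crude age estimate. This is only that textbook relation, not the method of the paper quoted above, and the function name is mine.

```python
import math

def generations_from_ld_decay(d0, dt, r):
    """Solve dt = d0 * (1 - r)**t for t (number of generations).

    d0: initial disequilibrium, dt: disequilibrium observed today,
    r: recombination fraction per generation between the two loci.
    """
    if not 0 < r < 1:
        raise ValueError("r must be between 0 and 1")
    if d0 <= 0 or not 0 < dt <= d0:
        raise ValueError("need 0 < dt <= d0")
    return math.log(dt / d0) / math.log(1 - r)

# LD fallen to a quarter of its initial value with r = 0.5: 2 generations.
print(generations_from_ld_decay(1.0, 0.25, 0.5))
```

Real estimates also have to deal with mutation, drift and uncertainty in r, as the quote says.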

this paper is cited by:
Recent acceleration of human adaptive evolution

The Evolution of Homophily
Systematic underestimation of the age of selected alleles
Exploring metazoan evolution through dynamic and holistic changes in protein families and domains
Natural Selection Affects Multiple Aspects of Genetic Variation at Putatively Neutral Sites across the Human Genome
Reproductive Benefit of Oxidative Damage: An Oxidative Stress “Malevolence”?
Crohn’s Disease and Genetic Hitchhiking at IBD5
Darwin in Mind: New Opportunities for Evolutionary Psychology
Contributions of Dopamine-Related Genes and Environmental Factors to Highly Sensitive Personality: A Multi-Step Neuronal System-Level Approach
The emergence of human-evolutionary medical genomics
The Evolution of Autistic-Like and Schizotypal Traits: A Sexual Selection Hypothesis
Personality and reproductive success in a high-fertility human population
Climate change: Heat, health, and longer horizons
Gene-culture coevolution in the age of genomics
99th Dahlem Conference on Infection, Inflammation and Chronic Inflammatory Disorders: Darwinian medicine and the ‘hygiene’ or ‘old friends’ hypothesis
The Genetics of Human Adaptation: Hard Sweeps, Soft Sweeps, and Polygenic Adaptation
Consanguinity, human evolution, and complex diseases
Detecting positive selection from genome scans of linkage disequilibrium
Measuring cis-acting regulatory variants genome-wide: new insights into expression genetics and disease susceptibility
Genetic Variation and Recent Positive Selection in Worldwide Human Populations: Evidence from Nearly 1 Million SNPs
Evolutionary genomics of human intellectual disability
The Role of Geography in Human Adaptation
Signals of recent positive selection in a worldwide sample of human populations
Review series on helminths, immune modulation and the hygiene hypothesis: The broader implications of the hygiene hypothesis
Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow
Are humans still evolving?: Technological advances and unique biological characteristics allow us to adapt to environmental stress. Has this stopped genetic evolution?
The Dawn of Human Matrilineal Diversity
Reconstructing phylogenies and phenotypes: a molecular view of human evolution
AFRICAN GENETIC DIVERSITY: Implications for Human Demographic History, Modern Human Origins, and

a lot to read ... :-(

most recent: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431732/

**gsgs** · 12-16-2012, 06:26 PM

I found this article:

http://nar.oxfordjournals.org/content/early/2011/11/17/nar.gkr1042.full

to learn about (keywords)
haplotypes
Fosmid-based haplotyping
Single Individual Haplotyping (SIH)
HapMap trio child, NA12878
heterozygous SNPs
phase
diploid genomics
regions with low linkage disequilibrium.
unable to phase variants for which both parents are heterozygous (∼20% of SNPs).

and why the 2 alleles at the positions (=haplotypes)
are sorted in the hapmap-format but unsorted in
the 1000-genomes/omni2123 data.
----------------------------------------------------------------------
Human individuals are diploid, with each somatic cell containing two sets of chromosomes,
one from each parent.

NA12891 (Father); NA12892 (Mother); NA12878 (Daughter).
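
Why a trio cannot phase a SNP where everybody is heterozygous can be seen in a small sketch. Genotypes are assumed to be two-letter strings, the function name `phase_child` is mine, and this only looks at one position at a time (real phasing tools use neighbouring SNPs as well).

```python
def phase_child(child, father, mother):
    """Return (paternal_allele, maternal_allele) for the child, or None.

    None means the position cannot be phased from the trio alone, either
    because both assignments are possible or because none is.
    """
    a, b = child
    options = set()
    # Try both ways of assigning the child's two alleles to the parents.
    for paternal, maternal in ((a, b), (b, a)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    if len(options) == 1:
        return options.pop()
    return None

print(phase_child("AG", "AA", "AG"))  # ('A', 'G'): the G must come from the mother
print(phase_child("AG", "AG", "AG"))  # None: both parents heterozygous
```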

**gsgs** · 12-19-2012, 09:35 PM

from the data I conclude that males usually only have
one allele=haplotype=phase in the X-chromosome.
Or only one X, (paired with Y) not 2.
Or the 2 are similar, both from the mother, whatever.
I never learned it, may have to look it up
(any good short source recommended?
wikipedia is usually good, IMO)

Anyway, no phasing problem in male X.
So I took the 132 full male Xes from hapmap
and made my mutation graphics

http://magictour.free.fr/seq/_xx.GIF

only positions where data from >32 of the 132 are available,
I took the 1st allele, but the 2nd is usually (99.3%) the same.
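
Both the position filter and the "2nd allele is usually the same" check are easy to express in code. A sketch, assuming two-letter genotypes with "N" for missing calls; the function names and the toy input are made up.

```python
def is_called(genotype):
    """True for a complete two-letter genotype without 'N'."""
    return len(genotype) == 2 and "N" not in genotype

def keep_position(genotypes, min_called=32):
    """Keep a position only if more than min_called samples have data."""
    return sum(is_called(g) for g in genotypes) > min_called

def same_allele_fraction(genotypes):
    """Fraction of called genotypes whose two alleles are identical."""
    calls = [g for g in genotypes if is_called(g)]
    if not calls:
        return float("nan")
    return sum(g[0] == g[1] for g in calls) / len(calls)

toy = ["AA", "GG", "AG", "NN"]  # made-up male X genotypes at one position
print(keep_position(toy), same_allele_fraction(toy))
```

Run over all male X genotypes, `same_allele_fraction` would be the number quoted above as 99.3%.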

----------------------------------------

picture of genetic diversity of the X-chromosome

132 male hapmap individuals
39 ceu ,22 chb , 22 jpt ,49 yri (in this order, top to bottom in each of the
163 groups in the picture)
you can print it on 21 sheets of paper
163*720 ≈ 117000 SNPs

black pixel, iff the nucleotide is different from the average of the 132
at that position. Orange pixel otherwise.
individuals are rows, positions are columns

**gsgs** · 12-27-2012, 11:25 PM

hah, it seems that they do have phasing information !

http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2009-02_phaseIII/HapMap3_r2/hapmap3_r2_phasing_summary.doc

why did no one tell me ...

When I started this thread I had no idea about phasing, that humans
have two sets of DNA in each cell, one from father, one from mother.

It was also confusing that they use "phase" for the
time periods of the project as well.

And "haplotypes" for these 2 parental sets of DNA in the cell nuclei
as well as for the population groups ?

Anyway.

They have 2022 "Total haplotypes in phased files" and (of course ?!?!)
we want them in that big binary array so we can work with it.
I hereby call it "the canonical SNP matrix (CSM)" of a set of DNAs,
until I find an existing name or a better name.

I doubt that I can finish the conversion this year, but whoever wants it
can get the 500MB that I already have.
250*4170261 unphased (random) CSM from hapmap
2123*2177885 phase 1 (=haplotype 1) CSM from 1000-genomes/omni/haplotypes
2123*2177885 phase 2 (=haplotype 2) [=mother ?] CSM as above
positions,statistics,... to be completed
~550 MB in total compressed with 7zip on Win32 micro SD
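
For a feeling of the sizes involved, here is a sketch that packs a 0/1 row into bytes and computes the raw size of each matrix. It assumes one bit per cell (e.g. 1 = differs from the most common nucleotide), which is only one possible encoding of a "binary array" and may not be how the files above are stored.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, most significant bit first."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)

def raw_megabytes(rows, columns):
    """Uncompressed size in MB (10**6 bytes) at one bit per cell."""
    return rows * columns / 8 / 1e6

print(pack_bits([1, 0, 0, 0, 0, 0, 0, 1, 1]))  # b'\x81\x80'
print(raw_megabytes(250, 4170261))    # hapmap, unphased
print(raw_megabytes(2123, 2177885))   # 1000-genomes/omni, per haplotype
```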

-------------update----------------
it seems that we get another factor of (at least) 2 with compressing these files,
so divide all these compressed MB by 2 !
So, expect a compression rate of 1% (1:100) or better for typical, big
*.vcf files of human DNA SNPs

---------------edit 2012.12.30-------------------
looking at the amount of hapmap data in the /phased/phaseIII directories,
I think it's probably less than 1000-genomes data, which is also phased.
So it is maybe not worth downloading and converting, which looks like
a lot of work because of the multiple directories and files for each population
and relatives type. E.g.
chr21, YRI , ~130haplotypes*19306SNPs
while the hapmap data earlier in this thread has 84*51198, although unphased.
And 1000-genomes has 2*2123*30000 pixels in the CSM for YRI,chr21
some of which could be redundant (parent and child) though
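
The comparison for YRI, chr21 in plain numbers of matrix cells, using the figures quoted above (the ~130 and the 30000 are rough values from the post, so the results are rough too):

```python
# rows * columns for each data source, YRI, chr21
cells = {
    "hapmap phased (phaseIII)": 130 * 19306,
    "hapmap unphased (this thread)": 84 * 51198,
    "1000-genomes, both haplotypes": 2 * 2123 * 30000,
}

for name, n in cells.items():
    print(f"{name}: {n}")
```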

hapmap format
