  • gsgs
    Senior Member
    • Oct 2009
    • 139

    hapmap format

    I can't find this ...


    in those typical hapmap files from
    ftp://ftp.ncbi.nlm.nih.gov/hapmap/ge...I+III/forward/


    like:
    (blanks replaced with comma)

    Code:
    rs#,alleles,chrom,pos,strand,assembly#,center,protLSID,assayLSID,panelLSID,QCcode,NA06984,...,NA12892
    rs28412942,A/T,chrM,410,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8575126:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,NN,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,AA,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,AA,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs3937039,A/G,chrM,665,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt663:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,AA,NN,NN,AA,AA,AA,NN,NN,AA,NN,AA,AA,AA,NN,AA,NN,NN,AA,NN,AA,AA,AA,NN,NN,AA,NN,AA,NN,AA,AA,AA,NN,NN,AA,AA,NN,NN,NN,AA,AA,NN,AA,NN,NN,AA,AA,AA,AA,AA,AA,AA,AA,NN,NN,AA,AA,AA,AA,AA,AA,NN,AA,AA,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,NN,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,AA,NN,AA,AA,NN,NN,AA,NN,NN,AA,AA,NN,AA,AA,AA,AA,AA,NN,NN,NN,NN,NN,NN,AA,AA,AA,AA,AA,AA,NN,NN,NN,NN,NN,NN,NN,NN,NN,AA,AA,AA,AA,AA,AA,NN,AA,NN,NN,AA,AA
    rs2853517,A/G,chrM,711,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt709:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,GG,NN,NN,GG,GG,GG,NN,NN,AA,NN,AA,GG,AA,NN,GG,NN,NN,NN,NN,GG,AA,GG,NN,NN,GG,NN,GG,NN,GG,GG,GG,NN,NN,GG,GG,NN,NN,NN,GG,GG,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,GG,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,AA,GG,AA,GG,AA,GG,AA,GG,NN,GG,GG,GG,GG,GG,GG,AA,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,NN,GG,GG,NN,NN,GG,NN,NN,AA,GG,GG,AA,GG,GG,AA,AA,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,NN,NN,GG,GG
    rs28358568,C/T,chrM,712,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt710:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs2853519,A/G,chrM,771,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574695:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,GG,NN,NN,GG,GG,GG,NN,NN,GG,NN,GG,GG,GG,NN,GG,NN,NN,GG,NN,GG,GG,GG,NN,NN,GG,NN,GG,NN,GG,GG,GG,NN,NN,GG,GG,NN,NN,NN,GG,GG,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,NN,GG,GG,NN,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,NN,NN,GG,GG
    rs2853520,A/T,chrM,827,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt825:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs28358570,C/T,chrM,923,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574945:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs2856982,A/G,chrM,1020,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574722:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,GG,NN,NN,GG,GG,GG,NN,NN,GG,NN,GG,GG,GG,NN,GG,NN,NN,GG,NN,GG,GG,GG,NN,NN,GG,NN,GG,NN,GG,GG,GG,NN,NN,GG,GG,NN,NN,NN,GG,GG,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,NN,GG,GG,NN,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,NN,NN,GG,GG
    rs2000974,C/T,chrM,1050,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574535:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,CC,NN,NN,CC,CC,CC,NN,NN,CC,NN,CC,CC,CC,NN,CC,NN,NN,CC,NN,CC,CC,CC,NN,NN,CC,NN,CC,NN,CC,CC,CC,NN,NN,CC,CC,NN,NN,NN,CC,CC,NN,CC,NN,NN,CC,CC,CC,CC,CC,CC,CC,CC,NN,NN,CC,CC,CC,CC,CC,CC,NN,CC,CC,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,NN,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,CC,NN,CC,CC,NN,NN,CC,NN,NN,CC,CC,CC,CC,CC,CC,CC,CC,NN,NN,NN,NN,NN,NN,CC,CC,CC,CC,CC,CC,NN,NN,NN,NN,NN,NN,NN,NN,NN,CC,CC,CC,CC,CC,CC,NN,CC,NN,NN,CC,CC
    rs28358571,C/T,chrM,1191,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt1189:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,CC,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,CC,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs28358572,C/T,chrM,1245,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt1243:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,CC,TT,TT,NN,TT,NN,NN,NN,NN,TT,CC,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    ...

    What do the double-letter entries mean?
    Does that sample (= person) have one of the 2 letters at that position in that chromosome? Which of the two?


    ACGT
    (N=not available)
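
For readers who want to work with such rows, here is a minimal parsing sketch. It assumes the comma-separated form shown in the excerpt (the original files use blanks) and 11 metadata columns before the first sample column, as in the header line above. The function name `parse_row` and the shortened example row are made up for illustration.

```python
# Sketch: split one HapMap genotype row into metadata and per-sample genotypes.
# Assumption: 11 metadata columns (rs#, alleles, chrom, pos, strand, assembly#,
# center, protLSID, assayLSID, panelLSID, QCcode), then one genotype per sample.

N_META = 11


def parse_row(line, sep=","):
    """Return a dict with a few metadata fields and the list of genotypes."""
    fields = line.rstrip("\n").split(sep)
    meta, genotypes = fields[:N_META], fields[N_META:]
    return {
        "rs": meta[0],
        "alleles": meta[1].split("/"),
        "chrom": meta[2],
        "pos": int(meta[3]),
        "genotypes": genotypes,
    }


# Shortened, hypothetical row with only four samples and placeholder LSIDs
row = parse_row("rs28412942,A/T,chrM,410,+,ncbi_B36,affymetrix,p,a,l,QC+,NN,TT,AA,TT")
print(row["rs"], row["pos"], row["genotypes"])
```

Each genotype is a 2-letter string over A, C, G, T, with N marking a missing call.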


    more keywords for google:

    98,185,150,120,112,127,121,97,195,113,220 total=1417
    of such entries in the 11 populations
    asw,ceu,chb,chd,gih,jpt,lwk,mex,mkk,tsi,yri
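
The post does not say how the per-population numbers relate to the total of 1417. One reading, which is an assumption here and not stated in the post, is that each number counts the fields per row, including the 11 metadata columns. Under that reading the arithmetic works out:

```python
# Assumption: each number is the field count of a row in that population's file,
# i.e. 11 metadata columns plus one column per sample.
fields_per_row = {
    "asw": 98, "ceu": 185, "chb": 150, "chd": 120, "gih": 112, "jpt": 127,
    "lwk": 121, "mex": 97, "mkk": 195, "tsi": 113, "yri": 220,
}
N_META = 11

samples = {pop: n - N_META for pop, n in fields_per_row.items()}
total_fields = sum(fields_per_row.values())
total_samples = sum(samples.values())
print(total_fields, total_samples)  # 1538 1417
```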


    found this comparison, Mar,2012, 1000 genomes vs. hapmap
    Since publication of the human genome in 2003, geneticists have been interested in risk variant associations to resolve the etiology of traits and complex diseases. The International HapMap Consortium undertook an effort to catalog all common ...


    found another thread that counts ~15M SNPs via hapmap
    but I get in that directory ~4.3M SNP-positions only

    so far I have SNP-positions:
    20:124241
    01:326089
    02:337596
    21:54065
    22:59327
    10:218893
    _x:126192
    _y:1032
    _m:214

    no google hits with these numbers
    do I really need all chromosomes?
    Last edited by gsgs; 12-10-2012, 11:19 AM.
  • gsgs
    Senior Member
    • Oct 2009
    • 139

    #2
    I now assume that the 2 values are for the 2 copies of the chromosome (one from each parent),
    and I take one of them at random (50%) as an estimate of what a random child of that person
    might have at that position, or what a randomly chosen copy in a random body cell carries

    ----------edit---------------
    yes, confirmed by the hapmap helpdesk
    see: http://en.wikipedia.org/wiki/Zygosity
    the 2 letters are sorted
    -------------------------------------------------------------------------------
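
Picking one of the two alleles at random, as described in this post, could look like the sketch below. The helper name `random_allele` is made up, and treating any genotype containing N as missing is an assumption. Because the two letters are sorted, their order says nothing about which parent an allele came from, so a uniform pick is the natural choice.

```python
import random


def random_allele(genotype, rng=random):
    """Return one of the two alleles with probability 1/2 each, or 'N' if missing."""
    if len(genotype) != 2 or "N" in genotype:
        return "N"
    return rng.choice(genotype)


rng = random.Random(0)  # fixed seed only to make the example repeatable
picks = [random_allele(g, rng) for g in ["AG", "TT", "NN", "AG"]]
print(picks)
```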





    Col12 and on: observed genotypes of samples, one per column, sample identifiers in column
    headers (Coriell catalog numbers, example: NA10847). Duplicate samples have .dup suffix.






    Each person has two copies of all chromosomes except the sex chromosomes.
    The set of alleles that a person has is called a genotype. For this SNP a person
    could have the genotype AA, AG, or GG. (See http://www.dnaftb.org/dnaftb/ for
    basic genetics information.) The term genotype can refer to the SNP alleles that
    a person has at a particular SNP, or for many SNPs across the genome.
    A method that discovers what genotype a person has is called genotyping.

    About 10 million SNPs exist in human populations, where the rarer SNP allele has a
    frequency of at least 1%
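
The 1% threshold refers to the frequency of the rarer allele. A small sketch of how that frequency could be computed from a row of 2-letter genotypes follows. It assumes a biallelic SNP and diploid calls, and it skips missing (N) entries. The function names are made up for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Count every allele in every called genotype and return relative frequencies."""
    counts = Counter()
    for g in genotypes:
        if "N" in g:
            continue  # missing call
        counts.update(g)  # adds both letters of the genotype
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()} if total else {}


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele; 0.0 if only one allele is observed."""
    freqs = allele_frequencies(genotypes)
    return min(freqs.values()) if len(freqs) >= 2 else 0.0


maf = minor_allele_frequency(["AA", "AG", "GG", "GG", "NN"])
print(maf)  # 3 A alleles out of 8 called alleles
```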

    ------------------------edit------------------------------
    now I found a short description here:
    Last edited by gsgs; 12-10-2012, 11:18 AM.

    Comment

    • gsgs
      Senior Member
      • Oct 2009
      • 139

      #3
      Once I have managed to download and convert the data into a more computer-readable
      format that is easier to use, I'll make it freely available to other researchers
      on memory chips (micro-SD), to save them the trouble of downloading and converting,
      which took me ~20 hours.

      Comment

      • gsgs
        Senior Member
        • Oct 2009
        • 139

        #4
        ftp://ftp.ncbi.nlm.nih.gov/hapmap/ge...I+III/forward/
        ( 14 google hits for that URL on Dec.07.2012 )

        275=25(chromosomes)*11(populations) files

        1417 samples = people.
        Per population :
        ASW,87,CEU,173,CHB,139,CHD,109,GIH,101,JPT,116,LWK,110,MEX,86,MKK,184,TSI,102,YRI,209

        4170261 different positions in total
        Per chromosome : (files f:\hapmap\gt\lp01,...,lp_m)
        01,326089,02,337596,03,268036,04,256273,05,257919
        06,282282,07,224360,08,224816,09,190885,10,218893
        11,213020,12,204336,13,162980,14,128559,15,111853
        16,115419,17,095494,18,124015,19,062364,20,124241
        21,054065,22,059327,_x,126193,_y,001032,_m,000214
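
The totals above can be checked with a few lines of arithmetic. The numbers are copied from the lists in this post (leading zeros dropped) and nothing else is assumed. As listed, the per-chromosome counts add up to the stated 4170261, while the per-population sample counts add up to 1416, one fewer than the stated 1417.

```python
# Per-chromosome SNP position counts as listed in the post.
positions = {
    "01": 326089, "02": 337596, "03": 268036, "04": 256273, "05": 257919,
    "06": 282282, "07": 224360, "08": 224816, "09": 190885, "10": 218893,
    "11": 213020, "12": 204336, "13": 162980, "14": 128559, "15": 111853,
    "16": 115419, "17": 95494, "18": 124015, "19": 62364, "20": 124241,
    "21": 54065, "22": 59327, "_x": 126193, "_y": 1032, "_m": 214,
}
# Per-population sample counts as listed in the post.
samples = {
    "ASW": 87, "CEU": 173, "CHB": 139, "CHD": 109, "GIH": 101, "JPT": 116,
    "LWK": 110, "MEX": 86, "MKK": 184, "TSI": 102, "YRI": 209,
}

total_positions = sum(positions.values())
n_files = len(positions) * len(samples)  # one file per chromosome and population
total_samples = sum(samples.values())
print(total_positions, n_files, total_samples)  # 4170261 275 1416
```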

        -----------edit----------
        these numbers are now also found at

        who seems to be just copying stuff from seqanswers
        Last edited by gsgs; 12-16-2012, 01:03 AM.

        Comment

        • gsgs
          Senior Member
          • Oct 2009
          • 139

          #5
          I finished downloading the 25*11 files from that directory above
          and converted them to fasta files, one for each chromosome,
          with 1417 sequences each.
          Only the listed positions are included, i.e. those where SNPs occur.
          I took the first nucleotide of each 2-letter pair.
          Positions not listed for one of the 11 populations,
          and "N" nucleotides, are "-" in the fastas.
          5.9 GB for the 25 files, they typically compress at a rate of ~30%,
          so no size reduction yet as compared to the download.
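
The conversion described here can be sketched roughly as follows. This is not the script that was actually used. It assumes the genotype rows are already parsed and aligned to one list of sample IDs, and it does not handle merging the 11 population files or filling in positions missing from one population. The sample IDs `S1` and `S2` are placeholders.

```python
def rows_to_fasta(sample_ids, rows):
    """Build one FASTA record per sample from per-position genotype lists.

    rows: one list of 2-letter genotypes per SNP position, in sample_ids order.
    The first letter of each genotype is used; 'N' becomes '-'.
    """
    seqs = {s: [] for s in sample_ids}
    for genotypes in rows:
        for s, g in zip(sample_ids, genotypes):
            first = g[0] if g else "N"
            seqs[s].append("-" if first == "N" else first)
    return "".join(f">{s}\n{''.join(seq)}\n" for s, seq in seqs.items())


fasta = rows_to_fasta(["S1", "S2"], [["NN", "TT"], ["AA", "AG"], ["GG", "NN"]])
print(fasta)
```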

          Most of the 1417 samples only have sparse data, only ~250 have
          more data, so it makes sense to only consider those.

          I took one random mutation picture, chromosome 8, positions
          100000-101200, only the 251 nonsparse sequences were taken
          that have more than 999 nucleotides available out of the 1200.
          (--> 81 CEU,43 CHB,42 JPT,84 YRI)
          Each difference from the average gives a black pixel,
          rows are sequences, columns are positions.
          That gives the first picture of the 3 below.
          Then the sequences (rows) are sorted to improve the clustering
          - picture 2
          Then the positions (columns) are sorted to improve the clustering
          - picture 3
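
A rough sketch of how such a picture matrix could be built is given below. Two assumptions are made that the post does not confirm: "average" is read as the most common nucleotide in each column, and missing entries ("-") are never drawn as differences. The sorting step uses a plain lexicographic sort as a simple stand-in, since the actual clustering method is not described.

```python
from collections import Counter


def difference_matrix(seqs):
    """Return rows of 0/1; 1 where a sequence differs from its column's most common nucleotide."""
    consensus = []
    for col in zip(*seqs):
        known = [c for c in col if c != "-"]
        consensus.append(Counter(known).most_common(1)[0][0] if known else "-")
    return [[int(c != "-" and c != k) for c, k in zip(s, consensus)] for s in seqs]


seqs = ["ACGT", "ACGT", "ACGA", "TCGA", "AC-T"]
matrix = difference_matrix(seqs)

# Stand-in for "sorting to improve the clustering": identical rows become adjacent.
sorted_rows = sorted(matrix)
# Same idea for columns: transpose, sort, transpose back.
sorted_cols = [list(r) for r in zip(*sorted(zip(*sorted_rows)))]

for r in sorted_cols:
    print("".join("#" if v else "." for v in r))
```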

          usually without recombination/crossover you would only see a few
          thick black bars and rectangles as in this picture from human mtDNA:
          Attached Files
          Last edited by gsgs; 12-08-2012, 09:31 PM.

          Comment

          • gsgs
            Senior Member
            • Oct 2009
            • 139

            #6
            my keyword search led me
            to the concept of "linkage disequilibrium (LD)" to calculate mutation-age in
            > Recent acceleration of human adaptive evolution (2007)
            Genomic surveys in humans identify a large amount of recent positive selection. Using the 3.9-million HapMap SNP dataset, we found that selection has accelerated greatly during the last 40,000 years. We tested the null hypothesis that the observed ...


            they also talk about the hapmap-data
            > The age of a mutation can be estimated from the decay of linkage
            > disequilibrium with flanking or intragenic polymorphisms because of
            > recombination and mutation and from the frequency of the mutation
            > itself as a consequence of genetic drift. Several methods have been
            > proposed, and the results from their applications can be combined
            > with population data to provide a critical view of the origin and natural
            > history of the mutation.


            this paper is cited by:
            Recent acceleration of human adaptive evolution

            The Evolution of Homophily
            Systematic underestimation of the age of selected alleles
            Exploring metazoan evolution through dynamic and holistic changes in protein families and domains
            Natural Selection Affects Multiple Aspects of Genetic Variation at Putatively Neutral Sites
            ---------across the Human Genome
            Reproductive Benefit of Oxidative Damage: An Oxidative Stress “Malevolence”?
            Crohn’s Disease and Genetic Hitchhiking at IBD5
            Darwin in Mind: New Opportunities for Evolutionary Psychology
            Contributions of Dopamine-Related Genes and Environmental Factors to Highly Sensitive
            ---------Personality: A Multi-Step Neuronal System-Level Approach
            The emergence of human-evolutionary medical genomics
            The Evolution of Autistic-Like and Schizotypal Traits: A Sexual Selection Hypothesis
            Personality and reproductive success in a high-fertility human population
            Climate change: Heat, health, and longer horizons
            Gene-culture coevolution in the age of genomics
            99th Dahlem Conference on Infection, Inflammation and Chronic Inflammatory Disorders:
            ----------Darwinian medicine and the ‘hygiene’ or ‘old friends’ hypothesis
            The Genetics of Human Adaptation: Hard Sweeps, Soft Sweeps, and Polygenic Adaptation
            Consanguinity, human evolution, and complex diseases
            Detecting positive selection from genome scans of linkage disequilibrium
            Measuring cis-acting regulatory variants genome-wide: new insights into expression genetics
            ---------and disease susceptibility
            Genetic Variation and Recent Positive Selection in Worldwide Human Populations:
            ------Evidence from Nearly 1 Million SNPs
            Evolutionary genomics of human intellectual disability
            The Role of Geography in Human Adaptation
            Signals of recent positive selection in a worldwide sample of human populations
            Review series on helminths, immune modulation and the hygiene hypothesis: The broader
            ------implications of the hygiene hypothesis
            Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow
            Are humans still evolving?: Technological advances and unique biological characteristics
            ----allow us to adapt to environmental stress. Has this stopped genetic evolution?
            The Dawn of Human Matrilineal Diversity
            Reconstructing phylogenies and phenotypes: a molecular view of human evolution
            AFRICAN GENETIC DIVERSITY: Implications for Human Demographic History, Modern
            -------------Human Origins, and


            a lot to read ... :-(

            most recent: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431732/
            Last edited by gsgs; 12-09-2012, 10:41 PM.

            Comment

            • gsgs
              Senior Member
              • Oct 2009
              • 139

              #7
              I found this article:

              to learn about (keywords)
              haplotypes
              Fosmid-based haplotyping
              Single Individual Haplotyping (SIH)
              HapMap trio child, NA12878
              heterozygous SNPs
              phase
              diploid genomics
              regions with low linkage disequilibrium.
              unable to phase variants for which both parents are heterozygous (∼20% of SNPs).

              and why the 2 alleles at the positions (=haplotypes)
              are sorted in the hapmap-format but unsorted in
              the 1000-genomes/omni2123 data.
              ----------------------------------------------------------------------
              Human individuals are diploid, with each somatic cell containing two sets of chromosomes,
              one from each parent.

              NA12891 (Father); NA12892 (Mother); NA12878 (Daughter).
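
To illustrate why some SNPs cannot be phased from a trio, here is a minimal sketch of trio-based phasing for a single SNP. It assumes unphased 2-letter genotypes for father, mother and child, and the function name `phase_child` is made up. The unresolvable case is the one where the child and both parents are heterozygous.

```python
def phase_child(father, mother, child):
    """Return (paternal_allele, maternal_allele) for the child, or None if undetermined."""
    if any("N" in g for g in (father, mother, child)):
        return None  # missing data
    a, b = child
    options = set()
    # Try both assignments of the child's two alleles to the parents.
    for pat, mat in ((a, b), (b, a)):
        if pat in father and mat in mother:
            options.add((pat, mat))
    # Exactly one consistent assignment means the SNP is phased.
    return options.pop() if len(options) == 1 else None


print(phase_child("AA", "AG", "AG"))  # ('A', 'G')
print(phase_child("AG", "AG", "AG"))  # None: all three heterozygous
```

A return value of None also covers genotypes that are inconsistent with Mendelian inheritance.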
              Last edited by gsgs; 12-16-2012, 07:37 PM.

              Comment

              • gsgs
                Senior Member
                • Oct 2009
                • 139

                #8
                from the data I conclude that males usually only have
                one allele=haplotype=phase in the X-chromosome.
                Or only one X, (paired with Y) not 2.
                Or the 2 are similar, both from the mother, whatever.
                I never learned it, may have to look it up
                (any good short source recommended?
                Wikipedia is usually good, IMO)

                Anyway, no phasing problem in male X.
                So I took the 132 full male Xes from hapmap
                and made my mutation graphics



                only positions where data from >32 of the 132 are available,
                I took the 1st allele, but the 2nd is usually (99.3%) the same.
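
The share of calls in which both alleles agree can be measured with a few lines. This is a sketch with made-up example data. It only shows the calculation and does not reproduce the 99.3% figure.

```python
def homozygous_fraction(genotypes):
    """Fraction of called genotypes whose two letters are identical; None if no calls."""
    called = [g for g in genotypes if len(g) == 2 and "N" not in g]
    if not called:
        return None
    return sum(g[0] == g[1] for g in called) / len(called)


frac = homozygous_fraction(["AA", "GG", "AG", "NN", "TT"])
print(frac)  # 3 of the 4 called genotypes are homozygous
```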

                ----------------------------------------

                picture of genetic diversity of the X-chromosome


                132 male hapmap individuals
                39 ceu ,22 chb , 22 jpt ,49 yri (in this order, top to bottom in each of the
                163 groups in the picture)
                you can print it on 21 sheets of paper
                163*720=117360 (~117000) SNPs

                black pixel, iff the nucleotide is different to the average of the 132
                at that position. Orange pixel else.
                individuals are rows, positions are columns
                Last edited by gsgs; 12-19-2012, 10:23 PM.

                Comment

                • gsgs
                  Senior Member
                  • Oct 2009
                  • 139

                  #9
                  hah, it seems that they do have phasing information !



                  why did no one tell me ...

                  When I started this thread I had no idea about phasing, that humans
                  have two sets of DNA in each cell, one from father, one from mother.

                  It was also confusing that they use "phase" for the
                  time periods of the project as well.

                  And "haplotypes" for these 2 parental sets of DNA in the cell nuclei
                  as well as for the population groups?

                  Anyway.

                  They have 2022 "Total haplotypes in phased files" and (of course ?!?!)
                  we want them in that big binary array so we can work with them.
                  I hereby call it "the canonical SNP matrix (CSM)" of a set of DNAs,
                  until I find an existing name or a better one.

                  I doubt that I can finish the conversion this year, but whoever wants it
                  can get the 500 MB that I already have.
                  250*4170261 unphased (random) CSM from hapmap
                  2123*2177885 phase 1 (=haplotype 1) CSM from 1000-genomes/omni/haplotypes
                  2123*2177885 phase 2 (=haplotype 2) [=mother ?] CSM as above
                  positions,statistics,... to be completed
                  ~550 MB in total compressed with 7zip on Win32 micro SD

                  -------------update----------------
                  it seems that we get another factor of (at least) 2 with compressing these files,
                  so divide all these compressed MB by 2 !
                  So, expect a compression rate of 1% (1:100) or better for typical, big
                  *.vcf files of human DNA SNPs
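
Why such a matrix compresses well can be illustrated with a small sketch: rows of a 0/1 matrix are packed 8 entries per byte and then compressed. Python's standard `lzma` module stands in for 7zip here, which is an assumption about the setup and not the procedure used for the files above. The example row is made up, and real compression rates depend entirely on the data.

```python
import lzma


def pack_bits(row):
    """Pack a list of 0/1 values into bytes, 8 per byte, most significant bit first."""
    out = bytearray()
    for i in range(0, len(row), 8):
        chunk = row[i:i + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        out.append(byte << (8 - len(chunk)))  # left-align a short final chunk
    return bytes(out)


# Hypothetical sparse row: 1000 positions, three of them differ from the majority.
row = [0] * 1000
for i in (3, 250, 999):
    row[i] = 1

packed = pack_bits(row)
compressed = lzma.compress(packed)
print(len(row), len(packed), len(compressed))
```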

                  ---------------edit 2012.12.30-------------------
                  looking at the amount of hapmap data in the /phased/phaseIII directories,
                  I think it's probably less than 1000-genomes data, which is also phased.
                  So it may not be worth downloading and converting, which looks like
                  a lot of work because of the multiple directories and files for each population
                  and relatives type. E.g.
                  chr21, YRI , ~130haplotypes*19306SNPs
                  while the hapmap data earlier in this thread has 84*51198, although unphased.
                  And 1000-genomes has 2*2123*30000 pixels in the CSM for YRI,chr21
                  some of which could be redundant (parent and child) though
                  Last edited by gsgs; 12-29-2012, 12:54 AM.

                  Comment
