
  • hapmap format

    I can't find this ...


    in those typical hapmap files from
    ftp://ftp.ncbi.nlm.nih.gov/hapmap/ge...I+III/forward/


    like:
    (blanks replaced with comma)

    Code:
    rs#,alleles,chrom,pos,strand,assembly#,center,protLSID,assayLSID,panelLSID,QCcode,NA06984,...,NA12892
    rs28412942,A/T,chrM,410,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8575126:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,NN,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,AA,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,AA,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs3937039,A/G,chrM,665,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt663:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,AA,NN,NN,AA,AA,AA,NN,NN,AA,NN,AA,AA,AA,NN,AA,NN,NN,AA,NN,AA,AA,AA,NN,NN,AA,NN,AA,NN,AA,AA,AA,NN,NN,AA,AA,NN,NN,NN,AA,AA,NN,AA,NN,NN,AA,AA,AA,AA,AA,AA,AA,AA,NN,NN,AA,AA,AA,AA,AA,AA,NN,AA,AA,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,NN,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,AA,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,AA,NN,AA,AA,NN,NN,AA,NN,NN,AA,AA,NN,AA,AA,AA,AA,AA,NN,NN,NN,NN,NN,NN,AA,AA,AA,AA,AA,AA,NN,NN,NN,NN,NN,NN,NN,NN,NN,AA,AA,AA,AA,AA,AA,NN,AA,NN,NN,AA,AA
    rs2853517,A/G,chrM,711,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt709:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,GG,NN,NN,GG,GG,GG,NN,NN,AA,NN,AA,GG,AA,NN,GG,NN,NN,NN,NN,GG,AA,GG,NN,NN,GG,NN,GG,NN,GG,GG,GG,NN,NN,GG,GG,NN,NN,NN,GG,GG,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,GG,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,AA,GG,AA,GG,AA,GG,AA,GG,NN,GG,GG,GG,GG,GG,GG,AA,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,NN,GG,GG,NN,NN,GG,NN,NN,AA,GG,GG,AA,GG,GG,AA,AA,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,NN,NN,GG,GG
    rs28358568,C/T,chrM,712,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt710:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs2853519,A/G,chrM,771,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574695:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,GG,NN,NN,GG,GG,GG,NN,NN,GG,NN,GG,GG,GG,NN,GG,NN,NN,GG,NN,GG,GG,GG,NN,NN,GG,NN,GG,NN,GG,GG,GG,NN,NN,GG,GG,NN,NN,NN,GG,GG,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,NN,GG,GG,NN,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,NN,NN,GG,GG
    rs2853520,A/T,chrM,827,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt825:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs28358570,C/T,chrM,923,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574945:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs2856982,A/G,chrM,1020,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574722:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,GG,NN,NN,GG,GG,GG,NN,NN,GG,NN,GG,GG,GG,NN,GG,NN,NN,GG,NN,GG,GG,GG,NN,NN,GG,NN,GG,NN,GG,GG,GG,NN,NN,GG,GG,NN,NN,NN,GG,GG,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,NN,GG,GG,NN,NN,GG,NN,NN,GG,GG,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,NN,NN,NN,NN,NN,NN,NN,NN,GG,GG,GG,GG,GG,GG,NN,GG,NN,NN,GG,GG
    rs2000974,C/T,chrM,1050,+,ncbi_B36,affymetrix,urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2,urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-8574535:2,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,CC,NN,NN,CC,CC,CC,NN,NN,CC,NN,CC,CC,CC,NN,CC,NN,NN,CC,NN,CC,CC,CC,NN,NN,CC,NN,CC,NN,CC,CC,CC,NN,NN,CC,CC,NN,NN,NN,CC,CC,NN,CC,NN,NN,CC,CC,CC,CC,CC,CC,CC,CC,NN,NN,CC,CC,CC,CC,CC,CC,NN,CC,CC,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,NN,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,CC,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,CC,NN,CC,CC,NN,NN,CC,NN,NN,CC,CC,CC,CC,CC,CC,CC,CC,NN,NN,NN,NN,NN,NN,CC,CC,CC,CC,CC,CC,NN,NN,NN,NN,NN,NN,NN,NN,NN,CC,CC,CC,CC,CC,CC,NN,CC,NN,NN,CC,CC
    rs28358571,C/T,chrM,1191,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt1189:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,TT,TT,TT,NN,TT,NN,NN,TT,NN,TT,TT,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,CC,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,CC,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    rs28358572,C/T,chrM,1245,+,ncbi_b36,broad,urn:lsid:wicgr.hapmap.org:Protocol:genotype_protocol_1:1,urn:lsid:wicgr.hapmap.org:Assay:MITOCHONDRIA-mt1243:1,urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1,QC+,NN,TT,NN,NN,TT,TT,TT,NN,NN,TT,NN,CC,TT,TT,NN,TT,NN,NN,NN,NN,TT,CC,TT,NN,NN,TT,NN,TT,NN,TT,TT,TT,NN,NN,TT,TT,NN,NN,NN,TT,TT,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,TT,TT,TT,TT,TT,NN,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,NN,TT,TT,NN,NN,TT,NN,NN,TT,TT,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,NN,NN,NN,NN,NN,NN,NN,NN,TT,TT,TT,TT,TT,TT,NN,TT,NN,NN,TT,TT
    ...

    What do the double-letter entries mean?
    Does that sample (= person) have one of the 2 letters at that position in that chromosome? Which of the two?


    ACGT
    (N=not available)


    more keywords for google:

    98,185,150,120,112,127,121,97,195,113,220 total=1417
    of such entries in the 11 populations
    asw,ceu,chb,chd,gih,jpt,lwk,mex,mkk,tsi,yri


    found this comparison, Mar,2012, 1000 genomes vs. hapmap


    found another thread that counts ~15M SNPs via hapmap,
    but I get only ~4.3M SNP positions in that directory

    so far I have SNP-positions:
    20:124241
    01:326089
    02:337596
    21:54065
    22:59327
    10:218893
    _x:126192
    _y:1032
    _m:214

    no google hits for these numbers
    do I really need all chromosomes?
    Last edited by gsgs; 12-10-2012, 11:19 AM.

  • #2
    I now assume that the 2 letters stand for the 2 copies of the chromosome that the person carries,
    and I take one of them at random (50% each) as an estimate of what a random child of that person
    might have at that position, i.e. a random pick of one of the 2 copies in a random body cell
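The random pick described above could look roughly like this in Python. This is only a sketch: the function name `pick_allele` is made up for illustration, and it assumes every genotype call is exactly two letters, with "NN" meaning not available.

```python
import random

def pick_allele(genotype, rng=random):
    """Pick one of the two letters of a genotype call such as "AG".

    Each letter is chosen with probability 1/2; "NN" (not available)
    therefore always gives "N".
    """
    if len(genotype) != 2:
        raise ValueError(f"expected a 2-letter genotype, got {genotype!r}")
    return rng.choice(genotype)

rng = random.Random(0)  # fixed seed only to make the run repeatable
calls = ["AA", "AG", "GG", "NN"]
print([pick_allele(g, rng) for g in calls])
```

For homozygous calls (AA, GG, ...) the pick is trivially the same letter, so the randomness only matters for heterozygous calls.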

    ----------edit---------------
    yes, confirmed by the hapmap helpdesk
    see: http://en.wikipedia.org/wiki/Zygosity
    the 2 letters are sorted
    -------------------------------------------------------------------------------





    Col12 and on: observed genotypes of samples, one per column, sample identifiers in column
    headers (Coriell catalog numbers, example: NA10847). Duplicate samples have .dup suffix.
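A minimal Python sketch of that column layout, assuming the 11 leading metadata columns from the header shown earlier in this thread and blank-separated fields. The function name `parse_hapmap_line` is made up for illustration, and the LSID fields in the sample row are shortened to placeholders.

```python
# Number of metadata columns before the first sample column:
# rs#, alleles, chrom, pos, strand, assembly#, center,
# protLSID, assayLSID, panelLSID, QCcode
N_META = 11

def parse_hapmap_line(line, sep=None):
    """Split one line into (metadata_fields, per_sample_fields).

    sep=None splits on any whitespace (the downloaded files are
    blank-separated); pass sep="," for the comma form used in this thread.
    """
    fields = line.strip().split(sep)
    return fields[:N_META], fields[N_META:]

# Shortened header and data row (placeholders p, a, panel replace the LSIDs).
header = ("rs# alleles chrom pos strand assembly# center "
          "protLSID assayLSID panelLSID QCcode NA06984 NA12892")
row = "rs28412942 A/T chrM 410 + ncbi_B36 affymetrix p a panel QC+ NN TT"

_, samples = parse_hapmap_line(header)
meta, calls = parse_hapmap_line(row)
genotype_by_sample = dict(zip(samples, calls))
print(meta[0], meta[3], genotype_by_sample)
```

The header line gives the sample identifiers, so zipping it with a data line maps each person to his or her genotype at that SNP.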






    Each person has two copies of all chromosomes except the sex chromosomes.
    The set of alleles that a person has is called a genotype. For this SNP a person
    could have the genotype AA, AG, or GG. (See http://www.dnaftb.org/dnaftb/ for
    basic genetics information.) The term genotype can refer to the SNP alleles that
    a person has at a particular SNP, or for many SNPs across the genome.
    A method that discovers what genotype a person has is called genotyping.

    About 10 million SNPs exist in human populations, where the rarer SNP allele has a
    frequency of at least 1%

    ------------------------edit------------------------------
    now I found a short description here:
    Last edited by gsgs; 12-10-2012, 11:18 AM.



    • #3
      once I have succeeded in downloading and converting the data into a more computer-readable
      format, so that it can be used more easily, I'll make it freely available to other researchers
      on memory chips (micro-SD), to save them the trouble of downloading and converting,
      which took me ~20 hours



      • #4
        ftp://ftp.ncbi.nlm.nih.gov/hapmap/ge...I+III/forward/
        ( 14 google hits for that URL on Dec.07.2012 )

        275=25(chromosomes)*11(populations) files

        1417 samples = people.
        Per population :
        ASW,87,CEU,173,CHB,139,CHD,109,GIH,101,JPT,116,LWK,110,MEX,86,MKK,184,TSI,102,YRI,209

        4170261 different positions in total
        Per chromosome : (files f:\hapmap\gt\lp01,...,lp_m)
        01,326089,02,337596,03,268036,04,256273,05,257919
        06,282282,07,224360,08,224816,09,190885,10,218893
        11,213020,12,204336,13,162980,14,128559,15,111853
        16,115419,17,095494,18,124015,19,062364,20,124241
        21,054065,22,059327,_x,126193,_y,001032,_m,000214
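As a quick sanity check, the per-chromosome counts listed above can be added up in Python. This only re-does the arithmetic on the numbers as typed in this post; the dictionary name is made up.

```python
# SNP positions per chromosome, copied from the list above
# (leading zeros dropped, since Python does not allow them in integers).
positions_per_chromosome = {
    "01": 326089, "02": 337596, "03": 268036, "04": 256273, "05": 257919,
    "06": 282282, "07": 224360, "08": 224816, "09": 190885, "10": 218893,
    "11": 213020, "12": 204336, "13": 162980, "14": 128559, "15": 111853,
    "16": 115419, "17": 95494, "18": 124015, "19": 62364, "20": 124241,
    "21": 54065, "22": 59327, "_x": 126193, "_y": 1032, "_m": 214,
}

n_chromosomes = len(positions_per_chromosome)
total_positions = sum(positions_per_chromosome.values())
n_files = n_chromosomes * 11  # 11 populations, one file per chromosome each

print(n_chromosomes, n_files, total_positions)  # 25 275 4170261
```

The sum agrees with the 4170261 positions and the 275 files stated above.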

        -----------edit----------
        these numbers are now also found at

        who seems to be just copying stuff from seqanswers
        Last edited by gsgs; 12-16-2012, 01:03 AM.



        • #5
          I finished downloading the 25*11 files from that directory above
          and converted them to fasta files, one for each chromosome,
          with 1417 sequences each.
          Only the positions listed, those where SNPs do occur.
          I took the first nucleotide from the 2 in their pairs.
          Positions not listed for one of the 11 populations
          or "N" nucleotides are "-" in the fastas.
          5.9 GB for the 25 files, they typically compress at a rate of ~30%,
          so no size reduction yet as compared to the download.
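The conversion described above can be sketched in Python as follows. It handles one population file only, so the merging of positions across the 11 populations is not shown; the function name `genotypes_to_fasta` and the tiny input are made up for illustration.

```python
def genotypes_to_fasta(sample_ids, rows):
    """Build FASTA text with one sequence per sample.

    rows holds one list of genotype calls per SNP position, with one
    2-letter call per sample. Only the first letter of each call is kept,
    and "N" (not available) is written as "-".
    """
    seqs = {s: [] for s in sample_ids}
    for calls in rows:
        for s, call in zip(sample_ids, calls):
            first = call[0] if call else "N"
            seqs[s].append("-" if first == "N" else first)
    lines = []
    for s in sample_ids:
        lines.append(">" + s)
        lines.append("".join(seqs[s]))
    return "\n".join(lines) + "\n"

sample_ids = ["NA06984", "NA12892"]
rows = [["NN", "TT"],   # SNP position 1
        ["AG", "AA"]]   # SNP position 2
fasta = genotypes_to_fasta(sample_ids, rows)
print(fasta)
```

Each sequence has one character per listed SNP position, not one per base of the chromosome.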

          Most of the 1417 samples only have sparse data, only ~250 have
          more data, so it makes sense to only consider those.

          I made one mutation picture of a randomly chosen region, chromosome 8, positions
          100000-101200; only the 251 nonsparse sequences were taken
          that have more than 999 nucleotides available out of the 1200.
          (--> 81 CEU,43 CHB,42 JPT,84 YRI)
          Each difference from the average gives a black pixel,
          rows are sequences, columns are positions.
          That gives the first picture of the 3 below.
          Then the sequences (rows) are sorted to improve the clustering
          - picture 2
          Then the positions (columns) are sorted to improve the clustering
          - picture 3
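A rough Python sketch of the three pictures as 0/1 matrices. Two assumptions that are not from the post: "average" is read as the most common nucleotide in each column, and plain lexicographic sorting stands in for whatever sorting was actually used to improve the clustering. All function names are made up.

```python
from collections import Counter

def difference_matrix(seqs):
    """1 where a nucleotide differs from the column's most common one.

    seqs are equal-length strings, one per sample; "-" (no data) gives 0.
    """
    n_cols = len(seqs[0])
    consensus = []
    for j in range(n_cols):
        counts = Counter(s[j] for s in seqs if s[j] != "-")
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return [[int(s[j] != "-" and s[j] != consensus[j]) for j in range(n_cols)]
            for s in seqs]

def sort_rows(matrix):
    """Picture 2: reorder the sequences so that similar rows sit together."""
    return sorted(matrix)

def sort_columns(matrix):
    """Picture 3: reorder the positions so that similar columns sit together."""
    cols = sorted(zip(*matrix))
    return [list(row) for row in zip(*cols)]

seqs = ["ACGT", "ACGA", "TCGT", "A-GT"]
picture1 = difference_matrix(seqs)
picture2 = sort_rows(picture1)
picture3 = sort_columns(picture1)
for row in picture1:
    print("".join("#" if v else "." for v in row))
```

Lexicographic sorting is only the simplest choice; a real clustering order would group rows by overall similarity rather than by their first differing column.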

          usually without recombination/crossover you would only see a few
          thick black bars and rectangles as in this picture from human mtDNA:
          Attached Files
          Last edited by gsgs; 12-08-2012, 09:31 PM.



          • #6
            my keyword search led me
            to the concept of "linkage disequilibrium (LD)" to calculate mutation-age in
            > Recent acceleration of human adaptive evolution (2007)


            they also talk about the hapmap-data
            > The age of a mutation can be estimated from the decay of linkage
            > disequilibrium with flanking or intragenic polymorphisms because of
            > recombination and mutation and from the frequency of the mutation
            > itself as a consequence of genetic drift. Several methods have been
            > proposed, and the results from their applications can be combined
            > with population data to provide a critical view of the origin and natural
            > history of the mutation.


            this paper is cited by:
            Recent acceleration of human adaptive evolution

            The Evolution of Homophily
            Systematic underestimation of the age of selected alleles
            Exploring metazoan evolution through dynamic and holistic changes in protein families and domains
            Natural Selection Affects Multiple Aspects of Genetic Variation at Putatively Neutral Sites across the Human Genome
            Reproductive Benefit of Oxidative Damage: An Oxidative Stress “Malevolence”?
            Crohn’s Disease and Genetic Hitchhiking at IBD5
            Darwin in Mind: New Opportunities for Evolutionary Psychology
            Contributions of Dopamine-Related Genes and Environmental Factors to Highly Sensitive Personality: A Multi-Step Neuronal System-Level Approach
            The emergence of human-evolutionary medical genomics
            The Evolution of Autistic-Like and Schizotypal Traits: A Sexual Selection Hypothesis
            Personality and reproductive success in a high-fertility human population
            Climate change: Heat, health, and longer horizons
            Gene-culture coevolution in the age of genomics
            99th Dahlem Conference on Infection, Inflammation and Chronic Inflammatory Disorders: Darwinian medicine and the ‘hygiene’ or ‘old friends’ hypothesis
            The Genetics of Human Adaptation: Hard Sweeps, Soft Sweeps, and Polygenic Adaptation
            Consanguinity, human evolution, and complex diseases
            Detecting positive selection from genome scans of linkage disequilibrium
            Measuring cis-acting regulatory variants genome-wide: new insights into expression genetics and disease susceptibility
            Genetic Variation and Recent Positive Selection in Worldwide Human Populations: Evidence from Nearly 1 Million SNPs
            Evolutionary genomics of human intellectual disability
            The Role of Geography in Human Adaptation
            Signals of recent positive selection in a worldwide sample of human populations
            Review series on helminths, immune modulation and the hygiene hypothesis: The broader implications of the hygiene hypothesis
            Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow
            Are humans still evolving?: Technological advances and unique biological characteristics allow us to adapt to environmental stress. Has this stopped genetic evolution?
            The Dawn of Human Matrilineal Diversity
            Reconstructing phylogenies and phenotypes: a molecular view of human evolution
            AFRICAN GENETIC DIVERSITY: Implications for Human Demographic History, Modern Human Origins, and


            a lot to read ... :-(

            most recent: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431732/
            Last edited by gsgs; 12-09-2012, 10:41 PM.



            • #7
              I found this article:

              to learn about (keywords)
              haplotypes
              Fosmid-based haplotyping
              Single Individual Haplotyping (SIH)
              HapMap trio child, NA12878
              heterozygous SNPs
              phase
              diploid genomics
              regions with low linkage disequilibrium.
              unable to phase variants for which both parents are heterozygous (∼20% of SNPs).

              and why the 2 alleles at the positions (=haplotypes)
              are sorted in the hapmap-format but unsorted in
              the 1000-genomes/omni2123 data.
              ----------------------------------------------------------------------
              Human individuals are diploid, with each somatic cell containing two sets of chromosomes,
              one from each parent.

              NA12891 (Father); NA12892 (Mother); NA12878 (Daughter).
              Last edited by gsgs; 12-16-2012, 07:37 PM.



              • #8
                from the data I conclude that males usually only have
                one allele=haplotype=phase in the X-chromosome.
                Or only one X, (paired with Y) not 2.
                Or the 2 are similar, both from the mother, whatever.
                I never learned it, may have to look it up
                (any good short source recommended ?
                wikipedia is usually good, IMO)

                Anyway, no phasing problem in male X.
                So I took the 132 full male Xes from hapmap
                and made my mutation graphics



                only positions where data from >32 of the 132 are available;
                I took the 1st allele, but the 2nd is usually (99.3%) the same.
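A figure like the 99.3% above could be computed along these lines in Python. This is a sketch with a made-up function name and made-up example calls, not the script that produced the number; calls containing "N" are skipped.

```python
def same_allele_fraction(calls):
    """Share of available 2-letter calls whose two letters are equal.

    Calls containing "N" (not available) are ignored; returns None when
    no call is available at all.
    """
    known = [c for c in calls if len(c) == 2 and "N" not in c]
    if not known:
        return None
    return sum(c[0] == c[1] for c in known) / len(known)

# Hypothetical calls for one male at five X-chromosome SNPs.
calls = ["AA", "GG", "NN", "AG", "TT"]
print(same_allele_fraction(calls))  # 0.75
```

In this toy input 3 of the 4 available calls have two equal letters, hence 0.75.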

                ----------------------------------------

                picture of genetic diversity of the X-chromosome


                132 male hapmap individuals
                39 ceu ,22 chb , 22 jpt ,49 yri (in this order, top to bottom in each of the
                163 groups in the picture)
                you can print it on 21 sheets of paper
                163*720 ≈ 117000 SNPs

                black pixel iff the nucleotide differs from the average of the 132
                at that position, orange pixel otherwise.
                individuals are rows, positions are columns
                Last edited by gsgs; 12-19-2012, 10:23 PM.



                • #9
                  hah, it seems that they do have phasing information !



                  why did no one tell me ...

                  When I started this thread I had no idea about phasing, that humans
                  have two sets of DNA in each cell, one from father, one from mother.

                  It was also confusing that they use "phase" for the
                  time periods of the project as well.

                  And "haplotypes" for these 2 parental sets of DNA in the cell-nuclei
                  as well as for the population groups ?

                  Anyway.

                  They have 2022 "Total haplotypes in phased files" and (of course ?!?!)
                  we want them in that big binary array so we can work with it.
                  I hereby call it "the canonical SNP matrix (CSM)" of a set of DNAs,
                  until I find an existing name or a better name.

                  I doubt that I can finish the conversion this year, but anyone who wants it
                  can get the 500MB that I already have.
                  250*4170261 unphased (random) CSM from hapmap
                  2123*2177885 phase 1 (=haplotype 1) CSM from 1000-genomes/omni/haplotypes
                  2123*2177885 phase 2 (=haplotype 2) [=mother ?] CSM as above
                  positions,statistics,... to be completed
                  ~550 MB in total compressed with 7zip on Win32 micro SD
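One possible in-memory layout for such a matrix, sketched in Python. The encoding is an assumption, not the format of the files offered above: one byte per cell, rows are samples or haplotypes, columns are SNP positions, with codes 0-3 for A, C, G, T and 4 for missing. The names `build_csm` and `cell` are made up.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
MISSING = 4  # used for "N", "-" and anything else

def build_csm(seqs):
    """Pack equal-length nucleotide strings into one flat bytearray.

    The layout is row-major: all positions of the first sequence, then
    all positions of the second one, and so on.
    """
    n_cols = len(seqs[0])
    csm = bytearray()
    for s in seqs:
        if len(s) != n_cols:
            raise ValueError("all sequences must have the same length")
        csm.extend(CODE.get(ch, MISSING) for ch in s)
    return csm, n_cols

def cell(csm, n_cols, row, col):
    """Code stored for sequence `row` at SNP position `col`."""
    return csm[row * n_cols + col]

csm, n_cols = build_csm(["ACGT", "TN-A"])
print(list(csm), cell(csm, n_cols, 1, 0))

# Uncompressed size of the 250 x 4170261 hapmap matrix at one byte per cell.
print(250 * 4170261)  # 1042565250 bytes
```

At one byte per cell the 250 x 4170261 matrix is about 1.04 GB before compression; packing 2 or 3 bits per cell would shrink that further.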

                  -------------update----------------
                  it seems that we get another factor of (at least) 2 with compressing these files,
                  so divide all these compressed MB by 2 !
                  So, expect a compression rate of 1% (1:100) or better for typical, big
                  *.vcf files of human DNA SNPs

                  ---------------edit 2012.12.30-------------------
                  looking at the amount of hapmap data in the /phased/phaseIII directories,
                  I think it's probably less than 1000-genomes data, which is also phased.
                  So it may not be worth downloading and converting, which looks like
                  a lot of work because of the multiple directories and files for each population
                  and relatives type. E.g.
                  chr21, YRI , ~130haplotypes*19306SNPs
                  while the hapmap data earlier in this thread has 84*51198, although unphased.
                  And 1000-genomes has 2*2123*30000 pixels in the CSM for YRI,chr21
                  some of which could be redundant (parent and child) though
                  Last edited by gsgs; 12-29-2012, 12:54 AM.
