
  • 1000 genomes data format

    paper , 10 pages .pdf:

    supplementary material, 113 pages .pdf:

    ==============================================================


    after starting with the hapmap data (which looked easier) in this thread:

    I arrived (again) at the 1000 genomes data which apparently is more extensive.


    1000genomes,2012 paper,page 60,suppl.
    10.5 Haplotype estimation from OMNI data
    2123
    327 trios,42 duos, 1058 singles
    2177885 SNPs (-->~4.4 times more data than hapmap with 250 individuals at ~4M SNPs)

    no x,y,m chromosomes

    4 bytes per (individual,position) when expanded: "1|1","1|0","0|1" or "0|0" (3 characters) plus a tab (ASCII 009),
    so each entry carries only 2 bits of information, and that corresponds to the compression rate.
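The 4-bytes-to-2-bits point can be illustrated with a small sketch. It assumes every entry is one of the four phased strings above with no missing values; the names `GT_CODES`, `pack_genotypes` and `unpack_genotypes` are made up for this illustration and are not part of any 1000 genomes tool.

```python
# Map each phased genotype string to a 2-bit code (first allele is the high bit).
GT_CODES = {"0|0": 0b00, "0|1": 0b01, "1|0": 0b10, "1|1": 0b11}
GT_STRINGS = {code: gt for gt, code in GT_CODES.items()}


def pack_genotypes(genotypes):
    """Pack phased genotype strings into bytes, 4 genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, gt in enumerate(genotypes):
        packed[i // 4] |= GT_CODES[gt] << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover the first `count` genotype strings from the packed bytes."""
    return [GT_STRINGS[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
            for i in range(count)]


row = ["1|1", "1|0", "0|1", "0|0", "1|0"]
text_size = sum(len(gt) + 1 for gt in row)  # 3 characters plus one tab each
packed = pack_genotypes(row)
assert unpack_genotypes(packed, len(row)) == row
print(text_size, len(packed))  # 20 2
```

Five entries take 20 bytes as text and 2 bytes packed; for long rows the ratio approaches 16, which is in the same range as the 18.5 GB expanded versus 1.1 GB compressed reported later in the thread.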

    11,10,01,00 could correspond to the 2-letter entries in the hapmap data, where the letters are given in
    columns 2,3 and 1,0 indicate which one of them is chosen in which Zygote
    however hapmap would either have 01 or 10, never both (Zygotes indistinguishable)

    anyway, you choose one of the values at random and get that binary "diversity-matrix",
    here of size 2123x2177885, 578MB
    as compared to 250x4170000, 130MB for hapmap
    Well, it's not really binary since there are empty positions (~10%)
    I think we should somehow fill them in a random, unbiased way so to preserve the structure
    and statistical content.
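One possible reading of "choose one of the values at random" and "fill them in a random, unbiased way" is sketched below. It is only an assumption about what unbiased should mean here: missing entries are drawn from the observed 0/1 frequency of the same SNP column, which keeps per-SNP frequencies in expectation but does not preserve correlations between neighbouring SNPs. The function names are hypothetical.

```python
import random


def to_diversity_row(genotypes, rng):
    """Pick one of the two alleles at random per position; None marks a missing entry."""
    row = []
    for gt in genotypes:
        if gt is None or "." in gt:
            row.append(None)  # missing call
        else:
            row.append(int(rng.choice(gt.split("|"))))
    return row


def fill_missing(matrix, rng):
    """Fill None entries by sampling from the observed 0/1 frequency of the same column."""
    n_cols = len(matrix[0]) if matrix else 0
    for col in range(n_cols):
        observed = [r[col] for r in matrix if r[col] is not None]
        # With no observations at all, fall back to a fair coin.
        freq = sum(observed) / len(observed) if observed else 0.5
        for r in matrix:
            if r[col] is None:
                r[col] = 1 if rng.random() < freq else 0
    return matrix


rng = random.Random(0)  # fixed seed so the run is repeatable
matrix = [
    to_diversity_row(["1|1", "0|0", ".|."], rng),
    to_diversity_row(["1|1", "0|0", "0|1"], rng),
]
fill_missing(matrix, rng)
print(matrix)
```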


    page 4, Genotype fields , oGT
    > If genotype information is present, then the same types of data must be present for all samples.
    > First a FORMAT field is given specifying the data types and order (colon-separated alphanumeric
    > String). This is followed by one field per sample, with the colon-separated data in this field
    > corresponding to the types specified in the format. The first sub-field must always be the
    > genotype (GT) if it is present. There are no required sub-fields.
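As a minimal sketch of the quoted rule, the genotype of each sample can be pulled out of one tab-separated VCF data line as follows. It assumes the usual VCF column layout with FORMAT as the 9th column and the samples after it; the example line, including its position, ID and DS values, is invented for illustration and is not taken from the 1000 genomes files.

```python
def genotypes_from_vcf_line(line):
    """Return the GT sub-field of every sample in one VCF data line, or None if GT is absent."""
    fields = line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")   # 9th column is FORMAT, e.g. "GT:DS"
    if "GT" not in format_keys:
        return None
    gt_index = format_keys.index("GT")   # the quoted text says GT comes first when present
    return [sample.split(":")[gt_index] for sample in fields[9:]]


# Hypothetical data line with three samples.
example = "1\t10583\trs0000001\tG\tA\t100\tPASS\t.\tGT:DS\t0|0:0.05\t1|0:1.00\t1|1:2.00"
print(genotypes_from_vcf_line(example))  # ['0|0', '1|0', '1|1']
```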
    Last edited by gsgs; 12-19-2012, 05:39 PM.

  • #2
    anyway, it seems that you always arrive at such a binary "diversity-matrix",
    individuals over SNP-positions(all chromosomes)
    of size 2123x2177885, 578MB for 1000 genomes
    of size 250x4170000, 130MB for hapmap
    The bit at position (x,y) is set, iff sequence x differs at position y from the consensus
    (or average) in one of the 2 diploid alleles, chosen at random
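The two sizes follow from one bit per (individual, SNP) cell, with MB taken as 10^6 bytes. A quick check, with `bitmatrix_megabytes` as a made-up helper name:

```python
def bitmatrix_megabytes(individuals, snps):
    """Size in MB (10**6 bytes) of a matrix that stores one bit per cell."""
    return individuals * snps / 8 / 1e6


print(round(bitmatrix_megabytes(2123, 2177885)))  # 578
print(round(bitmatrix_megabytes(250, 4170000)))   # 130
```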


    We just need that giant binary matrix, what is it called ? Is there a math-theory
    about its properties,manipulation,relation to other objects,... already ?
    Shouldn't they offer that matrix for download directly ? (-->easier)
    Last edited by gsgs; 12-11-2012, 10:05 AM.



    • #3
      there are 445.6 TB (358954 files , 20992 directories) listed in current_tree on the ftp-site
      ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
      most (99.6%) TB are in the 3 data subdirs and the 3 technical subdirs :

      Code:
      subdir,files,terabytes
      data:
      main:153136,114.6
      phas:013311,056.7
      pilo:035889,013.1
      ---------------------------
      data:202336files,184.4TB
       
      technical:
      main:144568,218.9
      phas:007899,040.1
      pilo:000930,000.3
      --------------------------
      tech:153397files,259.3TB
       
      tota:358954files,445.6TB
      ftp:/data has 2456 genomenames and thus 2456 subdirectories HG00096...NA21144,
      which in turn have 3 subdirectories each , alignment,exome_alignment,sequence_read
      Each genome has between 2 and 3021 files in total in those 3 subdirectories.
      82 have more than 288 files
      192 have more than 253 files
      322 have more than 188 files
      Last edited by gsgs; 12-17-2012, 02:31 AM.



      • #4
        I finished downloading and converting the SNP-data
        from the 22 chromosomes at
        /omni/haplotypes
        2123 individuals, 1.1GB compressed, 18.5 GB expanded

        SNPs in the 22 available chromosomes:
        + 174234 + 182881 + 153373 + 142614 + 136754 + 136611
        + 121998 + 118385 + 097989 + 112704 + 109763 + 105824
        + 077675 + 072264 + 068091 + 073797 + 064155 + 064176
        + 047536 + 054241 + 030040 + 032780 = 2177885
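The total and the expanded size can be cross-checked with a few lines. The counts are copied in the order given above, and the 4 bytes per entry come from the first post; header lines and leading columns of the files are ignored, so this is only an approximation.

```python
snps_per_chromosome = [
    174234, 182881, 153373, 142614, 136754, 136611,
    121998, 118385, 97989, 112704, 109763, 105824,
    77675, 72264, 68091, 73797, 64155, 64176,
    47536, 54241, 30040, 32780,
]
individuals = 2123

total_snps = sum(snps_per_chromosome)
# 4 bytes per (individual, position) in the expanded text form.
expanded_gb = individuals * total_snps * 4 / 1e9

print(len(snps_per_chromosome), total_snps)  # 22 2177885
print(round(expanded_gb, 1))                 # 18.5
```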

        2*4.6 GB for the 2*23 fasta files uncompressed ~0.9GB compressed

        (~450000 GB total 1000-genome data)



        • #5
          Does anyone have an explanation for why the number of variants varies so much between release/20100804 and 20110521 for a given sample? I looked at the variants listed for the NA10851 sample in ftp://ftp.1000genomes.ebi.ac.uk/vol1...lease/20110521 and
          ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz, and found 39.7 and 14.6 million variants respectively. I expected some differences in genome locations, but was surprised by the big difference in the number of calls. Does anyone know what factors, other than the difference in genome build, could account for such a huge difference? Thanks in advance for your kind help.
          Last edited by rama; 01-07-2013, 08:38 AM.



          • #6
            Originally posted by gsgs
            anyway, it seems that you always arrive at such a binary "diversity-matrix",
            individuals over SNP-positions(all chromosomes)
            of size 2123x2177885, 578MB for 1000 genomes
            of size 250x4170000, 130MB for hapmap
            The bit at position (x,y) is set, iff sequence x differs at position y from the consensus
            (or average) in one of the 2 diploid alleles, chosen at random


            We just need that giant binary matrix, what is it called ? Is there a math-theory
            about its properties,manipulation,relation to other objects,... already ?
            Shouldn't they offer that matrix for download directly ? (-->easier)

            I found it here:



            "HapZipper" , a paper from 2012, but no followup :-(

            Their purpose is (was) to compress the data, rather than to analyse,manipulate,characterize,search,order,share
            them efficiently


            now you can reorder the rows [individuals] and columns [SNP-locations] of that matrix
            so to achieve the smallest sum of differences between neighbor rows and columns
            using some "traveling salesman" algorithm and then offer that matrix as a giant
            zoomable .pdf picture as a picture of human genetic diversity - to be compared with
            other species
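A rough sketch of the reordering idea, using a greedy nearest-neighbour heuristic in place of a real "traveling salesman" solver (this is a swapped-in simplification and gives no optimality guarantee). The function names are made up. The same function can be applied to the transposed matrix to order columns, although comparing all pairs is quadratic in the number of rows and would presumably be too slow for ~2.2 million SNP columns without further tricks.

```python
def hamming(a, b):
    """Number of positions at which two equal-length 0/1 rows differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_order(rows):
    """Start at row 0 and repeatedly append the closest not-yet-used row."""
    if not rows:
        return []
    remaining = list(range(1, len(rows)))
    order = [0]
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(last, rows[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def neighbour_cost(rows, order):
    """Sum of differences between consecutive rows in the given order."""
    return sum(hamming(rows[a], rows[b]) for a, b in zip(order, order[1:]))


rows = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
order = greedy_order(rows)
print(order)                                # [0, 2, 1, 3]
print(neighbour_cost(rows, [0, 1, 2, 3]))   # 11
print(neighbour_cost(rows, order))          # 5
```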

            Who will do it (first) ?


            [ I was trying to download genbank's dbSNP data, but couldn't figure out
            the format, what to download, how to convert it.
            https://en.wikipedia.org/wiki/DbSNP ]

            I'd like to have the HapZipper matrix of these 2280 human public domain genomes:
            Cheap sequencing has driven the proliferation of big human genome data aggregation consortiums, providing extensive reference datasets for genome research. These datasets, however, may come with restrictive terms of use, conditioned by the consent frameworks within which individuals donate their data. Having an aggregated genome dataset with unrestricted use, analogous to public domain licensing, is therefore unusually rare. Yet public domain data is tremendously useful because it allows freedom to perform research with it. This comes with the price of donors surrendering their privacy and accepting the associated risks derived from publishing personal data. Using the Repositive platform, an indexing service for human genome datasets, we aggregated all deposited files in public data sources under a CC0 license from 23andMe, a leading Direct-to-Consumer genetic testing service. After downloading 3,137 genotypes, we filtered out those that were incomplete, corrupt or duplicated, ending up with a dataset of 2,280 curated files, each one corresponding to a unique individual. Although the size of this dataset is modest compared to current major genome data aggregation projects, its full access and licensing terms, which allow free reuse without attribution, make it a useful reference pool for validation purposes and control experiments.
            Last edited by gsgs; 08-29-2017, 12:28 AM.

