  • Seeking statistics on genomic data

    Hi all. I'm wondering if any studies have been done on the statistics of human genomic data. In particular, on the distribution of bases and k-mers in human dna.

    For example, do G, C, A and T occur with equal frequency among the 3 gigabases? Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?

    Alternatively, if this stuff hasn't been well-studied, are there BAM files with complete genomes that are freely available? (I don't need CIGAR, or Phred scores, just the bases.)

    Cheers!

  • #2
    I don't know of any studies offhand, but I'm sure at least some of that has been looked at. For example, there's a "GC Percent" track in the UCSC browser, so someone has studied that.

    You can download the human genome in a few different formats: http://hgdownload.cse.ucsc.edu/downloads.html#human
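
    If you grab one of the FASTA downloads, a small reader like the one below is enough to get at "just the bases". This is only a sketch: `read_fasta` and `read_fasta_file` are names made up for illustration, and the file name in the usage comment is an assumption, not a guaranteed path on the download page.

```python
import gzip

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def read_fasta_file(path):
    """Read a plain or gzip-compressed FASTA file from disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from read_fasta(handle)

# Hypothetical usage (file name is an assumption):
# for name, seq in read_fasta_file("hg19.fa.gz"):
#     print(name, len(seq))
```

    Note that this holds one whole chromosome in memory at a time, and that lowercase letters in UCSC FASTA files mark soft-masked repeats, so uppercase the sequence before counting if you want to ignore masking.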



    • #3
      Originally posted by Fixee
      Hi all. I'm wondering if any studies have been done on the statistics of human genomic data. In particular, on the distribution of bases and k-mers in human DNA.

      For example, do G, C, A and T occur with equal frequency among the 3 gigabases? Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?

      Alternatively, if this stuff hasn't been well-studied, are there BAM files with complete genomes that are freely available? (I don't need CIGAR, or Phred scores, just the bases.)

      Cheers!
      I ran an A, C, G, T, and N frequency analysis a while back. For the 3-gigabase human reference genome, including the random and unknown sequences, I get:

      A 27.25%
      C 18.9%
      G 18.9%
      T 27.28%
      N 7.64%

      I haven't done the long k-mers analysis.

      Q



      • #4
        I second the results given by qtrinh (with slightly different rounding). I used the hg19.2bit file from UCSC.

        Total: 3,137,161,264 bases

        A: 854,963,149 bases (27.25%)
        C: 592,966,724 bases (18.90%)
        G: 593,325,228 bases (18.91%)
        T: 856,055,361 bases (27.29%)
        N: 239,850,802 bases (7.65%)

        These are the counts for one strand only. If you count bases on both strands, base complementarity means the totals for A and for T are each A+T, and the totals for C and for G are each C+G.
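
        For anyone who wants to reproduce this kind of tally, here is a minimal sketch. It is not the script used for the numbers above (those came from the hg19.2bit file); the function names are made up for illustration, and counting N as 2N over both strands is my assumption about how to treat it.

```python
from collections import Counter

BASES = "ACGTN"

def base_composition(seq):
    """Return {base: (count, percent)} for A, C, G, T and N, ignoring case."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in BASES)
    return {
        b: (counts[b], 100.0 * counts[b] / total if total else 0.0)
        for b in BASES
    }

def both_strand_counts(single):
    """Turn single-strand counts into counts over both strands.

    Every A on one strand pairs with a T on the other, so A and T
    both end up at A+T; likewise C and G both end up at C+G.
    """
    at = single["A"] + single["T"]
    cg = single["C"] + single["G"]
    return {"A": at, "T": at, "C": cg, "G": cg, "N": 2 * single.get("N", 0)}

# Single-strand hg19 counts quoted in this post.
hg19 = {"A": 854_963_149, "C": 592_966_724, "G": 593_325_228,
        "T": 856_055_361, "N": 239_850_802}
print(both_strand_counts(hg19))
```

        For a whole genome you would call `base_composition` per chromosome and add up the counts rather than concatenating everything into one string.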

        There has been work done to find over-represented patterns (a.k.a. motifs) in DNA using in silico (computational) methods. These motif-finding tools can be used to find biologically interesting patterns like transcription factor binding sites and paralogous genes. One example would be the random projection method (http://www.ncbi.nlm.nih.gov/pubmed/12015879), which starts its search by hashing k-mer sequences.

        I wish I could remember the paper, but I saw a graphic where they represent genomes as random walks. They start at the origin and move up 1 if the next base is an A, down 1 if it is a T, left 1 if it is a C, and right 1 if it is a G. If the bases were distributed at random, you would expect the path to look like a random walk. The genome is not completely random due to things like genes, CG islands, and repetitive regions. I know that people use hidden Markov models to model the distribution of bases in DNA, but I am not too familiar with specific techniques.
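
        The walk itself is easy to sketch. The code below follows the step rule as I described it (A up, T down, C left, G right), which may not match the paper's exact convention; skipping N and other symbols is my own assumption.

```python
# Step per base as (dx, dy): A up, T down, C left, G right.
STEPS = {"A": (0, 1), "T": (0, -1), "C": (-1, 0), "G": (1, 0)}

def dna_walk(seq):
    """Return the list of (x, y) positions visited, starting at the origin."""
    x = y = 0
    path = [(x, y)]
    for base in seq.upper():
        step = STEPS.get(base)
        if step is None:
            continue  # assumption: N and other symbols do not move the walker
        x += step[0]
        y += step[1]
        path.append((x, y))
    return path

print(dna_walk("ACGT"))
```

        Plotting the x and y columns of the returned path (with matplotlib, say) gives the kind of picture I have in mind; a long drift in one direction means a local excess of one base over its opposite.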



        • #5
          Originally posted by Fixee
          Hi all. I'm wondering if any studies have been done on the statistics of human genomic data. In particular, on the distribution of bases and k-mers in human DNA.
          Yes, people have been looking at that for a while, actually.
          Since the early 80s, for instance, statistics on base and k-mer frequencies have been used in gene finding (detection of coding regions in the genome).

          Originally posted by Fixee
          For example, do G, C, A and T occur with equal frequency among the 3 gigabases?
          No, they don't. And within each genome (especially in higher eukaryotes) you can find huge discrepancies. Look for "isochores", for instance (large regions of the human genome with fairly uniform GC content, some GC-rich and some GC-poor).
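
          To see these regional differences yourself, you can compute GC content in fixed windows along a chromosome. A rough sketch follows; the function name and the default window size are arbitrary choices of mine, not a standard isochore definition, and Ns are left out of the denominator.

```python
def gc_windows(seq, window=100_000, step=None):
    """Yield (start, gc_fraction) for windows along seq.

    The GC fraction is computed over called bases only (A, C, G, T);
    a window made up entirely of Ns yields NaN.
    """
    step = step or window  # non-overlapping windows by default
    seq = seq.upper()
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        called = gc + at
        yield start, (gc / called if called else float("nan"))

for start, gc in gc_windows("GGCCAATT", window=4):
    print(start, gc)
```

          On a real chromosome the window values swing well above and below the genome-wide average, which is exactly the heterogeneity I am talking about.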

          Originally posted by Fixee
          Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?
          Well, you should. The coding property is the strongest constraint on the primary sequence, and your statistics mainly depend on the type of region you are looking at. In the domain of "gene finding" you will find a lot of relevant literature. Just for fun, here are a few references, oldest first:

          - Grantham et al, 1980
          - Fickett, 1982
          - Staden and McLachlan, 1982
          - Gribskov et al, 1984 (codon usage)
          - Claverie and Bougueleret, 1986 (k-mer frequencies)
          - Fickett and Tung, 1992 (k-mers)

          Then in the 90s people started modeling k-mer frequencies using probabilistic models like Markov chains, but this is a long story.
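
          To give a flavour of the Markov chain idea, here is a sketch of the simplest case: a first-order chain whose transition probabilities are estimated from dinucleotide counts. Real gene finders use higher orders and separate models for coding and non-coding regions; the function name is made up for illustration.

```python
from collections import defaultdict

def transition_probs(seq):
    """Estimate P(next base | current base) from adjacent base pairs.

    Pairs containing N (or any non-ACGT symbol) are ignored. Only bases
    that appear at least once as the first base of a pair get a row.
    """
    counts = defaultdict(lambda: defaultdict(int))
    seq = seq.upper()
    for a, b in zip(seq, seq[1:]):
        if a in "ACGT" and b in "ACGT":
            counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: row.get(b, 0) / total for b in "ACGT"}
    return probs

print(transition_probs("ACACAG"))
```

          Comparing the tables you get from coding and non-coding sequence is a quick way to see why these models can tell the two apart.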

          Oh, and don't forget that almost half of the genome is made of "repeated" regions. Look for the "Alu" sequence, for instance. Over-represented k-mers may well correspond to these repeats.
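
          If you want to hunt for over-represented k-mers yourself, the basic counting step looks like the sketch below. It only counts one strand and skips k-mers containing N (both are my choices), and pure Python like this is fine for a chromosome-sized test but too slow and memory-hungry for large k on the full genome, where a dedicated k-mer counter such as Jellyfish is a better fit.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer made only of A, C, G and T."""
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts

counts = count_kmers("ACGTACGTNNACGT", 4)
print(counts.most_common(3))
```

          Sorting the result with `most_common` and comparing the top hits against known repeat families (Alu and friends) is a reasonable first sanity check.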



          • #6
            Also, a recent one:

            Error correction of high-throughput sequencing datasets with non-uniform coverage
            Paul Medvedev, Eric Scott, Boyko Kakaradov and Pavel Pevzner

            Bioinformatics
            Vol. 27 ISMB 2011, pages i137–i141
            doi:10.1093/bioinformatics/btr208

            They definitely look at k-mers there



            • #7
              Originally posted by Fixee
              Hi all. I'm wondering if any studies have been done on the statistics of human genomic data. In particular, on the distribution of bases and k-mers in human DNA.

              For example, do G, C, A and T occur with equal frequency among the 3 gigabases? Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?

              Alternatively, if this stuff hasn't been well-studied, are there BAM files with complete genomes that are freely available? (I don't need CIGAR, or Phred scores, just the bases.)

              Cheers!
              Many, many, many, many times.

              The genome isn't just a random assortment of nucleotides. In fact, if you look at the ratio of nucleotides to each other in coding regions compared to the whole genome, you'll see a dramatic difference (coding regions are GC rich). Things get more interesting if you start looking at multiple genomes and generating statistics related to transition:transversion ratio (closely linked with species) and indel size distribution between regions, etc.
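
              As an illustration of the transition:transversion statistic, here is a minimal sketch that classifies single-base substitutions given as (ref, alt) pairs. Transitions are A<->G and C<->T; every other substitution is a transversion. The function names and the input format are made up for this example; real pipelines would read the variants from a VCF.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def titv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Pairs that are not single-base A/C/G/T substitutions are skipped.
    Returns infinity if there are no transversions at all.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")

print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

              Running this separately on variants from coding and non-coding regions is one way to see the kind of regional differences I mean.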

              I assume you can obtain data from any of a number of publicly available next-gen sequencing projects. Most non-clinical sequencing study results are freely available (the 1000 Genomes Project comes to mind).



              • #8
                I wish I could remember the paper, but I saw a graphic where they represent genomes as random walks.
                The paper describing DNA walks can be found here.

                Also, here is a review (somewhat old) of some of the visualization methods for analyzing DNA.
