Hi all. I'm wondering whether any studies have been done on the statistics of human genomic data, in particular on the distribution of bases and k-mers in human DNA.
For example, do G, C, A and T occur with equal frequency among the 3 gigabases? Are there known "long" k-mers that occur with high frequency (I don't care if they code or not)?
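To make the question concrete, here is a minimal sketch of the kind of tally I have in mind, run on a toy string rather than a real genome. The function names `base_frequencies` and `kmer_counts` are placeholders I made up, and I'm assuming that anything other than A, C, G, T (such as N) should simply be skipped.

```python
from collections import Counter

def base_frequencies(seq):
    """Return the fraction of A, C, G and T among the unambiguous bases in seq."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: (counts[b] / total if total else 0.0) for b in "ACGT"}

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows that contain a non-ACGT character."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: ignore windows with N etc.
            counts[kmer] += 1
    return counts

toy = "ACGTACGTNNACGT"  # toy input, not real genomic data
print(base_frequencies(toy))
print(kmer_counts(toy, 4).most_common(3))
```

An in-memory `Counter` like this would probably not scale to long k-mers over 3 gigabases, so it only illustrates what I would like to count, not how to do it efficiently.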
Alternatively, if this stuff hasn't been well-studied, are there BAM files with complete genomes that are freely available? (I don't need CIGAR, or Phred scores, just the bases.)
Cheers!