I recently downloaded several versions of the human gene location annotation, one per genome build, from the NCBI MapView FTP site:
ftp://ftp.ncbi.nih.gov/genomes/MapVi...seq_gene.md.gz
ftp://ftp.ncbi.nih.gov/genomes/MapVi...seq_gene.md.gz
ftp://ftp.ncbi.nih.gov/genomes/MapVi...seq_gene.md.gz
ftp://ftp.ncbi.nih.gov/genomes/MapVi...seq_gene.md.gz
Below are the overall statistics on the number of gene entries in each build, based on these files.
Build   Total   Unique
37.3    36500   36451
36.3    96694   35171
35.1    46159   27281
34.3    28401   27455
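For reference, here is a minimal sketch of how counts like these could be computed, and how they could be broken down by another column to see where the duplicates come from. It assumes the file is tab-separated with a header row, and that it has columns named feature_id, feature_type and group_label with gene rows marked as GENE. Those names are my assumptions and should be checked against the actual header. The function name count_genes and the sample rows are made up for illustration.

```python
import csv
from collections import defaultdict

def count_genes(lines, id_col="feature_id", type_col="feature_type", group_col=None):
    """Return {group: (total_rows, unique_ids)} for rows whose type is GENE.

    Column names are assumptions about the file header, not confirmed.
    If group_col is None, all rows are counted under the single key "all".
    """
    reader = csv.DictReader(lines, delimiter="\t")
    totals = defaultdict(int)
    uniques = defaultdict(set)
    for row in reader:
        if row[type_col] != "GENE":
            continue  # skip non-gene rows
        key = row[group_col] if group_col else "all"
        totals[key] += 1
        uniques[key].add(row[id_col])
    return {k: (totals[k], len(uniques[k])) for k in totals}

# Made-up rows, only to show the shape of the input.
sample = [
    "feature_id\tfeature_type\tgroup_label",
    "GeneID:1\tGENE\tA",
    "GeneID:1\tGENE\tB",
    "GeneID:2\tGENE\tA",
    "GeneID:2\tRNA\tA",
]
print(count_genes(sample))                           # {'all': (3, 2)}
print(count_genes(sample, group_col="group_label"))  # {'A': (2, 2), 'B': (1, 1)}
```

On a real download, the same function could be fed an open file handle, for example from gzip.open(path, "rt"), where path is wherever the file was saved. If the per-group totals are close to the unique counts, the duplicates are spread across groups rather than being several locations within one group.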
I notice that the file for Build 36.3 has many more duplicate gene entries than any of the other builds. In fact, each gene has about 3 different locations in Build 36.3 (96694 entries for 35171 unique genes). As far as I understand, a gene with multiple locations usually suggests that the gene aligns across a gap in the reference genome assembly, or that the reference is misassembled there. For example: http://www.ncbi.nlm.nih.gov/gene/6011.
If this is true, then Build 36.3 is a reference genome with at least tens of thousands of misassemblies, far more errors than any of the other assemblies. That sounds unbelievable given that Build 36.3 (or hg18) used to be so widely used. Any hint on what the problem is? Thanks!