Quote:
Originally Posted by dpryan
1. Use things like "cut" and "uniq" to determine this. This isn't something you need to look up, just determine it yourself.
Well, there is no description in the tables I downloaded from UCSC. Otherwise, I could indeed have checked, as I did with the GENCODE annotation. Here are the first lines of each file:
refFlat file (RefSeq), that I used until now:
Code:
chr1 hg19_refFlat exon 11874 12227 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1 hg19_refFlat exon 12613 12721 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1 hg19_refFlat exon 13221 14409 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1 hg19_refFlat exon 14362 14829 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 14970 15038 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 15796 15947 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
refGene file (RefSeq):
Code:
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 66999639 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
knownGene file (UCSC):
Code:
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene exon 12646 12697 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 12595 12721 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 13403 13636 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene stop_codon 13637 13639 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 13403 14409 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
Perhaps there is a file somewhere with a more comprehensive description.
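For anyone following along, the cut/uniq check suggested in the quote above (roughly `cut -f3 file | sort | uniq -c`) can also be sketched in Python. This is only a minimal sketch: it assumes the usual 9-column GTF layout, the function name `count_feature_types` is made up for illustration, and the sample lines are copied from the refGene excerpt above.

```python
from collections import Counter


def count_feature_types(lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        # Split into at most 9 fields on any whitespace, so that the
        # attribute column (which contains spaces) stays in one piece.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # not a complete GTF record
        counts[fields[2]] += 1
    return counts


# Three records taken from the refGene excerpt above.
sample = [
    'chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";',
    'chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";',
    'chr1 hg19_refGene exon 66999639 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";',
]

for feature, n in sorted(count_feature_types(sample).items()):
    print(feature, n)
```

To run it on a whole file, pass an open file handle instead of the list, e.g. `count_feature_types(open("refGene.gtf"))`, where the file name is just a placeholder for whatever the download was saved as.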
Quote:
Originally Posted by dpryan
2. How does one define a gene? Is it a location, a sequence, something else? If you have essentially the same sequence on different chromosomes and both are expressed are they the same gene or different ones? In such cases, gencode/ensembl will give each instance a unique ID. UCSC will give each instance the same ID in such cases, which is a good way to completely break a LOT of programs. This is why one should normally quantify by gene ID. You can add gene names after everything is analysed.
Thank you for the explanation.
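One way to see whether a given file is affected by the shared-ID issue described in the quote is to list every gene_id that appears on more than one chromosome. The following is a rough sketch only: it assumes the 9-column GTF layout with a `gene_id "..."` attribute, the function name `gene_ids_on_multiple_chromosomes` is hypothetical, and the toy records are invented for illustration (they are not real annotation).

```python
import re
from collections import defaultdict

# Matches the gene_id attribute in the 9th GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_ids_on_multiple_chromosomes(lines):
    """Return {gene_id: [chromosomes]} for IDs seen on more than one chromosome."""
    chroms = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Keep the attribute column intact by limiting the split to 9 fields.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        match = GENE_ID.search(fields[8])
        if match:
            chroms[match.group(1)].add(fields[0])
    return {gene: sorted(c) for gene, c in chroms.items() if len(c) > 1}


# Invented toy records: GENE_A is placed on two chromosomes, GENE_B on one.
toy = [
    'chr1 toy exon 100 200 0.000000 + . gene_id "GENE_A"; transcript_id "GENE_A";',
    'chr2 toy exon 300 400 0.000000 + . gene_id "GENE_A"; transcript_id "GENE_A";',
    'chr1 toy exon 500 600 0.000000 - . gene_id "GENE_B"; transcript_id "GENE_B";',
]

print(gene_ids_on_multiple_chromosomes(toy))
```

Any ID this reports would be a candidate for the kind of ambiguity described above, although it does not catch duplicates that sit on the same chromosome.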
Quote:
Originally Posted by dpryan
3. UCSC annotations are rather minimalistic.
OK, but I find that very surprising for the annotation of well-known genes.