Old 05-12-2016, 06:35 AM   #10
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239

Quote:
Originally Posted by dpryan
1. Use things like "cut" and "uniq" to determine this. This isn't something you need to look up, just determine it yourself.
Well, there is no description in the tables I downloaded from UCSC. Otherwise, I could indeed have checked, as I did with the GENCODE annotation. Here are the first lines of each file:

refFlat file (RefSeq), that I used until now:

Code:
chr1	hg19_refFlat	exon	11874	12227	0.000000	+	.	gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1	hg19_refFlat	exon	12613	12721	0.000000	+	.	gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1	hg19_refFlat	exon	13221	14409	0.000000	+	.	gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1	hg19_refFlat	exon	14362	14829	0.000000	-	.	gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1	hg19_refFlat	exon	14970	15038	0.000000	-	.	gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1	hg19_refFlat	exon	15796	15947	0.000000	-	.	gene_id "WASH7P"; transcript_id "WASH7P";

refGene file (RefSeq):
Code:
chr1	hg19_refGene	start_codon	67000042	67000044	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
chr1	hg19_refGene	CDS	67000042	67000051	0.000000	+	0	gene_id "NM_032291"; transcript_id "NM_032291"; 
chr1	hg19_refGene	exon	66999639	67000051	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
chr1	hg19_refGene	CDS	67091530	67091593	0.000000	+	2	gene_id "NM_032291"; transcript_id "NM_032291"; 
chr1	hg19_refGene	exon	67091530	67091593	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
chr1	hg19_refGene	CDS	67098753	67098777	0.000000	+	1	gene_id "NM_032291"; transcript_id "NM_032291"; 
chr1	hg19_refGene	exon	67098753	67098777	0.000000	+	.	gene_id "NM_032291"; transcript_id "NM_032291"; 
chr1	hg19_refGene	CDS	67101627	67101698	0.000000	+	0	gene_id "NM_032291"; transcript_id "NM_032291";

knownGene file (UCSC):
Code:
chr1	hg19_knownGene	exon	11874	12227	0.000000	+	.	gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
chr1	hg19_knownGene	exon	12646	12697	0.000000	+	.	gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
chr1	hg19_knownGene	exon	13221	14409	0.000000	+	.	gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
chr1	hg19_knownGene	start_codon	12190	12192	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1	hg19_knownGene	CDS	12190	12227	0.000000	+	0	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1	hg19_knownGene	exon	11874	12227	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1	hg19_knownGene	CDS	12595	12721	0.000000	+	1	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1	hg19_knownGene	exon	12595	12721	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1	hg19_knownGene	CDS	13403	13636	0.000000	+	0	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1	hg19_knownGene	stop_codon	13637	13639	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1	hg19_knownGene	exon	13403	14409	0.000000	+	.	gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
There might be a file with a comprehensive description somewhere.
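
For what it's worth, the "cut" and "uniq" check suggested above can also be sketched in Python. This is only a rough illustration, assuming the files are tab-separated GTF with the attributes in column 9; the function name `summarize_gtf` and the regular expression are my own, not part of any tool. It counts the feature types in column 3 and how many records have identical gene_id and transcript_id, using a few of the lines pasted above as input:

```python
from collections import Counter
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def summarize_gtf(lines):
    """Count feature types (column 3) and records where gene_id == transcript_id."""
    features = Counter()
    same_id = 0
    total = 0
    for line in lines:
        # Skip blank lines and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        features[fields[2]] += 1
        attrs = dict(ATTR_RE.findall(fields[8]))
        total += 1
        if attrs.get("gene_id") == attrs.get("transcript_id"):
            same_id += 1
    return features, same_id, total


# Four records copied from the refGene and refFlat excerpts in this post
sample = [
    'chr1\thg19_refGene\tstart_codon\t67000042\t67000044\t0.000000\t+\t.\tgene_id "NM_032291"; transcript_id "NM_032291"; ',
    'chr1\thg19_refGene\tCDS\t67000042\t67000051\t0.000000\t+\t0\tgene_id "NM_032291"; transcript_id "NM_032291"; ',
    'chr1\thg19_refGene\texon\t66999639\t67000051\t0.000000\t+\t.\tgene_id "NM_032291"; transcript_id "NM_032291"; ',
    'chr1\thg19_refFlat\texon\t11874\t12227\t0.000000\t+\t.\tgene_id "DDX11L1"; transcript_id "DDX11L1"; ',
]

features, same_id, total = summarize_gtf(sample)
print(dict(features))
print(f"{same_id} of {total} records have gene_id == transcript_id")
```

On a real file, the same function could be fed an open file handle instead of the `sample` list.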

Quote:
Originally Posted by dpryan
2. How does one define a gene? Is it a location, a sequence, something else? If you have essentially the same sequence on different chromosomes and both are expressed, are they the same gene or different ones? In such cases, GENCODE/Ensembl will give each instance a unique ID. UCSC will give each instance the same ID, which is a good way to completely break a LOT of programs. This is why one should normally quantify by gene ID. You can add gene names after everything is analysed.
Thank you for the explanation.
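
To see whether an annotation has the problem described in point 2, one could look for gene IDs that occur on more than one chromosome. Below is a minimal sketch, again assuming tab-separated GTF input. The function name `multi_chrom_gene_ids` is hypothetical, and the records in `sample` are invented purely for illustration (they are not taken from any UCSC file):

```python
from collections import defaultdict
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]*)"')


def multi_chrom_gene_ids(lines):
    """Return {gene_id: [chromosomes]} for gene IDs seen on more than one chromosome."""
    chroms = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            chroms[match.group(1)].add(fields[0])
    return {gene: sorted(c) for gene, c in chroms.items() if len(c) > 1}


# Invented records: "geneA" appears on two chromosomes, "geneB" on one
sample = [
    'chr1\texample\texon\t100\t200\t0.000000\t+\t.\tgene_id "geneA"; transcript_id "geneA"; ',
    'chr5\texample\texon\t300\t400\t0.000000\t-\t.\tgene_id "geneA"; transcript_id "geneA"; ',
    'chr2\texample\texon\t500\t600\t0.000000\t+\t.\tgene_id "geneB"; transcript_id "geneB"; ',
]

ambiguous = multi_chrom_gene_ids(sample)
print(ambiguous)
```

An empty result would suggest that each gene ID is tied to a single chromosome in that file; a non-empty one lists the IDs that might confuse downstream programs.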

Quote:
Originally Posted by dpryan
3. UCSC annotations are rather minimalistic.
OK, but I am very surprised that this applies to the annotation of well-known genes.