In the genes.gtf from iGenomes (Ensembl GRCh37) there are cases where a single tss_id is assigned to transcripts with different transcription start sites (see one example below). I've found 27 tss_ids that each correspond to multiple TSS coordinates.
I've also seen the reverse: the same transcription start site coordinates (from different transcripts) assigned different tss_ids. I've found 21 of these.
Are these mistakes in the genes.gtf files or is there some logic behind this?
Can anyone shed light on this?
Thanks
t
====
An example of one tss_id assigned to different transcription start sites:
6 protein_coding exon 100009473 100009534 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P51299"; transcript_id "ENST00000486428"; transcript_name "CCNC-002"; tss_id "TSS132831";
6 protein_coding exon 100016054 100016103 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P45518"; transcript_id "ENST00000524049"; transcript_name "CCNC-012"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016499 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P64177"; transcript_id "ENST00000369217"; transcript_name "CCNC-004"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016504 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P29827"; transcript_id "ENST00000369220"; transcript_name "CCNC-011"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016504 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; transcript_id "ENST00000521017"; transcript_name "CCNC-016"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016507 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P13758"; transcript_id "ENST00000326298"; transcript_name "CCNC-010"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016522 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P71574"; transcript_id "ENST00000523961"; transcript_name "CCNC-009"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016528 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P883"; transcript_id "ENST00000484049"; transcript_name "CCNC-006"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016536 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P29986"; transcript_id "ENST00000523985"; transcript_name "CCNC-008"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016538 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P74717"; transcript_id "ENST00000518714"; transcript_name "CCNC-003"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016589 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P74717"; transcript_id "ENST00000520371"; transcript_name "CCNC-007"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016643 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; transcript_id "ENST00000523639"; transcript_name "CCNC-017"; tss_id "TSS132831";
6 protein_coding exon 100016372 100016849 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P27383"; transcript_id "ENST00000520429"; transcript_name "CCNC-001"; tss_id "TSS132831";
6 protein_coding exon 100016594 100016701 . - . exon_number "1"; gene_id "ENSG00000112237"; gene_name "CCNC"; p_id "P29986"; transcript_id "ENST00000523799"; transcript_name "CCNC-013"; tss_id "TSS132831";
=====
And an example of the reverse (1 TSS -> Multiple tss_ids):
16 protein_coding exon 1020788 1020982 . - . exon_number "1"; gene_id "ENSG00000103227"; gene_name "LMF1"; p_id "P22127"; transcript_id "ENST00000262301"; transcript_name "LMF1-001"; tss_id "TSS95890";
16 protein_coding exon 1020788 1020982 . - . exon_number "1"; gene_id "ENSG00000103227"; gene_name "LMF1"; p_id "P63891"; transcript_id "ENST00000539151"; transcript_name "LMF1-202"; tss_id "TSS76695";
16 protein_coding exon 1020788 1020982 . - . exon_number "1"; gene_id "ENSG00000103227"; gene_name "LMF1"; p_id "P5377"; transcript_id "ENST00000545827"; transcript_name "LMF1-206"; tss_id "TSS100452";
16 protein_coding exon 1020788 1020982 . - . exon_number "1"; gene_id "ENSG00000103227"; gene_name "LMF1"; p_id "P40790"; transcript_id "ENST00000399843"; transcript_name "LMF1-201"; tss_id "TSS101232";
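=====
For anyone who wants to check their own copy of the file, here is a minimal sketch of how both kinds of mismatch could be tallied. It is not necessarily how the counts above were produced. It assumes a tab-separated GTF with `transcript_id` and `tss_id` attributes on exon records, and it takes the TSS of a transcript to be the smallest exon start on the + strand or the largest exon end on the - strand. The function names `parse_exons` and `tss_consistency` are made up for this illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_exons(lines):
    """Yield (chrom, strand, start, end, attrs) for each exon record."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        yield fields[0], fields[6], int(fields[3]), int(fields[4]), attrs


def tss_consistency(lines):
    """Return (tss_ids with >1 site, sites with >1 tss_id)."""
    # transcript_id -> (chrom, strand, TSS coordinate, tss_id)
    transcripts = {}
    for chrom, strand, start, end, attrs in parse_exons(lines):
        tid = attrs.get("transcript_id")
        tss_id = attrs.get("tss_id")
        if tid is None or tss_id is None:
            continue
        # Assumed TSS definition: 5' end of the most upstream exon.
        pos = end if strand == "-" else start
        prev = transcripts.get(tid)
        if prev is not None:
            pos = max(pos, prev[2]) if strand == "-" else min(pos, prev[2])
        transcripts[tid] = (chrom, strand, pos, tss_id)

    id_to_sites = defaultdict(set)
    site_to_ids = defaultdict(set)
    for chrom, strand, pos, tss_id in transcripts.values():
        site = (chrom, strand, pos)
        id_to_sites[tss_id].add(site)
        site_to_ids[site].add(tss_id)

    multi_site = {k: v for k, v in id_to_sites.items() if len(v) > 1}
    multi_id = {k: v for k, v in site_to_ids.items() if len(v) > 1}
    return multi_site, multi_id
```

Usage would be something like `with open("genes.gtf") as fh: multi_site, multi_id = tss_consistency(fh)`, then `len(multi_site)` and `len(multi_id)` give the two counts. The numbers depend on the TSS definition assumed above.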