Hi all,
I have a question about how features are annotated in some GTF files. I've noticed that the CDS and exon features assigned to the same exon often have different start positions. For example, within a single transcript (importantly, the same transcript), several features can be annotated to exon 3, including both a CDS and an exon, and their start positions may differ. As this is neither the first nor the last exon, I assumed the difference could not be the 5' or 3' UTR, so what is it?
I know that CDS is the coding sequence and exon is the exon, but if part of the exon does not code, why is that part included in the exon feature, rather than the exon and CDS having the same coordinates? What is the extra part, if it is not a coding region/feature?
Also, in the example below, why is the start codon in exon 3? If coding only starts there, why is this not the first exon of the transcript, and why are the preceding exons annotated at all?
Any help would be great, as I'm trying to write a script that pulls out the exome using a GTF file.
Code:
chr3 protein_coding exon 195880 195990 . + . gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "1"; gene_biotype "protein_coding";
chr3 protein_coding exon 202306 202479 . + . gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "2"; gene_biotype "protein_coding";
chr3 protein_coding exon 204057 204213 . + . gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "3"; gene_biotype "protein_coding";
chr3 protein_coding CDS 204069 204213 . + 0 gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "3"; gene_biotype "protein_coding"; protein_id "ENSBTAP00000051775";
chr3 protein_coding start_codon 204069 204071 . + 0 gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "3"; gene_biotype "protein_coding";
chr3 protein_coding exon 206914 208046 . + . gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "4"; gene_biotype "protein_coding";
chr3 protein_coding CDS 206914 208046 . + 2 gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "4"; gene_biotype "protein_coding"; protein_id "ENSBTAP00000051775";
chr3 protein_coding exon 208701 208733 . + . gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "5"; gene_biotype "protein_coding";
chr3 protein_coding CDS 208701 208733 . + 0 gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; exon_number "5"; gene_biotype "protein_coding"; protein_id "ENSBTAP00000051775";
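To make the mismatch concrete, here is a rough Python sketch (not a full GTF parser) that pairs up the exon and CDS records above by exon_number and counts the exon bases not covered by a CDS. The names `parse_gtf_line` and `non_coding_bases` are made up for illustration, the records are rebuilt from the coordinates in the listing with the `protein_id` attribute left out for brevity, and it assumes at most one exon and one CDS record per exon number within a transcript.

```python
import re

# Shared attribute column, copied from the listing (protein_id omitted for brevity).
ATTRS = ('gene_id "ENSBTAG00000000584"; transcript_id "ENSBTAT00000056645"; '
         'exon_number "{n}"; gene_biotype "protein_coding";')

# (feature, start, end, frame, exon_number) taken from the records above.
RECORDS = [
    ("exon", 195880, 195990, ".", 1),
    ("exon", 202306, 202479, ".", 2),
    ("exon", 204057, 204213, ".", 3),
    ("CDS", 204069, 204213, "0", 3),
    ("start_codon", 204069, 204071, "0", 3),
    ("exon", 206914, 208046, ".", 4),
    ("CDS", 206914, 208046, "2", 4),
    ("exon", 208701, 208733, ".", 5),
    ("CDS", 208701, 208733, "0", 5),
]

GTF_LINES = [
    "\t".join(["chr3", "protein_coding", feature, str(start), str(end),
               ".", "+", frame, ATTRS.format(n=n)])
    for feature, start, end, frame, n in RECORDS
]


def parse_gtf_line(line):
    """Split one GTF line into its eight fixed fields plus an attribute dict."""
    # Splitting on any whitespace copes with tabs or spaces; the first eight
    # fields never contain whitespace, so the ninth piece is the attribute text.
    fields = line.strip().split(None, 8)
    attrs = dict(re.findall(r'(\S+) "([^"]*)";', fields[8]))
    return {
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "attrs": attrs,
    }


def non_coding_bases(lines):
    """Return {(transcript_id, exon_number): exon bases not covered by a CDS}."""
    exons, cds = {}, {}
    for line in lines:
        rec = parse_gtf_line(line)
        key = (rec["attrs"]["transcript_id"], int(rec["attrs"]["exon_number"]))
        if rec["feature"] == "exon":
            exons[key] = (rec["start"], rec["end"])
        elif rec["feature"] == "CDS":
            cds[key] = (rec["start"], rec["end"])

    result = {}
    for key, (e_start, e_end) in exons.items():
        exon_len = e_end - e_start + 1  # GTF coordinates are 1-based, inclusive
        if key in cds:
            c_start, c_end = cds[key]
            cds_len = c_end - c_start + 1
        else:
            cds_len = 0  # exon with no CDS record at all
        result[key] = exon_len - cds_len
    return result


for (transcript, exon_number), n in sorted(non_coding_bases(GTF_LINES).items()):
    print(f"{transcript} exon {exon_number}: {n} non-coding bases")
```

On these records, exons 1 and 2 have no CDS at all (111 and 174 bases), exon 3 has 12 bases upstream of the start codon at 204069, and exons 4 and 5 are fully covered by their CDS. So whichever way the exome is defined, a script has to choose between the exon intervals and the CDS intervals, because they are not interchangeable.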
Anthony