Hello,
I need to convert the genomic coordinates for UTR and CDS features in the arabidopsis gff3 gene annotations to transcript coordinates. Can anyone suggest a way to do this using awk or other languages? I have tried to first calculate the length of the features for each ID, so that I can calculate the feature start and end positions for each transcript based on the length. However I am not sure how to write the code for the second part. I have searched for related threads but none of the solution seemed to provide what I want. Is there already a package or script for this kind of conversion? Any suggestion is most welcome!
Here is a sample of my input data:
Here is my working code based on the thread here for calculating the UTR and CDS feature length for each transcript:
I need to convert the genomic coordinates for UTR and CDS features in the arabidopsis gff3 gene annotations to transcript coordinates. Can anyone suggest a way to do this using awk or other languages? I have tried to first calculate the length of the features for each ID, so that I can calculate the feature start and end positions for each transcript based on the length. However I am not sure how to write the code for the second part. I have searched for related threads but none of the solution seemed to provide what I want. Is there already a package or script for this kind of conversion? Any suggestion is most welcome!
Here is a sample of my input data:
Code:
more Athaliana_167_TAIR10.gene.gff3 ##gff-version 3 ##annot-version TAIR10 Chr1 phytozomev10 gene 3631 5899 . + . ID=AT1G01010.TAIR10;Name=AT1G01010 Chr1 phytozomev10 mRNA 3631 5899 . + . ID=AT1G01010.1.TAIR10;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010.TAIR10 Chr1 phytozomev10 five_prime_UTR 3631 3759 . + . ID=AT1G01010.1.TAIR10.five_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 CDS 3760 3913 . + 0 ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 CDS 3996 4276 . + 2 ID=AT1G01010.1.TAIR10.CDS.2;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 CDS 4486 4605 . + 0 ID=AT1G01010.1.TAIR10.CDS.3;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 CDS 4706 5095 . + 0 ID=AT1G01010.1.TAIR10.CDS.4;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 CDS 5174 5326 . + 0 ID=AT1G01010.1.TAIR10.CDS.5;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 CDS 5439 5630 . + 0 ID=AT1G01010.1.TAIR10.CDS.6;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 three_prime_UTR 5631 5899 . + . ID=AT1G01010.1.TAIR10.three_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964 Chr1 phytozomev10 gene 5928 8737 . - . ID=AT1G01020.TAIR10;Name=AT1G01020 Chr1 phytozomev10 mRNA 5928 8737 . - . ID=AT1G01020.1.TAIR10;Name=AT1G01020.1;pacid=19655142;longest=1;Parent=AT1G01020.TAIR10 Chr1 phytozomev10 CDS 8571 8666 . - 0 ID=AT1G01020.1.TAIR10.CDS.1;Parent=AT1G01020.1.TAIR10;pacid=19655142
Code:
awk '$3!="gene" && $3!="mRNA"' <Athaliana_167_TAIR10.gene.gff3|awk '$9~/Parent/{sub(/.*Parent=/,"",$9);sub(/.TAIR10;.*/,"",$9);ID=$9;L[ID]=$5-$4+1}{for(i in L){print $0,L[i]}}'