  • converting genomic coordinates to transcript coordinates

    Hello,
    I need to convert the genomic coordinates for UTR and CDS features in the Arabidopsis GFF3 gene annotations to transcript coordinates. Can anyone suggest a way to do this using awk or another language? I first tried calculating the length of the features for each ID, so that I can derive each feature's start and end position within its transcript from those lengths. However, I am not sure how to write the code for the second part. I have searched for related threads, but none of the solutions seemed to provide what I want. Is there already a package or script for this kind of conversion? Any suggestions are most welcome!

    Here is a sample of my input data:
    Code:
    more Athaliana_167_TAIR10.gene.gff3 
    ##gff-version 3
    ##annot-version TAIR10
    Chr1    phytozomev10    gene    3631    5899    .       +       .       ID=AT1G01010.TAIR10;Name=AT1G01010
    Chr1    phytozomev10    mRNA    3631    5899    .       +       .       ID=AT1G01010.1.TAIR10;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010.TAIR10
    Chr1    phytozomev10    five_prime_UTR  3631    3759    .       +       .       ID=AT1G01010.1.TAIR10.five_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    CDS     3760    3913    .       +       0       ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    CDS     3996    4276    .       +       2       ID=AT1G01010.1.TAIR10.CDS.2;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    CDS     4486    4605    .       +       0       ID=AT1G01010.1.TAIR10.CDS.3;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    CDS     4706    5095    .       +       0       ID=AT1G01010.1.TAIR10.CDS.4;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    CDS     5174    5326    .       +       0       ID=AT1G01010.1.TAIR10.CDS.5;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    CDS     5439    5630    .       +       0       ID=AT1G01010.1.TAIR10.CDS.6;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    three_prime_UTR 5631    5899    .       +       .       ID=AT1G01010.1.TAIR10.three_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
    Chr1    phytozomev10    gene    5928    8737    .       -       .       ID=AT1G01020.TAIR10;Name=AT1G01020
    Chr1    phytozomev10    mRNA    5928    8737    .       -       .       ID=AT1G01020.1.TAIR10;Name=AT1G01020.1;pacid=19655142;longest=1;Parent=AT1G01020.TAIR10
    Chr1    phytozomev10    CDS     8571    8666    .       -       0       ID=AT1G01020.1.TAIR10.CDS.1;Parent=AT1G01020.1.TAIR10;pacid=19655142

    Here is my working code, based on the thread here, for calculating the UTR and CDS feature lengths for each transcript:

    Code:
    awk '$3!="gene" && $3!="mRNA" && $9~/Parent/{sub(/.*Parent=/,"",$9);sub(/\.TAIR10;.*/,"",$9);print $0,$5-$4+1}' Athaliana_167_TAIR10.gene.gff3
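
One way to approach the second part is to keep a running offset per transcript: order each transcript's features 5' to 3', then place each feature at `offset + 1` through `offset + length`. Below is a minimal sketch in Python rather than awk, not an existing package. The function name `to_transcript_coords` is hypothetical, and the sketch assumes that the UTR and CDS rows of a transcript do not overlap and together tile its exons, as they appear to in the sample above. Minus-strand transcripts are walked from the highest genomic start downwards.

```python
import re
from collections import defaultdict


def to_transcript_coords(gff_lines):
    """Return (parent_id, feature_type, t_start, t_end) tuples, 1-based inclusive.

    Hypothetical helper: assumes non-overlapping UTR/CDS rows per transcript.
    """
    by_parent = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        # Split on any whitespace, keeping the attribute column in one piece.
        cols = line.rstrip("\n").split(None, 8)
        if len(cols) < 9 or cols[2] in ("gene", "mRNA"):
            continue
        match = re.search(r"Parent=([^;]+)", cols[8])
        if match is None:
            continue
        by_parent[match.group(1)].append(
            (int(cols[3]), int(cols[4]), cols[6], cols[2])
        )

    results = []
    for parent, feats in by_parent.items():
        strand = feats[0][2]
        # 5'->3' order: ascending start on '+', descending start on '-'.
        feats.sort(key=lambda f: f[0], reverse=(strand == "-"))
        offset = 0  # number of transcript bases already consumed
        for start, end, _, ftype in feats:
            length = end - start + 1
            results.append((parent, ftype, offset + 1, offset + length))
            offset += length
    return results
```

It could be called as `to_transcript_coords(open("Athaliana_167_TAIR10.gene.gff3"))`. For the sample above, the `five_prime_UTR` at 3631-3759 would map to 1-129 and the first CDS at 3760-3913 to 130-283.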
