Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Getting transcript lengths from GFF file

    Hello
    I have the maize reference annotation downloaded from http://ftp.maizesequence.org/current/working-set/

    the first 10 lines are as under:
    Code:
    UNKNOWN	ensembl	chromosome	1	14680007	.	.	.	ID=UNKNOWN;Name=chromosome:AGPv1:UNKNOWN:1:14680007:1
    UNKNOWN	ensembl	CDS	17354	17446	.	+	0	Parent=GRMZM2G524142_T01;Name=CDS.1254885
    UNKNOWN	ensembl	exon	17354	17460	.	+	.	Parent=GRMZM2G524142_T01;Name=GRMZM2G524142_E01
    UNKNOWN	ensembl	gene	17354	17460	.	+	.	ID=GRMZM2G524142;Name=GRMZM2G524142;biotype=protein_coding
    UNKNOWN	ensembl	mRNA	17354	17460	.	+	.	ID=GRMZM2G524142_T01;Parent=GRMZM2G524142;Name=GRMZM2G524142_T01;biotype=protein_coding
    UNKNOWN	ensembl	exon	17537	17976	.	-	.	Parent=GRMZM2G369735_T01;Name=GRMZM2G369735_E03
    UNKNOWN	ensembl	gene	17537	18259	.	-	.	ID=GRMZM2G369735;Name=GRMZM2G369735;biotype=protein_coding
    UNKNOWN	ensembl	mRNA	17537	18259	.	-	.	ID=GRMZM2G369735_T01;Parent=GRMZM2G369735;Name=GRMZM2G369735_T01;biotype=protein_coding
    UNKNOWN	ensembl	CDS	17716	17952	.	-	.	Parent=GRMZM2G369735_T01;Name=CDS.1254890
    UNKNOWN	ensembl	exon	17972	18045	.	-	.	Parent=GRMZM2G369735_T02;Name=GRMZM2G369735_E02
    How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?

    thanks
    Siva

  • #2
    First quick and dirty try:
    awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file
    I guess a much better solution will come soon..

    Comment


    • #3
      Hi Siva

      Originally posted by Siva View Post
      How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?
      This is a typical use case for my HTSeq Python framework.

      The following script does the job. Save it and run it with Python (after having installed the newest version, 0.4.3, of HTSeq):

      Code:
      import HTSeq
      
      gff_file = HTSeq.GFF_Reader( "ZmB73_4a.53_WGS.gff.gz", end_included=True )
      
      transcripts= {}
      
      for feature in gff_file:
         if feature.type == "exon":
            transcript_id = feature.attr['Parent']
            if transcript_id not in transcripts:
               transcripts[ transcript_id ] = list()
            transcripts[ transcript_id ].append( feature )
            
      for transcript_id in sorted( transcripts ):      
         transcript_length = 0
         for exon in transcripts[ transcript_id ]:
            transcript_length += exon.iv.length + 1
         print transcript_id, transcript_length, len( transcripts[ transcript_id ] )
      After a minute or two, it outputs a list of transcripts, each line giving the transcript ID, the length (summed over all its exons), and the number of exons.

      In order to advertise HTSeq a bit and demonstrate that it is easy to use I add here a commented version of the same script. Even if you've obtained only a minimal knowledge of Python (e.g., by reading only the first few chapters of Mark Lutz's very good book "Learning Python"), you will find that it iss easy to write such stuff yourself.

      Code:
      import HTSeq
      
      # A GFF_Reader parses a GFF file and presents its content as
      # a sequence of 'GenomicFeature' objects.
      gff_file = HTSeq.GFF_Reader( "ZmB73_4a.53_WGS.gff.gz", end_included=True )
      # In this GFF file, the 'end' coordinate is part of the feature,
      # hence 'end_included=True'.
      
      # The following dict will store, for each transcript, a list
      # of its exons
      transcripts= {}
      
      # Go through the GFF file and look for all exons:
      for feature in gff_file:
         # 'feature' is an HTSeq.GenomicFeature object, which describes a
         # line in the GFF file. It has several slots, for example, 'iv'
         # for the interval (where it is on the genome), 'type' for its
         # type (3rd column in the GFF file) and 'attr' for the
         # attributes (last column)
         # We are only interested in lines describing exons:
         if feature.type == "exon":
            # In this GFF file, the ID of the transcript to which the exons
            # belongs is stored in an attribute (i.e., something in the 9th
            # column) called "Parent"
            transcript_id = feature.attr['Parent']
            # Is this bthe first time we see this transcript?
            if transcript_id not in transcripts:
               # If so, add to the 'transcripts' dict an empty list 
               # store store the exons in
               transcripts[ transcript_id ] = list()
            # Add the exon to the list
            transcripts[ transcript_id ].append( feature )
            
      # Now, go through the transcripts and calculate the length of each
      # by adding up the lengths of its exons
      for transcript_id in sorted( transcripts ):      
         transcript_length = 0
         # Go through the GenomicFeature
         for exon in transcripts[ transcript_id ]:
            # Each GenomicFeature object has a slot called 'iv', which is a
            # GenomicInterval object, which in turn contains slots such as
            # 'chrom', 'start', 'end', 'strand', 'length', etc., to describe
            # where the feature sits in the genome. We use 'length' here.
            transcript_length += exon.iv.length + 1
            # We add the 1 because this GFF file does no
         # Now, 'transcript_length' is the sum of the lengths of all the
         # transcripts exon.
         # Print this result, and also add the number of exons as (which is just the 
         # length of the list of exons for the transcript):
         print transcript_id, transcript_length, len( transcripts[ transcript_id ] )
      All the classes mentioned here are, of course, explained in detail in HTSeq's reference documentation on the web site.

      Simon

      Comment


      • #4
        Originally posted by Siva View Post
        Hello
        I have the maize reference annotation downloaded from http://ftp.maizesequence.org/current/working-set/

        the first 10 lines are as under:
        Code:
        UNKNOWN	ensembl	chromosome	1	14680007	.	.	.	ID=UNKNOWN;Name=chromosome:AGPv1:UNKNOWN:1:14680007:1
        UNKNOWN	ensembl	CDS	17354	17446	.	+	0	Parent=GRMZM2G524142_T01;Name=CDS.1254885
        UNKNOWN	ensembl	exon	17354	17460	.	+	.	Parent=GRMZM2G524142_T01;Name=GRMZM2G524142_E01
        UNKNOWN	ensembl	gene	17354	17460	.	+	.	ID=GRMZM2G524142;Name=GRMZM2G524142;biotype=protein_coding
        UNKNOWN	ensembl	mRNA	17354	17460	.	+	.	ID=GRMZM2G524142_T01;Parent=GRMZM2G524142;Name=GRMZM2G524142_T01;biotype=protein_coding
        UNKNOWN	ensembl	exon	17537	17976	.	-	.	Parent=GRMZM2G369735_T01;Name=GRMZM2G369735_E03
        UNKNOWN	ensembl	gene	17537	18259	.	-	.	ID=GRMZM2G369735;Name=GRMZM2G369735;biotype=protein_coding
        UNKNOWN	ensembl	mRNA	17537	18259	.	-	.	ID=GRMZM2G369735_T01;Parent=GRMZM2G369735;Name=GRMZM2G369735_T01;biotype=protein_coding
        UNKNOWN	ensembl	CDS	17716	17952	.	-	.	Parent=GRMZM2G369735_T01;Name=CDS.1254890
        UNKNOWN	ensembl	exon	17972	18045	.	-	.	Parent=GRMZM2G369735_T02;Name=GRMZM2G369735_E02
        How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?

        thanks
        Siva
        If I misunderstood your question, I apologize. Here's a perl-way:

        < <input file> perl -e 'while(<>){ @splitted=split('\t', $_); $length= $splitted[4]-$splitted[3] + 1; $_=~s/\n//g; print $_, "\t", $length, "\n"; }' > <out file name> &

        This will add the length on the end of each line and print to <out file name>. Use & to run in the background so you can carry on with life.

        , And if you just want 'mRNA' "transcripts":

        < <input file> perl -e 'while(<>){ @splitted=split('\t', $_); $length= $splitted[4]-$splitted[3] + 1; $_=~s/\n//g; if($_=~/^.*mRNA.*\n/){print $_, "\t", $length, "\n";} }' > <out file name> &

        Only prints on lines with mRNA feature name included. Oh, and just so you can manipulate the above cml script to handle any various needs in the future, here's an explanation so you can expand:

        ^:= "from the beginning"
        .*:= "zero or more of anything
        \n:="newline char"
        $_:= "perl special variable that refers to the current line of a file being read"
        <>:="anonymous filehandle reading from stdin
        \t:="tab char"
        split(/pattern/, object or var):= "perl function that splits"
        $_=~s/\n//g; #this code statement says to substitute any newline characters within $_ (the current line) globally (g), and sub them for nothing (notice empty /\n// forward slashes).
        Last edited by JohnK; 05-02-2010, 10:18 AM.

        Comment


        • #5
          Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.

          Comment


          • #6
            Originally posted by Simon Anders View Post
            Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.
            Thanks John and Steven for your suggestions and explanations. Simon, I posted my query in a hurry. However, you are right I will need transcript lengths summed over exons. Thanks for sharing your script. I will post my feedback after running your script soon, I have updated HTSeq to the latest version. I will see if I can tweak John's command script. btw I have also just installed Bioperl 1.6.1 so I can use some its features too.

            more later
            many thanks
            Siva

            Comment


            • #7
              Hi Simon:
              I keep getting an unexpected keyword argument error while trying to compile your script. I have the latest HTSeq 0.4.3 installed. Here is the error:

              Code:
              "../simon_anders_gff.py", line 3, in <module>
                  gff_file = HTSeq.GFF_Reader("ZmB73_4a.53_WGS.gff",end_included=True)
              TypeError: __init__() got an unexpected keyword argument 'end_included'
              Siva

              Comment


              • #8
                Originally posted by Simon Anders View Post
                Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.
                Well, that's precisely what my one-liner (#2) is actually supposed to do

                awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file

                For each line with a third column "mRNA" and a 9th column starting with "ID", it extracts the ID (more precisely, what is between the "ID=" and the first "_T" in the column #9 -I assume this format is conserved throughout all the file). It adds to the total length of this ID (initially 0) the length of the corresponding exon. "L" stores the total length of the IDs. Each time an exon is found with a known ID, the total length is updated.
                Once the whole file has been read ("END" loop), all IDs are printed with their corresponding length. The output should be a text file with two columns: one with the mRNA IDs, the other one with their sizes.

                Please let me know if there is something wrong with this line and/or if something else than the mRNA lengths is required (in which case, a more elegant and clean package like HTSeq is probably the best option indeed).

                Comment


                • #9
                  Originally posted by steven View Post
                  Well, that's precisely what my one-liner (#2) is actually supposed to do

                  awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file
                  Steven your awk one liner works great. For example, I have 21 lines in the truncated GFF file depicting only a single gene (ID) with two transcripts on 'chr1'.

                  Code:
                   Input file: ZmB73_4a.53_WGS_chr1_21.gff
                  chr1    ensembl chromosome      1       300239041       .       .       .       ID=1;Name=chromosome:AGPv1:1:1:300239041:1
                  chr1    ensembl exon    3       104     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E07
                  chr1    ensembl gene    3       3807    .       +       .       ID=GRMZM2G060082;Name=GRMZM2G060082;biotype=protein_coding
                  chr1    ensembl mRNA    3       3807    .       +       .       ID=GRMZM2G060082_T01;Parent=GRMZM2G060082;Name=GRMZM2G060082_T01;biotype=protein_coding
                  chr1    ensembl intron  105     199     .       +       .       Parent=GRMZM2G060082_T01
                  chr1    ensembl exon    200     313     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E06
                  chr1    ensembl CDS     230     313     .       +       .       Parent=GRMZM2G060082_T01;Name=CDS.98360
                  chr1    ensembl intron  314     421     .       +       .       Parent=GRMZM2G060082_T01
                  chr1    ensembl CDS     422     604     .       +       0       Parent=GRMZM2G060082_T01;Name=CDS.98361
                  chr1    ensembl exon    422     604     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E05
                  chr1    ensembl intron  605     1015    .       +       .       Parent=GRMZM2G060082_T01
                  chr1    ensembl CDS     1016    1233    .       +       0       Parent=GRMZM2G060082_T01;Name=CDS.98362
                  chr1    ensembl exon    1016    1233    .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E04
                  chr1    ensembl intron  1234    1620    .       +       .       Parent=GRMZM2G060082_T01
                  chr1    ensembl CDS     1621    1966    .       +       2       Parent=GRMZM2G060082_T01;Name=CDS.98363
                  chr1    ensembl exon    1621    3807    .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E01
                  chr1    ensembl exon    2607    3218    .       +       .       Parent=GRMZM2G060082_T02;Name=GRMZM2G060082_E03
                  chr1    ensembl mRNA    2607    3754    .       +       .       ID=GRMZM2G060082_T02;Parent=GRMZM2G060082;Name=GRMZM2G060082_T02;biotype=protein_coding
                  chr1    ensembl CDS     2765    2947    .       +       .       Parent=GRMZM2G060082_T02;Name=CDS.98367
                  chr1    ensembl intron  3219    3300    .       +       .       Parent=GRMZM2G060082_T02
                  chr1    ensembl exon    3301    3754    .       +       .       Parent=GRMZM2G060082_T02;Name=GRMZM2G060082_E02
                  A friend and I made minor modifications to your one-liner to give transcript lengths summed over exons like this:

                  Code:
                  command:
                  awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);L[substr($9,8)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
                  This is the output:

                  Code:
                  GRMZM2G060082_T01 2804
                  GRMZM2G060082_T02 1066
                  Awk is truly powerful!! Can you modify it further to add a column that gives the number of exons per transcript?

                  thanks
                  Siva
                  Last edited by Siva; 05-05-2010, 09:34 AM.

                  Comment


                  • #10
                    Originally posted by Siva View Post
                    Hi Simon:
                    I keep getting an unexpected keyword argument error while trying to compile your script. I have the latest HTSeq 0.4.3 installed. Here is the error:

                    Code:
                    "../simon_anders_gff.py", line 3, in <module>
                        gff_file = HTSeq.GFF_Reader("ZmB73_4a.53_WGS.gff",end_included=True)
                    TypeError: __init__() got an unexpected keyword argument 'end_included'
                    Siva
                    Simon:
                    Do I need to install another package for your script to calculate length to work?
                    I built HTSeq from source and installed it like a binary as per your instructions. htseq-count works fine.

                    Siva

                    Comment


                    • #11
                      Originally posted by Siva View Post
                      Simon:
                      Do I need to install another package for your script to calculate length to work?
                      I built HTSeq from source and installed it like a binary as per your instructions. htseq-count works fine.
                      No, you need to upgrade to the newest version of HTSeq.

                      The reason is an annoying lack of clarity in the GFF specs, which I noticed when I wrote my post above. If, in a GFF file, a feature is said to range from, say, 1000 to 1015, is the base at position 1015 part of the feature? In other words: Is the length of the feature 1015-1000=15, or is it 1015-1000+1=16?

                      If you look at GTF files from Ensembl, you will notice that the end is not included. You can see that from looking at CDS entries, because their length has to be a mutiple of three. Your GFF file however, by the same criterion, includes the ende.

                      So, who is right? The original GFF specification is silent on the issue. Hence, I added a new option to HTSeq.GFF_Reader to allow you to specify which convention should be used.

                      Simon

                      Comment


                      • #12
                        simon:
                        I see!! Although I installed the latest version of HTSeq over the old version. Every time I type >>import HTSeq.....the old version gets imported. I will see if I can upgrade it.

                        Siva

                        Comment


                        • #13
                          deleted this message as I posted a new one...!!
                          Last edited by Siva; 05-05-2010, 08:52 PM.

                          Comment


                          • #14
                            Hi Simon
                            I am not sure what is happening. I built and installed your May 4 release HTSeq 0.4.3-p1, but still I get the same error "unexpected keyword end_included" while calling HTSeq.GFF_Reader. I thought that when installing a new version of a Python package the older version is automatically upgraded and the old dependencies etc., are erased. That does not seem to be the case here. Should I upgrade PyPI on the lines of CPAN::upgrade? Is there something else I am overlooking?

                            Siva
                            ps: this is the error message

                            Code:
                            >>> gff_file = HTSeq.GFF_Reader ("ZmB73_4a.53_WGS.gff.gz", end_included=True)
                            Traceback (most recent call last):
                              File "<stdin>", line 1, in <module>
                            TypeError: __init__() got an unexpected keyword argument 'end_included'

                            Comment


                            • #15
                              Originally posted by Siva View Post
                              I thought that when installing a new version of a Python package the older version is automatically upgraded and the old dependencies etc., are erased.
                              It seems that it's still there. Try the following. Start a Python session and type:

                              Code:
                              import HTSeq
                              print HTSeq.__file__
                              This will print out where Python found HTSeq and will indicate to you where the directory with the outdated version of HTSeq is hanging out on your hard disk. Just delete it and try again.

                              This here, by the way, gives you a list of all directories oin which Python searches (in the given order) for packages when doing an 'import':

                              Code:
                              import sys
                              print sys.path
                              Hope that helps
                              Simon

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X