Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by Siva View Post
    Steven your awk one liner works great.
    Thanks, glad i could help.

    Awk is truly powerful!!
    Yes, but not the best if you need more than a quick and dirty one liner. My ones sometimes get pages long and i do not recommend this. One typo in the name of a variable and you are done.. plus the memory management is not as good as in perl i think (i don't know about python).

    Can you modify it further to add a column that gives the number of exons per transcript?
    Sure. I would simply try

    Code:
    awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);ID=substr($9,8);L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
    cheers,
    s.

    Comment


    • #17
      [QUOTE=Simon Anders;18102]It seems that it's still there. Try the following. Start a Python session and type:

      Code:
      import HTSeq
      print HTSeq.__file__
      This will print out where Python found HTSeq and will indicate to you where the directory with the outdated version of HTSeq is hanging out on your hard disk. Just delete it and try again.

      Simon
      Thanks, it now works. Will post feedback after running it on the full GFF.

      Siva

      Comment


      • #18
        Originally posted by steven View Post
        Sure. I would simply try

        Code:
        awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);ID=substr($9,8);L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
        cheers,
        s.
        Steven
        This works fine too. Thanks!! Will post feedback soon.

        Siva

        Comment


        • #19
          when running py script on gff file of zea mays v3.21 from ensembl, it popped up this error:
          Traceback (most recent call last):
          File "lenTransGff.py", line 14, in <module>
          for feature in gff_file:
          File "/usr/lib64/python2.6/site-packages/HTSeq-0.5.4p5-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 223, in __iter__
          iv = GenomicInterval( seqname, int(start)-1, int(end), strand )
          File "_HTSeq.pyx", line 62, in HTSeq._HTSeq.GenomicInterval.__init__ (src/_HTSeq.c:2789)
          File "_HTSeq.pyx", line 71, in HTSeq._HTSeq.GenomicInterval.strand.__set__ (src/_HTSeq.c:2910)
          ValueError: Strand must be'+', '-', or '.'.
          any idea?

          for latest ensembl gff file of zea mays AGPv3.21, the awk script should also need some changes for matching in column 9. here is the one I modified:
          zcat Zea_mays.AGPv3.21.gff3.gz | awk '$3=="exon" && $9~/Parent/{sub(/.*Parent=/,"",$9);sub(/;.*/,"",$9);ID=$9;L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' > len_trans_Enum.txt
          hope some one need this.

          Comment


          • #20
            Need to find intronic regions from a gtf gile

            Hello Simon,

            I found the HTSeq amazing to work with, I am trying to find intronic regions from a gtf file for horse. Here in this thread, you have clearly explained to get transcripts and their respective lengths. But what if a single gene have multiple transcripts, i wanted to choose the transcripts with maximum length. I wanted a output which gives gene_id, transcript_id, intron_start, intron_end, transcript_length.

            Could you please give me a idea how to do this using HTseq.

            Thanks in advance.

            Comment


            • #21
              Originally posted by Simon Anders View Post
              Code:
                 for exon in transcripts[ transcript_id ]:
                    # Each GenomicFeature object has a slot called 'iv', which is a
                    # GenomicInterval object, which in turn contains slots such as
                    # 'chrom', 'start', 'end', 'strand', 'length', etc., to describe
                    # where the feature sits in the genome. We use 'length' here.
                    transcript_length += exon.iv.length + 1
                    # We add the 1 because this GFF file does no
                 # Now, 'transcript_length' is the sum of the lengths of all the
                 # transcripts exon.
                 # Print this result, and also add the number of exons as (which is just the 
                 # length of the list of exons for the transcript):
                 print transcript_id, transcript_length, len( transcripts[ transcript_id ] )
              Simon
              So the iv object already accounted for the 1-based counting and end inclusive (information from the GFF_reader documentation of HTSeq) when calculating the length? So why do you need a +1 again? The documentation seems to want to explain it but somehow truncated.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 03-27-2024, 06:37 PM
              0 responses
              12 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-27-2024, 06:07 PM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              69 views
              0 likes
              Last Post seqadmin  
              Working...
              X