Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq extract part of the GTF information

    Hi all,

    I am a beginner of python and just installed the HTSeq package.

    my question is:

    after read in the GTF file with the HTSeq.GFF_reader function, is there an efficient way
    to exact certain lines from the GTF file?

    now, I have three gene lists: high expression, medium expression and low expression,
    with first column the gene name, and the second column counts of RNA-seq (generated by htseq.count)

    one way I can do is:
    gtf_file=HTseq.GFF_reader(ifile)

    genelist= open(highexpression)

    for line in genelist:
    line_list=line.split()
    for feature in gtf_file:
    if feature.name==linelist[0]: # basically check the gene_name.
    do something

    but it is really slow because of the nested loop.

    I want to plot a TSS plot as shown here http://www-huber.embl.de/users/ander...c/tss.html#tss

    but I want to group Transcription start sites according to their expression levels(high, medium and low), and then plot three curves together.
    -------------------------------------------------------------
    one comment on the TSS plot is that when Dr.Simon Anders (the package creator, thanks!) drew
    the TSS plots, he used the start point of exon1 as TSS, it is not correct.
    TSS is the transcription start site,
    the start point of the exon1 is the translation start site.

    to plot a TSS plot, we need the GTF file containing the 5UTR information
    which can be obtained here http://genomewiki.ucsc.edu/index.php..._or_gff_format

  • #2
    To start with the second one: In an Ensembl GTF file, each exon appears twice, once with type "exon", once with "CDS". The start of exon 1 is a transcription start site, the start of CDS 1 is a translation start site. (For exons without untranslated regions, the coordinates in the exon line and the CDS line are the same.)

    I have never checked whether the GTF files from UCSC follow this convention, too, but given the general mess that the GFF/GTF format is, it would be no surprise if it were different.

    Now, to your actual Python question. The easiest might be to first read the expression-stratified gene lists into a set:

    Code:
    highexpr = set()
    for line in open(highexpression):
       highexpr.add( line.split()[0] )
    Now, you have all the highly expressed genes in memory. Hence, when you loop through the GFF file, you do:
    Code:
    for feature in gtf_file:
       if feature.name in highexpr:
          do_something()
    As you can see, using a container to store the information from one file in memory allows you to process your two files separately and so avoid the nested loop. This, of course is a very general and basic design pattern that you will encounter very often. (The different kinds of data containers that programming languages offer and how and when to use them is probably the next thing you need to study to further your programming skills.)
    Last edited by Simon Anders; 03-06-2013, 12:19 PM.

    Comment


    • #3
      Originally posted by Simon Anders View Post
      To start with the second one: In an Ensembl GTF file, each exon appears twice, once with type "exon", once with "CDS". The start of exon 1 is a transcription start site, the start of CDS 1 is a translation start site. (For exons without untranslated regions, the coordinates in the exon line and the CDS line are the same.)

      I have never checked whether the GTF files from UCSC follow this convention, too, but given the general mess that the GFF/GTF format is, it would be no surprise if it were different.

      Now, to your actual Python question. The easiest might be to first read the expression-stratified gene lists into a set:

      Code:
      highexpr = set()
      for line in open(highexpression):
         highexpr.add( line.split()[0] )
      Now, you have all the highly expressed genes in memory. Hence, when you loop through the GFF file, you do:
      Code:
      for feature in gtf_file:
         if feature.name in highexpr:
            do_something()
      As you can see, using a container to store the information from one file in memory allows you to process your two files separately and so avoid the nested loop. This, of course is a very general and basic design pattern that you will encounter very often. (The different kinds of data containers that programming languages offer and how and when to use them is probably the next thing you need to study to further your programming skills.)

      Thanks so much for your quick reply!
      I am glad that your clarification about the gtf file. I usually use gtf files from UCSC.

      for the python question, I actually figure it out in another way after spending some time searching online, but the basic idea is the same as yours. wow, google is the best teacher



      gtf_dict={ } #make a dictionary. key is the gene_name, value is the TSS infomation
      for feature in gtf_file:
      gtf_dict[feature.name]= feature.iv.start_d_as_pos

      tsspos=set()

      for line in highexpr:
      linelist=line.split()
      try:
      tsspos.add(gtf_dict[linelist[0]])
      except:
      continue # in case the key is not present

      And it run much faster!

      Thanks again for your nice reply.

      Comment


      • #4
        Hi Simon,

        I just do not want to open another thread. I want to do some Interval overlapping by HTSeq.
        Let's say I have two ChIP-seq data sets, they are bed file with four columns( chr start end some_value)
        and I want to get the overlapping intervals. I do not want to use nested loops...

        I read http://psaffrey.wordpress.com/2011/04/

        How can I do it efficiently by using the GenomicInterval class with the GenomicsInterval.overlap() method?

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Essential Discoveries and Tools in Epitranscriptomics
          by seqadmin




          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
          04-22-2024, 07:01 AM
        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        59 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        57 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        51 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        55 views
        0 likes
        Last Post seqadmin  
        Working...
        X