**Simon Anders** · 03-06-2013, 12:17 PM

To start with the second one: In an Ensembl GTF file, each exon appears twice, once with type "exon", once with "CDS". The start of exon 1 is a transcription start site, the start of CDS 1 is a translation start site. (For exons without untranslated regions, the coordinates in the exon line and the CDS line are the same.)

I have never checked whether the GTF files from UCSC follow this convention, too, but given the general mess that the GFF/GTF format is, it would be no surprise if it were different.
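To make the exon/CDS distinction concrete, here is a minimal sketch in plain Python. The GTF lines are made up for illustration (one plus-strand transcript, Ensembl-style), and `first_start` is a hypothetical helper, not part of HTSeq. It only handles the plus strand; on the minus strand the relevant coordinate would be the largest end instead.

```python
# Hypothetical, hand-written GTF lines for one plus-strand transcript.
# Exon 1 has a 5' UTR, so its CDS line starts later than its exon line.
# Exon 2 has no UTR, so its exon and CDS coordinates are identical.
GTF_LINES = [
    'chr1\tprotein_coding\texon\t1000\t1200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1";',
    'chr1\tprotein_coding\tCDS\t1101\t1200\t.\t+\t0\tgene_id "G1"; transcript_id "T1"; exon_number "1";',
    'chr1\tprotein_coding\texon\t1500\t1700\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "2";',
    'chr1\tprotein_coding\tCDS\t1500\t1700\t.\t+\t2\tgene_id "G1"; transcript_id "T1"; exon_number "2";',
]


def first_start(lines, feature_type):
    """Return the smallest 1-based start among plus-strand lines of one type."""
    starts = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # GTF columns: 3 = feature type, 4 = start, 7 = strand (1-based counting)
        if fields[2] == feature_type and fields[6] == "+":
            starts.append(int(fields[3]))
    return min(starts)


tss = first_start(GTF_LINES, "exon")               # transcription start site
translation_start = first_start(GTF_LINES, "CDS")  # translation start site
print(tss, translation_start)
```

The difference between the two numbers is the length of the 5' UTR in this toy transcript.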

Now, to your actual Python question. The easiest might be to first read the expression-stratified gene lists into a set:

Code:

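highexpr = set()
with open(highexpression) as f:   # highexpression holds the path to the gene list
    for line in f:
        highexpr.add(line.split()[0])   # keep only the first column of each line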

Now, you have all the highly expressed genes in memory. Hence, when you loop through the GFF file, you do:

Code:

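for feature in gtf_file:            # gtf_file: your HTSeq GFF reader
    if feature.name in highexpr:    # set lookup, no inner loop needed
        do_something()              # placeholder for your own processing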

As you can see, using a container to store the information from one file in memory allows you to process your two files separately and so avoid the nested loop. This, of course, is a very general and basic design pattern that you will encounter very often. (The different kinds of data containers that programming languages offer, and how and when to use them, are probably the next thing you need to study to further your programming skills.)
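The two-pass pattern can be tried out without any files. The sketch below uses small in-memory lists as hypothetical stand-ins for the expression table and the GTF features; the gene names and positions are invented.

```python
# Hypothetical stand-ins for the two input files.
expression_lines = ["GeneA 120.5", "GeneC 98.1"]   # gene name, expression value
gtf_features = [("GeneA", 1000), ("GeneB", 5000), ("GeneC", 7500)]  # (name, position)

# Pass 1: go through the first "file" once and keep the gene names in a set.
highexpr = set()
for line in expression_lines:
    highexpr.add(line.split()[0])

# Pass 2: go through the second "file" once; each membership test is a hash
# lookup, so the total work grows with len(file1) + len(file2), not their product.
selected = []
for name, position in gtf_features:
    if name in highexpr:
        selected.append((name, position))

print(selected)
```

A set is enough here because only membership matters; if you need to carry a value along with each gene name, a dictionary plays the same role.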

**crazyhottommy** · 03-06-2013, 03:42 PM

Thanks so much for your quick reply!
I am glad to have your clarification about the GTF file. I usually use GTF files from UCSC.

For the Python question, I actually figured it out another way after spending some time searching online, but the basic idea is the same as yours. Wow, Google is the best teacher.

Code:

gtf_dict = {}  # dictionary: key is the gene name, value is the TSS information
for feature in gtf_file:
    gtf_dict[feature.name] = feature.iv.start_d_as_pos

tsspos = set()

for line in highexpr:  # here highexpr is the open file of highly expressed genes
    linelist = line.split()
    try:
        tsspos.add(gtf_dict[linelist[0]])
    except (KeyError, IndexError):
        continue  # gene not present in the GTF, or an empty line

And it runs much faster!
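One thing to watch with a dictionary keyed by gene name: if the GTF lists several features per gene, each assignment overwrites the previous one, so the stored position is that of whichever feature comes last in the file. A possible variant, sketched below, keeps only "exon" features and remembers the most upstream 5' end per gene. The `Feature` tuple is a hypothetical stand-in for HTSeq features (0-based, half-open coordinates), and "most upstream exon start" is only one of several reasonable definitions of a gene's TSS.

```python
from collections import namedtuple

# Hypothetical stand-in for HTSeq features: 0-based, half-open coordinates.
Feature = namedtuple("Feature", "name type strand start end")

features = [
    Feature("GeneA", "exon", "+", 1000, 1200),
    Feature("GeneA", "exon", "+", 1500, 1700),
    Feature("GeneB", "exon", "-", 4000, 4300),
    Feature("GeneB", "exon", "-", 5000, 5200),
]

tss_by_gene = {}
for f in features:
    if f.type != "exon":
        continue
    # 5' end of this exon on its own strand (last base for minus-strand exons).
    five_prime = f.start if f.strand == "+" else f.end - 1
    if f.name not in tss_by_gene:
        tss_by_gene[f.name] = five_prime
    elif f.strand == "+":
        tss_by_gene[f.name] = min(tss_by_gene[f.name], five_prime)
    else:
        tss_by_gene[f.name] = max(tss_by_gene[f.name], five_prime)

# Genes missing from the dictionary are skipped by the "in" test,
# so no try/except is needed.
wanted = ["GeneA", "GeneB", "GeneX"]
tsspos = {tss_by_gene[g] for g in wanted if g in tss_by_gene}
print(tss_by_gene, tsspos)
```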

Thanks again for your nice reply.

**crazyhottommy** · 04-03-2013, 11:51 AM

Hi Simon,

I did not want to open another thread, so I am asking here. I want to compute interval overlaps with HTSeq.
Let's say I have two ChIP-seq data sets, stored as BED files with four columns (chr, start, end, some_value),
and I want to get the overlapping intervals. I do not want to use nested loops...

I read http://psaffrey.wordpress.com/2011/04/

How can I do this efficiently using the GenomicInterval class and its overlaps() method?
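For reference, one way to avoid the nested loop is a sorted sweep over both interval lists. The sketch below is plain Python and does not use HTSeq at all, so it is a swapped-in technique rather than an answer in terms of GenomicInterval. It assumes half-open BED coordinates and that the intervals within each file do not overlap one another (typical for peak calls); the function names and the sample lines are hypothetical.

```python
from collections import defaultdict


def read_bed(lines):
    """Group (start, end) pairs by chromosome from four-column BED lines."""
    by_chrom = defaultdict(list)
    for line in lines:
        chrom, start, end, _value = line.split()[:4]
        by_chrom[chrom].append((int(start), int(end)))
    for intervals in by_chrom.values():
        intervals.sort()
    return by_chrom


def intersect_sorted(a, b):
    """Intersect two sorted lists of half-open intervals in one pass.

    Assumes the intervals inside each list do not overlap one another.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:              # non-empty intersection
            out.append((start, end))
        # Advance whichever interval ends first; it cannot overlap anything later.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def bed_overlaps(lines_a, lines_b):
    """Return {chrom: [(start, end), ...]} for regions covered by both inputs."""
    a, b = read_bed(lines_a), read_bed(lines_b)
    result = {}
    for chrom in a:
        if chrom in b:
            hits = intersect_sorted(a[chrom], b[chrom])
            if hits:
                result[chrom] = hits
    return result


peaks_a = ["chr1 100 200 5.0", "chr1 300 400 7.5"]
peaks_b = ["chr1 150 350 3.2", "chr2 0 10 1.0"]
print(bed_overlaps(peaks_a, peaks_b))
```

After sorting, each interval is visited once, so the sweep itself is linear in the total number of intervals per chromosome.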

HTSeq extract part of the GTF information