SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Updated How to convert .txt file to .bed .GFF or .BAR file format, forevermark4 Bioinformatics 2 06-30-2014 05:02 AM
GFF Annotation file AAWT Bioinformatics 3 12-11-2012 11:58 PM
Mixed read lengths in TopHat input file shurjo Bioinformatics 13 11-01-2012 04:46 AM
GFF file formatting naluru Bioinformatics 5 03-29-2011 11:21 AM
Convert segemehl *.map file into gff file. satheshsiva Bioinformatics 0 07-16-2010 04:40 AM

Reply
 
Thread Tools
Old 04-30-2010, 09:47 AM   #1
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default Getting transcript lengths from GFF file

Hello
I have the maize reference annotation downloaded from http://ftp.maizesequence.org/current/working-set/

the first 10 lines are as under:
Code:
UNKNOWN	ensembl	chromosome	1	14680007	.	.	.	ID=UNKNOWN;Name=chromosome:AGPv1:UNKNOWN:1:14680007:1
UNKNOWN	ensembl	CDS	17354	17446	.	+	0	Parent=GRMZM2G524142_T01;Name=CDS.1254885
UNKNOWN	ensembl	exon	17354	17460	.	+	.	Parent=GRMZM2G524142_T01;Name=GRMZM2G524142_E01
UNKNOWN	ensembl	gene	17354	17460	.	+	.	ID=GRMZM2G524142;Name=GRMZM2G524142;biotype=protein_coding
UNKNOWN	ensembl	mRNA	17354	17460	.	+	.	ID=GRMZM2G524142_T01;Parent=GRMZM2G524142;Name=GRMZM2G524142_T01;biotype=protein_coding
UNKNOWN	ensembl	exon	17537	17976	.	-	.	Parent=GRMZM2G369735_T01;Name=GRMZM2G369735_E03
UNKNOWN	ensembl	gene	17537	18259	.	-	.	ID=GRMZM2G369735;Name=GRMZM2G369735;biotype=protein_coding
UNKNOWN	ensembl	mRNA	17537	18259	.	-	.	ID=GRMZM2G369735_T01;Parent=GRMZM2G369735;Name=GRMZM2G369735_T01;biotype=protein_coding
UNKNOWN	ensembl	CDS	17716	17952	.	-	.	Parent=GRMZM2G369735_T01;Name=CDS.1254890
UNKNOWN	ensembl	exon	17972	18045	.	-	.	Parent=GRMZM2G369735_T02;Name=GRMZM2G369735_E02
How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?

thanks
Siva
Siva is offline   Reply With Quote
Old 04-30-2010, 11:22 AM   #2
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

First quick and dirty try:
awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file
I guess a much better solution will come soon..
steven is offline   Reply With Quote
Old 05-01-2010, 02:49 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Siva

Quote:
Originally Posted by Siva View Post
How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?
This is a typical use case for my HTSeq Python framework.

The following script does the job. Save it and run it with Python (after having installed the newest version, 0.4.3, of HTSeq):

Code:
import HTSeq

gff_file = HTSeq.GFF_Reader( "ZmB73_4a.53_WGS.gff.gz", end_included=True )

transcripts= {}

for feature in gff_file:
   if feature.type == "exon":
      transcript_id = feature.attr['Parent']
      if transcript_id not in transcripts:
         transcripts[ transcript_id ] = list()
      transcripts[ transcript_id ].append( feature )
      
for transcript_id in sorted( transcripts ):      
   transcript_length = 0
   for exon in transcripts[ transcript_id ]:
      transcript_length += exon.iv.length + 1
   print transcript_id, transcript_length, len( transcripts[ transcript_id ] )
After a minute or two, it outputs a list of transcripts, each line giving the transcript ID, the length (summed over all its exons), and the number of exons.

In order to advertise HTSeq a bit and demonstrate that it is easy to use I add here a commented version of the same script. Even if you've obtained only a minimal knowledge of Python (e.g., by reading only the first few chapters of Mark Lutz's very good book "Learning Python"), you will find that it iss easy to write such stuff yourself.

Code:
import HTSeq

# A GFF_Reader parses a GFF file and presents its content as
# a sequence of 'GenomicFeature' objects.
gff_file = HTSeq.GFF_Reader( "ZmB73_4a.53_WGS.gff.gz", end_included=True )
# In this GFF file, the 'end' coordinate is part of the feature,
# hence 'end_included=True'.

# The following dict will store, for each transcript, a list
# of its exons
transcripts= {}

# Go through the GFF file and look for all exons:
for feature in gff_file:
   # 'feature' is an HTSeq.GenomicFeature object, which describes a
   # line in the GFF file. It has several slots, for example, 'iv'
   # for the interval (where it is on the genome), 'type' for its
   # type (3rd column in the GFF file) and 'attr' for the
   # attributes (last column)
   # We are only interested in lines describing exons:
   if feature.type == "exon":
      # In this GFF file, the ID of the transcript to which the exons
      # belongs is stored in an attribute (i.e., something in the 9th
      # column) called "Parent"
      transcript_id = feature.attr['Parent']
      # Is this bthe first time we see this transcript?
      if transcript_id not in transcripts:
         # If so, add to the 'transcripts' dict an empty list 
         # store store the exons in
         transcripts[ transcript_id ] = list()
      # Add the exon to the list
      transcripts[ transcript_id ].append( feature )
      
# Now, go through the transcripts and calculate the length of each
# by adding up the lengths of its exons
for transcript_id in sorted( transcripts ):      
   transcript_length = 0
   # Go through the GenomicFeature
   for exon in transcripts[ transcript_id ]:
      # Each GenomicFeature object has a slot called 'iv', which is a
      # GenomicInterval object, which in turn contains slots such as
      # 'chrom', 'start', 'end', 'strand', 'length', etc., to describe
      # where the feature sits in the genome. We use 'length' here.
      transcript_length += exon.iv.length + 1
      # We add the 1 because this GFF file does no
   # Now, 'transcript_length' is the sum of the lengths of all the
   # transcripts exon.
   # Print this result, and also add the number of exons as (which is just the 
   # length of the list of exons for the transcript):
   print transcript_id, transcript_length, len( transcripts[ transcript_id ] )
All the classes mentioned here are, of course, explained in detail in HTSeq's reference documentation on the web site.

Simon
Simon Anders is offline   Reply With Quote
Old 05-02-2010, 10:06 AM   #4
JohnK
Senior Member
 
Location: Los Angeles, China.

Join Date: Feb 2010
Posts: 106
Default

Quote:
Originally Posted by Siva View Post
Hello
I have the maize reference annotation downloaded from http://ftp.maizesequence.org/current/working-set/

the first 10 lines are as under:
Code:
UNKNOWN	ensembl	chromosome	1	14680007	.	.	.	ID=UNKNOWN;Name=chromosome:AGPv1:UNKNOWN:1:14680007:1
UNKNOWN	ensembl	CDS	17354	17446	.	+	0	Parent=GRMZM2G524142_T01;Name=CDS.1254885
UNKNOWN	ensembl	exon	17354	17460	.	+	.	Parent=GRMZM2G524142_T01;Name=GRMZM2G524142_E01
UNKNOWN	ensembl	gene	17354	17460	.	+	.	ID=GRMZM2G524142;Name=GRMZM2G524142;biotype=protein_coding
UNKNOWN	ensembl	mRNA	17354	17460	.	+	.	ID=GRMZM2G524142_T01;Parent=GRMZM2G524142;Name=GRMZM2G524142_T01;biotype=protein_coding
UNKNOWN	ensembl	exon	17537	17976	.	-	.	Parent=GRMZM2G369735_T01;Name=GRMZM2G369735_E03
UNKNOWN	ensembl	gene	17537	18259	.	-	.	ID=GRMZM2G369735;Name=GRMZM2G369735;biotype=protein_coding
UNKNOWN	ensembl	mRNA	17537	18259	.	-	.	ID=GRMZM2G369735_T01;Parent=GRMZM2G369735;Name=GRMZM2G369735_T01;biotype=protein_coding
UNKNOWN	ensembl	CDS	17716	17952	.	-	.	Parent=GRMZM2G369735_T01;Name=CDS.1254890
UNKNOWN	ensembl	exon	17972	18045	.	-	.	Parent=GRMZM2G369735_T02;Name=GRMZM2G369735_E02
How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?

thanks
Siva
If I misunderstood your question, I apologize. Here's a perl-way:

< <input file> perl -e 'while(<>){ @splitted=split('\t', $_); $length= $splitted[4]-$splitted[3] + 1; $_=~s/\n//g; print $_, "\t", $length, "\n"; }' > <out file name> &

This will add the length on the end of each line and print to <out file name>. Use & to run in the background so you can carry on with life.

, And if you just want 'mRNA' "transcripts":

< <input file> perl -e 'while(<>){ @splitted=split('\t', $_); $length= $splitted[4]-$splitted[3] + 1; $_=~s/\n//g; if($_=~/^.*mRNA.*\n/){print $_, "\t", $length, "\n";} }' > <out file name> &

Only prints on lines with mRNA feature name included. Oh, and just so you can manipulate the above cml script to handle any various needs in the future, here's an explanation so you can expand:

^:= "from the beginning"
.*:= "zero or more of anything
\n:="newline char"
$_:= "perl special variable that refers to the current line of a file being read"
<>:="anonymous filehandle reading from stdin
\t:="tab char"
split(/pattern/, object or var):= "perl function that splits"
$_=~s/\n//g; #this code statement says to substitute any newline characters within $_ (the current line) globally (g), and sub them for nothing (notice empty /\n// forward slashes).

Last edited by JohnK; 05-02-2010 at 10:18 AM.
JohnK is offline   Reply With Quote
Old 05-02-2010, 10:12 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.
Simon Anders is offline   Reply With Quote
Old 05-02-2010, 02:47 PM   #6
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Quote:
Originally Posted by Simon Anders View Post
Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.
Thanks John and Steven for your suggestions and explanations. Simon, I posted my query in a hurry. However, you are right I will need transcript lengths summed over exons. Thanks for sharing your script. I will post my feedback after running your script soon, I have updated HTSeq to the latest version. I will see if I can tweak John's command script. btw I have also just installed Bioperl 1.6.1 so I can use some its features too.

more later
many thanks
Siva
Siva is offline   Reply With Quote
Old 05-02-2010, 10:25 PM   #7
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Hi Simon:
I keep getting an unexpected keyword argument error while trying to compile your script. I have the latest HTSeq 0.4.3 installed. Here is the error:

Code:
"../simon_anders_gff.py", line 3, in <module>
    gff_file = HTSeq.GFF_Reader("ZmB73_4a.53_WGS.gff",end_included=True)
TypeError: __init__() got an unexpected keyword argument 'end_included'
Siva
Siva is offline   Reply With Quote
Old 05-03-2010, 01:21 AM   #8
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Quote:
Originally Posted by Simon Anders View Post
Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.
Well, that's precisely what my one-liner (#2) is actually supposed to do

awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file

For each line with a third column "mRNA" and a 9th column starting with "ID", it extracts the ID (more precisely, what is between the "ID=" and the first "_T" in the column #9 -I assume this format is conserved throughout all the file). It adds to the total length of this ID (initially 0) the length of the corresponding exon. "L" stores the total length of the IDs. Each time an exon is found with a known ID, the total length is updated.
Once the whole file has been read ("END" loop), all IDs are printed with their corresponding length. The output should be a text file with two columns: one with the mRNA IDs, the other one with their sizes.

Please let me know if there is something wrong with this line and/or if something else than the mRNA lengths is required (in which case, a more elegant and clean package like HTSeq is probably the best option indeed).
steven is offline   Reply With Quote
Old 05-05-2010, 08:57 AM   #9
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Quote:
Originally Posted by steven View Post
Well, that's precisely what my one-liner (#2) is actually supposed to do

awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file
Steven your awk one liner works great. For example, I have 21 lines in the truncated GFF file depicting only a single gene (ID) with two transcripts on 'chr1'.

Code:
 Input file: ZmB73_4a.53_WGS_chr1_21.gff
chr1    ensembl chromosome      1       300239041       .       .       .       ID=1;Name=chromosome:AGPv1:1:1:300239041:1
chr1    ensembl exon    3       104     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E07
chr1    ensembl gene    3       3807    .       +       .       ID=GRMZM2G060082;Name=GRMZM2G060082;biotype=protein_coding
chr1    ensembl mRNA    3       3807    .       +       .       ID=GRMZM2G060082_T01;Parent=GRMZM2G060082;Name=GRMZM2G060082_T01;biotype=protein_coding
chr1    ensembl intron  105     199     .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl exon    200     313     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E06
chr1    ensembl CDS     230     313     .       +       .       Parent=GRMZM2G060082_T01;Name=CDS.98360
chr1    ensembl intron  314     421     .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl CDS     422     604     .       +       0       Parent=GRMZM2G060082_T01;Name=CDS.98361
chr1    ensembl exon    422     604     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E05
chr1    ensembl intron  605     1015    .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl CDS     1016    1233    .       +       0       Parent=GRMZM2G060082_T01;Name=CDS.98362
chr1    ensembl exon    1016    1233    .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E04
chr1    ensembl intron  1234    1620    .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl CDS     1621    1966    .       +       2       Parent=GRMZM2G060082_T01;Name=CDS.98363
chr1    ensembl exon    1621    3807    .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E01
chr1    ensembl exon    2607    3218    .       +       .       Parent=GRMZM2G060082_T02;Name=GRMZM2G060082_E03
chr1    ensembl mRNA    2607    3754    .       +       .       ID=GRMZM2G060082_T02;Parent=GRMZM2G060082;Name=GRMZM2G060082_T02;biotype=protein_coding
chr1    ensembl CDS     2765    2947    .       +       .       Parent=GRMZM2G060082_T02;Name=CDS.98367
chr1    ensembl intron  3219    3300    .       +       .       Parent=GRMZM2G060082_T02
chr1    ensembl exon    3301    3754    .       +       .       Parent=GRMZM2G060082_T02;Name=GRMZM2G060082_E02
A friend and I made minor modifications to your one-liner to give transcript lengths summed over exons like this:

Code:
command:
awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);L[substr($9,8)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
This is the output:

Code:
GRMZM2G060082_T01 2804
GRMZM2G060082_T02 1066
Awk is truly powerful!! Can you modify it further to add a column that gives the number of exons per transcript?

thanks
Siva

Last edited by Siva; 05-05-2010 at 09:34 AM.
Siva is offline   Reply With Quote
Old 05-05-2010, 09:13 AM   #10
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Quote:
Originally Posted by Siva View Post
Hi Simon:
I keep getting an unexpected keyword argument error while trying to compile your script. I have the latest HTSeq 0.4.3 installed. Here is the error:

Code:
"../simon_anders_gff.py", line 3, in <module>
    gff_file = HTSeq.GFF_Reader("ZmB73_4a.53_WGS.gff",end_included=True)
TypeError: __init__() got an unexpected keyword argument 'end_included'
Siva
Simon:
Do I need to install another package for your script to calculate length to work?
I built HTSeq from source and installed it like a binary as per your instructions. htseq-count works fine.

Siva
Siva is offline   Reply With Quote
Old 05-05-2010, 12:04 PM   #11
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by Siva View Post
Simon:
Do I need to install another package for your script to calculate length to work?
I built HTSeq from source and installed it like a binary as per your instructions. htseq-count works fine.
No, you need to upgrade to the newest version of HTSeq.

The reason is an annoying lack of clarity in the GFF specs, which I noticed when I wrote my post above. If, in a GFF file, a feature is said to range from, say, 1000 to 1015, is the base at position 1015 part of the feature? In other words: Is the length of the feature 1015-1000=15, or is it 1015-1000+1=16?

If you look at GTF files from Ensembl, you will notice that the end is not included. You can see that from looking at CDS entries, because their length has to be a mutiple of three. Your GFF file however, by the same criterion, includes the ende.

So, who is right? The original GFF specification is silent on the issue. Hence, I added a new option to HTSeq.GFF_Reader to allow you to specify which convention should be used.

Simon
Simon Anders is offline   Reply With Quote
Old 05-05-2010, 02:54 PM   #12
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

simon:
I see!! Although I installed the latest version of HTSeq over the old version. Every time I type >>import HTSeq.....the old version gets imported. I will see if I can upgrade it.

Siva
Siva is offline   Reply With Quote
Old 05-05-2010, 08:27 PM   #13
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

deleted this message as I posted a new one...!!

Last edited by Siva; 05-05-2010 at 08:52 PM.
Siva is offline   Reply With Quote
Old 05-05-2010, 08:50 PM   #14
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Hi Simon
I am not sure what is happening. I built and installed your May 4 release HTSeq 0.4.3-p1, but still I get the same error "unexpected keyword end_included" while calling HTSeq.GFF_Reader. I thought that when installing a new version of a Python package the older version is automatically upgraded and the old dependencies etc., are erased. That does not seem to be the case here. Should I upgrade PyPI on the lines of CPAN::upgrade? Is there something else I am overlooking?

Siva
ps: this is the error message

Code:
>>> gff_file = HTSeq.GFF_Reader ("ZmB73_4a.53_WGS.gff.gz", end_included=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'end_included'
Siva is offline   Reply With Quote
Old 05-05-2010, 11:41 PM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by Siva View Post
I thought that when installing a new version of a Python package the older version is automatically upgraded and the old dependencies etc., are erased.
It seems that it's still there. Try the following. Start a Python session and type:

Code:
import HTSeq
print HTSeq.__file__
This will print out where Python found HTSeq and will indicate to you where the directory with the outdated version of HTSeq is hanging out on your hard disk. Just delete it and try again.

This here, by the way, gives you a list of all directories oin which Python searches (in the given order) for packages when doing an 'import':

Code:
import sys
print sys.path
Hope that helps
Simon
Simon Anders is offline   Reply With Quote
Old 05-06-2010, 07:52 AM   #16
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Quote:
Originally Posted by Siva View Post
Steven your awk one liner works great.
Thanks, glad i could help.

Quote:
Awk is truly powerful!!
Yes, but not the best if you need more than a quick and dirty one liner. My ones sometimes get pages long and i do not recommend this. One typo in the name of a variable and you are done.. plus the memory management is not as good as in perl i think (i don't know about python).

Quote:
Can you modify it further to add a column that gives the number of exons per transcript?
Sure. I would simply try

Code:
awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);ID=substr($9,8);L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
cheers,
s.
steven is offline   Reply With Quote
Old 05-06-2010, 01:31 PM   #17
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

[QUOTE=Simon Anders;18102]It seems that it's still there. Try the following. Start a Python session and type:

Code:
import HTSeq
print HTSeq.__file__
This will print out where Python found HTSeq and will indicate to you where the directory with the outdated version of HTSeq is hanging out on your hard disk. Just delete it and try again.

Simon
Thanks, it now works. Will post feedback after running it on the full GFF.

Siva
Siva is offline   Reply With Quote
Old 05-06-2010, 01:33 PM   #18
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Quote:
Originally Posted by steven View Post
Sure. I would simply try

Code:
awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);ID=substr($9,8);L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' ZmB73_4a.53_WGS_chr1_21.gff
cheers,
s.
Steven
This works fine too. Thanks!! Will post feedback soon.

Siva
Siva is offline   Reply With Quote
Old 01-22-2014, 05:28 AM   #19
eastasiasnow
Junior Member
 
Location: China

Join Date: Jan 2014
Posts: 8
Default

when running py script on gff file of zea mays v3.21 from ensembl, it popped up this error:
Quote:
Traceback (most recent call last):
File "lenTransGff.py", line 14, in <module>
for feature in gff_file:
File "/usr/lib64/python2.6/site-packages/HTSeq-0.5.4p5-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 223, in __iter__
iv = GenomicInterval( seqname, int(start)-1, int(end), strand )
File "_HTSeq.pyx", line 62, in HTSeq._HTSeq.GenomicInterval.__init__ (src/_HTSeq.c:2789)
File "_HTSeq.pyx", line 71, in HTSeq._HTSeq.GenomicInterval.strand.__set__ (src/_HTSeq.c:2910)
ValueError: Strand must be'+', '-', or '.'.
any idea?

for latest ensembl gff file of zea mays AGPv3.21, the awk script should also need some changes for matching in column 9. here is the one I modified:
Quote:
zcat Zea_mays.AGPv3.21.gff3.gz | awk '$3=="exon" && $9~/Parent/{sub(/.*Parent=/,"",$9);sub(/;.*/,"",$9);ID=$9;L[ID]+=$5-$4+1;N[ID]++}END{for(i in L){print i,L[i],N[i]}}' > len_trans_Enum.txt
hope some one need this.
eastasiasnow is offline   Reply With Quote
Old 08-05-2016, 06:39 AM   #20
swan_r
Junior Member
 
Location: North Carolina

Join Date: Apr 2012
Posts: 5
Default Need to find intronic regions from a gtf gile

Hello Simon,

I found the HTSeq amazing to work with, I am trying to find intronic regions from a gtf file for horse. Here in this thread, you have clearly explained to get transcripts and their respective lengths. But what if a single gene have multiple transcripts, i wanted to choose the transcripts with maximum length. I wanted a output which gives gene_id, transcript_id, intron_start, intron_end, transcript_length.

Could you please give me a idea how to do this using HTseq.

Thanks in advance.
swan_r is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:22 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO