**Jeremy** · 01-23-2013, 05:23 PM

It might help if you list the commands you used for HTSeq.

**chadn737** · 01-23-2013, 05:38 PM

I would use the GFF files from TAIR, as those are the original files and will be the most up to date.

TAIR10, the most recent genome version, can be found here: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/

TAIR9 is still often used; it is here: ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/

The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.

**Jeremy** · 01-23-2013, 05:58 PM

I mean which command line you used, i.e. what you filled in for this:

Code:

htseq-count [options] <sam_file> <gff_file>

For example:

Code:

htseq-count -t gene -i gene_id <sam_file> <gff_file>

**Artur Jaroszewicz** · 01-23-2013, 05:59 PM

> Originally posted by chadn737:
>
> I would use the GFF files from TAIR, as those are the original files and will be the most up to date.
>
> TAIR10, the most recent genome version, can be found here: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/
>
> TAIR9 is still often used; it is here: ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/
>
> The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.

Thanks for the input, I'm trying it now. To the previous response, I pretty much followed Anders' example on this:

#!/usr/bin/python2.7 -tt

import HTSeq
import sys

def main():

    sys.stderr.write("Initializing program.\n")

    # Pick the annotation file from the genome name given as first argument.
    # human
    if sys.argv[1] == 'hg19':
        gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/hg19.ensembl.gtf")
    # mouse
    elif sys.argv[1] == 'mm9':
        gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/mm9.ensembl.gtf")
    elif sys.argv[1] == 'TAIR10':
        gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/TAIR10_GFF3_genes.gff")
    else:
        gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/" + sys.argv[1] + ".ensembl.gtf")

    # Split each SAM path into directory and file name. Only the directory of
    # the last file is kept, so all SAM files must sit in the same directory.
    filelist = []
    for filename in sys.argv[2:]:
        name_split = filename.split('/')
        filelist.append(name_split[-1])
        if len(name_split) > 1:
            path = '/'.join(name_split[:-1]) + '/'
        else:
            path = './'

    # Sample names are the file names minus their last 18 characters.
    samplelist = []
    for entry in filelist:
        samplelist.append(entry[:-18])

    # Map every exon interval to the set of feature names covering it.
    exons = HTSeq.GenomicArrayOfSets("auto", stranded=True)
    for feature in gtf:
        if feature.type == "exon":
            exons[feature.iv] += feature.name

    sys.stderr.write("Created exon array.\n")

    # One counter per feature name and per input file.
    counts = {}
    for feature in gtf:
        if feature.type == "exon":
            counts[feature.name] = [0] * len(filelist)

    sys.stderr.write("Initialized count dict.\n")

    for filenum in range(len(filelist)):
        sam_file = HTSeq.SAM_Reader(path + filelist[filenum])
        for alnmt in sam_file:
            if alnmt.aligned:
                # Intersect the feature sets of all steps the read overlaps.
                intersection_set = None
                for iv2, step_set in exons[alnmt.iv].steps():
                    if intersection_set is None:
                        intersection_set = step_set.copy()
                    else:
                        intersection_set.intersection_update(step_set)
                # Count the read only if exactly one feature remains.
                if intersection_set is not None and len(intersection_set) == 1:
                    counts[list(intersection_set)[0]][filenum] += 1
        sys.stderr.write("Counted hits per gene for " + samplelist[filenum] + ".\n")

    sys.stderr.write("Printing output.\n")

    # Tab-separated table: one row per gene, one column per sample.
    sys.stdout.write('Gene\t' + '\t'.join(samplelist) + '\n')
    for gene in sorted(counts.keys()):
        sys.stdout.write(gene)
        for filenum in range(len(filelist)):
            sys.stdout.write('\t' + str(counts[gene][filenum]))
        sys.stdout.write('\n')

    sys.stderr.write("Done.\n")

if __name__ == "__main__":
    main()

Artur

**Artur Jaroszewicz** · 01-23-2013, 06:12 PM

I seem to have found my problem -- one of the lines in the GTF file has a semicolon in the gene_name: gene_name "PIP1;3"

I hope this resolves my problem!
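
A semicolon inside a quoted value can trip up any parser that splits the GTF attribute column on every `;`: the value is cut in two and each half is left with a single, unmatched quote. The sketch below is plain Python written for illustration, not HTSeq's actual parsing code. The `gene_id` value and both function names are made up; only `gene_name "PIP1;3"` comes from the post above.

```python
def split_naive(attr_field):
    """Split the attribute column on every semicolon, ignoring quotes."""
    return [part.strip() for part in attr_field.split(";") if part.strip()]


def split_quote_aware(attr_field):
    """Split on semicolons only when they fall outside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in attr_field:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ";" and not in_quotes:
            # End of one attribute: store it and start collecting the next.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return [p for p in parts if p]


# Hypothetical attribute column; the gene_id is invented for this example.
attrs = 'gene_id "EXAMPLE_ID"; gene_name "PIP1;3";'

print(split_naive(attrs))
print(split_quote_aware(attrs))
```

With the naive split, `gene_name "PIP1` and `3"` come out as separate pieces, each with one dangling quote. The quote-aware version keeps `gene_name "PIP1;3"` together.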

**Simon Anders** · 01-24-2013, 01:20 AM

Ah, that might explain it. On the other hand, maybe not: a semicolon is not a quote. However, I have a dim recollection that I have once seen a prime (a.k.a. single quote: ' ) in an Arabidopsis gene name. So, maybe check for this, too.

(Who puts special characters into gene names?! Some biologists just seem to want to make life miserable for us bioinformaticians. :-| )
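
One way to check an annotation for both kinds of character is to scan every quoted attribute value for semicolons and single quotes. This is only a sketch in plain Python: it assumes the usual nine tab-separated GTF columns with `key "value";` attributes, and the function name, the regular expression, and the sample lines are all made up for illustration.

```python
import re

# Matches key "value" pairs in a GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def find_suspicious_values(lines, bad_chars=";'"):
    """Return (line_number, key, value) for values containing any of bad_chars."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        for key, value in ATTR_RE.findall(fields[8]):
            if any(ch in value for ch in bad_chars):
                hits.append((lineno, key, value))
    return hits


# Invented sample lines standing in for a real annotation file.
sample = [
    'Chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "PIP1;3";\n',
    'Chr1\tsrc\texon\t200\t300\t.\t+\t.\tgene_id "G2"; gene_name "ABC1";\n',
    'Chr1\tsrc\texon\t400\t500\t.\t-\t.\tgene_id "G3"; gene_name "X\'Y";\n',
]

for hit in find_suspicious_values(sample):
    print(hit)
```

To run it on a real file, pass an open file handle instead of the list, for example `find_suspicious_values(open("annotationFile.gtf"))`.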

**areyes** · 01-24-2013, 01:32 AM

Yes! These semicolons in gene names also made me suffer at some point. In case it is useful for anyone, this Perl one-liner removes these annoying semicolons and turns them into a "-":

perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

And the HTSeq gtf reader does not get confused anymore!

**Artur Jaroszewicz** · 01-24-2013, 11:22 AM

> Originally posted by areyes:
>
> Yes! These semicolons in gene names also made me suffer at some point. In case it is useful for anyone, this Perl one-liner removes these annoying semicolons and turns them into a "-":
>
> perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf
>
> And the HTSeq gtf reader does not get confused anymore!

I'll use this script the next time I run into a similar problem. For now, I fixed it manually: I realized Perl would be the right tool for this, but learning Perl would take much longer than changing by hand the approximately 20 genes that were giving me problems. If anyone runs into a similar issue, you can use my fixed version: https://docs.google.com/file/d/0B4Em...dvc0kycU0/edit

Thanks!

**Jeremy** · 01-24-2013, 06:08 PM

I guess it's no longer important, but the reason I was asking which command you used is that I was going to suggest switching the attribute with the -i option. Changing it to gene_id would also have fixed the problem, but then you would need to convert the IDs back if you wanted gene_name.
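
If counting is done by `gene_id`, the conversion back to `gene_name` can be done afterwards with a lookup table built from the same annotation. The following is a rough sketch in plain Python, not part of HTSeq. It assumes a GTF whose attribute column carries both `gene_id` and `gene_name`, and a tab-separated count table with the ID in the first column. Function names and sample data are invented.

```python
import re

# Matches key "value" pairs; semicolons inside the quotes are kept.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_id_to_name(gtf_lines):
    """Build a gene_id -> gene_name dictionary from GTF lines."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            # Keep the first name seen for each ID.
            mapping.setdefault(attrs["gene_id"], attrs["gene_name"])
    return mapping


def relabel_counts(count_lines, mapping):
    """Swap the ID in the first column for its gene name, if one is known."""
    out = []
    for line in count_lines:
        gene_id, _, rest = line.rstrip("\n").partition("\t")
        out.append(mapping.get(gene_id, gene_id) + "\t" + rest)
    return out


# Invented sample data.
gtf = ['Chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "PIP1;3";\n']
counts = ["G1\t12", "G2\t3"]

print(relabel_counts(counts, gene_id_to_name(gtf)))
```

IDs without a known name are left unchanged. Gene names are not guaranteed to be unique, so relabelled rows may need to be checked for duplicates.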

**JustinH** · 09-26-2014, 10:57 AM

> Originally posted by areyes:
>
> Yes! These semicolons in gene names also made me suffer at some point. In case it is useful for anyone, this Perl one-liner removes these annoying semicolons and turns them into a "-":
>
> perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf
>
> And the HTSeq gtf reader does not get confused anymore!

Areyes,

When I used your Perl script on my GTF, the gene names with semicolons were completely replaced by a dash. Do you know why this could have occurred? I am using the TAIR10.22 A. thaliana annotation file. Any help is appreciated.

Justin

**Simon Anders** · 09-26-2014, 11:23 AM

That was the point of Alejandro's script: to replace the apostrophes and semicolons that confused the parser with something else. I fixed the parser a while ago, however, so there is no longer any need for this.
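
For anyone still on an older parser: the result Justin describes is what one would expect from the one-liner as posted. Its pattern `\"\S+;\S+\"` contains no capture groups, so `\1` and `\2` appear to expand to nothing and the whole quoted name, quotes included, collapses to `-`. The Python sketch below mimics that behaviour and shows one possible alternative that changes only the semicolons inside double-quoted values. It is an illustration, not Alejandro's script; the function names and the `gene_id` value are made up.

```python
import re


def replace_whole_match(line):
    """Mimic the reported behaviour: the whole quoted name becomes '-'."""
    # No capture groups, so nothing of the original name is kept.
    return re.sub(r'"\S+;\S+"', "-", line)


def replace_semicolons_in_quotes(line, replacement="-"):
    """Replace semicolons only inside double-quoted values."""
    def fix(match):
        return '"' + match.group(1).replace(";", replacement) + '"'
    return re.sub(r'"([^"]*)"', fix, line)


# Hypothetical attribute column; the gene_id is invented for this example.
line = 'gene_id "EXAMPLE_ID"; gene_name "PIP1;3";'

print(replace_whole_match(line))
print(replace_semicolons_in_quotes(line))
```

The second function keeps the quotes and the separating semicolons between attributes, so the line remains a well-formed attribute column with `gene_name "PIP1-3"`.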

HTSeq: mismatched quotes issue?