SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
MV emPCR issue ajthomas 454 Pyrosequencing 1 10-23-2012 08:12 PM
Issue with htseq-count cpleis Bioinformatics 8 10-15-2012 09:31 AM
Issue with LiftOver ashkot Bioinformatics 1 05-17-2012 10:25 AM
CisGenome Issue Nicholas_ Bioinformatics 0 10-10-2011 01:24 AM
Best aligner for 20% mismatched? feederbing Bioinformatics 21 09-16-2011 06:51 AM

Reply
 
Thread Tools
Old 01-23-2013, 03:25 PM   #1
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Exclamation HTSeq: mismatched quotes issue?

Hi all,

I have been using HTSeq and DESeq for my RNA Seq pipeline at UCLA, and everything is great except for one thing: I am having trouble with counting reads with HTSeq for Arabidopsis Thaliana. I have so far used two different GTFs, and they both return the same error:

Quote:
Traceback (most recent call last):
File "python_scripts/sam_to_gene_array_2.py", line 80, in <module>
main()
File "python_scripts/sam_to_gene_array_2.py", line 41, in main
for feature in gtf:
File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.
I've looked through the gtf, and I haven't been able to find any mismatched quotes or anything out of the ordinary, except for the attribute line, which has slightly different fields (does not have gene_biotype, has p_id and tss_id). Here's a sample of the gtf file I'm using:

Quote:
1 protein_coding exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; seqedit "false"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
1 protein_coding CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
1 protein_coding start_codon 3760 3762 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
1 protein_coding CDS 3996 4276 . + 2 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "ANAC001"; p_id "P21642"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
Has anyone run into the same problem? I've used two different GTFs based on someone else's suggestion, one from Ensembl plants and another from iGenomes, and they both have the same error.

Any input or suggestions would be greatly appreciated. I've posted this on a different thread, but it was a bit of a different issue, so I've started a new one.
Artur Jaroszewicz is offline   Reply With Quote
Old 01-23-2013, 04:23 PM   #2
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

Might help if you list the commands you used for HT-Seq.
Jeremy is offline   Reply With Quote
Old 01-23-2013, 04:38 PM   #3
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

I would use the gff files from TAIR as those will be the most up to date and original files.

TAIR10 can be found here, it is the most recent genome version: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/

TAIR9 is still often used, it is here:ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/

The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.
chadn737 is offline   Reply With Quote
Old 01-23-2013, 04:58 PM   #4
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

I mean what commands
Code:
htseq-count [options] <sam_file> <gff_file>
for example
Code:
htseq-count -t gene -i gene_id <sam_file> <gff_file>
Jeremy is offline   Reply With Quote
Old 01-23-2013, 04:59 PM   #5
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Quote:
Originally Posted by chadn737 View Post
I would use the gff files from TAIR as those will be the most up to date and original files.

TAIR10 can be found here, it is the most recent genome version: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/

TAIR9 is still often used, it is here:ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/

The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.
Thanks for the input, I'm trying it now. To the previous response, I pretty much followed Anders' example on this:
Quote:
#!/usr/bin/python2.7 -tt

import HTSeq
import itertools
import sys

def main():

sys.stderr.write("Initializing program.\n")

#human
if sys.argv[1] == 'hg19':
gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/hg19.ensembl.gtf")
#mouse
elif sys.argv[1] == 'mm9':
gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/mm9.ensembl.gtf")
elif sys.argv[1] == 'TAIR10':
gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/TAIR10_GFF3_genes.gff")
else:
gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/" + sys.argv[1] + ".ensembl.gtf")
####
'''
path = sys.argv[2]
if not path[-1] == '/':
path += '/'
'''
filelist = []
for file in sys.argv[2:]:
name_split = file.split('/')
filelist.append(name_split[-1])
if len(name_split) > 1:
path = '/'.join(name_split[:-1]) + '/'
if not path[0] == '/':
path = '/' + path
else:
path = './'
samplelist = []
for entry in filelist:
samplelist.append(entry[:-18])
####

exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True)
for feature in gtf:
if feature.type == "exon":
exons[feature.iv] += feature.name

sys.stderr.write("Created exon array.\n")

counts = {}
for feature in gtf:
if feature.type == "exon":
counts[feature.name] = [0] * len(filelist)

sys.stderr.write("Initialized count dict.\n")

for filenum in range(len(filelist)):
sam_file = HTSeq.SAM_Reader( path + filelist[filenum] )
for alnmt in sam_file:
if alnmt.aligned:
intersection_set = None
for iv2, step_set in exons[alnmt.iv].steps():
if intersection_set is None:
intersection_set = step_set.copy()
else:
intersection_set.intersection_update(step_set)
if len(intersection_set) == 1:
counts[list(intersection_set)[0]][filenum] += 1
sys.stderr.write("Counted hits per gene for " + samplelist[filenum] + ".\n")

sys.stderr.write("Printing output.\n")

sys.stdout.write('Gene\t' + '\t'.join(samplelist) + '\n')
for gene in sorted(counts.keys()):
sys.stdout.write(gene)
for filenum in range(len(filelist)):
sys.stdout.write('\t' + str(counts[gene][filenum]))
sys.stdout.write('\n')

sys.stderr.write("Done.\n")

if __name__ == "__main__":
main()
Artur
Artur Jaroszewicz is offline   Reply With Quote
Old 01-23-2013, 05:12 PM   #6
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

I seem to have found my problem -- one of the lines in the GTF file has a semicolon in the gene_name: gene_name "PIP1;3"

I hope this resolves my problem!
Artur Jaroszewicz is offline   Reply With Quote
Old 01-24-2013, 12:20 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Ah, that might explain it. On the other hand, maybe not: a semicolon is not a quote. However, I have a dim recollection that I have once seen a prime (a.k.a. single quote: ' ) in an Arabidopsis gene name. So, maybe check for this, too.

(Who puts special characters into gene names?! Some biologists just seem to want to make life miserable for us bioinformaticians. :-| )
Simon Anders is offline   Reply With Quote
Old 01-24-2013, 12:32 AM   #8
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

And the HTSeq gtf reader does not get confused anymore!

Last edited by areyes; 01-24-2013 at 01:32 AM.
areyes is offline   Reply With Quote
Old 01-24-2013, 10:22 AM   #9
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Quote:
Originally Posted by areyes View Post
Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

And the HTSeq gtf reader does not get confused anymore!
I'll use this script for next time I run into a similar problem. For now, I fixed it manually (realized I should use Perl for this, and learning Perl will take much longer than changing the approximately 20 genes that were giving me problems by hand). For now, if anyone runs into a similar issue, you can use my fixed version: https://docs.google.com/file/d/0B4Em...dvc0kycU0/edit

Thanks!
Artur Jaroszewicz is offline   Reply With Quote
Old 01-24-2013, 05:08 PM   #10
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

I guess it's no longer important, but the reason I was asking which command you used is that I was going to suggest switching the attribute using the -i command. Changing it to gene_id would also have fixed the problem, but then you would need to convert if you wanted gene_name.
Jeremy is offline   Reply With Quote
Old 09-26-2014, 10:57 AM   #11
JustinH
Junior Member
 
Location: Michigan

Join Date: Aug 2014
Posts: 7
Default

Quote:
Originally Posted by areyes View Post
Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

And the HTSeq gtf reader does not get confused anymore!
Areyes,

When I used your perl script on my GTF, the gene names with semi-colons were completely replaced by a dash. Do you know why this could have occured? I am using the TAIR10.22 A. thaliana annotation file. Any help is appreciated.

Justin
JustinH is offline   Reply With Quote
Old 09-26-2014, 11:23 AM   #12
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

That was the point of Alejandro's script: to replace the apostrophes and semicolons that confused the parser with something else. I have fixed the parser a while ago, however, so there is no need any more for this.
Simon Anders is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:48 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO