SEQanswers

Old 02-17-2011, 10:39 AM   #1
MDonlin
Member
 
Location: St. Louis, MO

Join Date: May 2010
Posts: 14
Default Error with GTF file when using htseq-count

Hi,

I just finished installing HTSeq on Mac OS X with Python 2.6.6 and the latest version of NumPy.

I can execute the first few commands of the HTSeq tour using the yeast example sequence file, so the install seems to be working.

I invoked the htseq-count script using the following:
>python -m HTSeq.scripts.count 45minCt_1.sam cneoh99.gtf

and I get the following error:
Error occured in line 1 of file cneoh99.gtf.
Error: The attribute string seems to contain mismatched quotes.
[Exception type: ValueError, raised in __init__.py:167]

The first few lines of my GTF file look like:
Chr1 CNA2_FINAL_CALLGENES_1 start_codon 11499 11501 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
Chr1 CNA2_FINAL_CALLGENES_1 stop_codon 11060 11062 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
Chr1 CNA2_FINAL_CALLGENES_1 exon 11430 11501 . - . "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"

I've attached an excerpt of the file.
Do I need headers in this file?

Thanks for any help.

Regards,
Maureen
Attached Files
File Type: txt cneo_gtfexcerpt.txt (1.1 KB, 23 views)
Old 02-17-2011, 11:14 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Well, there obviously are mismatched quotes in your attribute strings. In a proper GTF file, the first line should look like this:

Code:
Chr1   CNA2_FINAL_CALLGENES_1   start_codon   11499   11501   .   -   0   gene_id "CNAG_00001"; transcript_id "CNAG_00001T0"
All these extra quotes make little sense and are confusing to HTSeq. It actually looks a bit as if you loaded the file with a spreadsheet program and saved it again. Doing something like this might introduce extra quotes.

Where did you get the GTF file from?
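
If a spreadsheet round trip is indeed the cause, the damage is mechanical and can probably be undone. Here is a minimal sketch, assuming the file is still tab-separated and that only the ninth column was wrapped in quotes with the inner quotes doubled (as in the lines posted above). `repair_attribute_column` is a hypothetical helper name, not part of HTSeq.

```python
def repair_attribute_column(line):
    """Undo spreadsheet-style quoting of the 9th (attribute) GTF column.

    Assumes tab-separated fields. Lines that do not have exactly nine
    fields, or whose attribute column is not wrapped in quotes, are
    returned unchanged (minus the trailing newline).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line.rstrip("\n")
    attr = fields[8]
    wrapped = len(attr) >= 2 and attr.startswith('"') and attr.endswith('"')
    if wrapped and '""' in attr:
        # Drop the outer quotes, then collapse doubled quotes to single ones.
        attr = attr[1:-1].replace('""', '"')
    fields[8] = attr
    return "\t".join(fields)
```

Run it over every line of the file and write the result to a new GTF; keep the original until htseq-count accepts the repaired copy.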
Old 01-16-2013, 08:44 PM   #3
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default Same problem, different GTF

Hi Simon,

I was wondering if you could possibly help me with my problem. I downloaded the Arabidopsis thaliana Ensembl GTF from plants.ensembl.org. Here's a sample:

Quote:
1 protein_coding CDS 30424421 30424675 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1"; protein_id "AT1G80990.1";
1 protein_coding start_codon 30424421 30424423 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1";
When I try to run HTSeq, it gives me the same error as above:

Quote:
Traceback (most recent call last):
File "python_scripts/sam_to_gene_array_2.py", line 80, in <module>
main()
File "python_scripts/sam_to_gene_array_2.py", line 41, in main
for feature in gtf:
File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.
Any ideas why this could be happening? Thank you in advance, and thank you for all your help in the past.

Best Regards,
Artur Jaroszewicz
Old 01-17-2013, 02:26 AM   #4
DonDolowy
Member
 
Location: Freiburg

Join Date: Oct 2012
Posts: 56
Default

If you download the GTF from the iGenomes, it should work:
http://tophat.cbcb.umd.edu/igenomes.html
Old 01-17-2013, 10:57 PM   #5
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Still getting the same error:
Quote:
Traceback (most recent call last):
File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 80, in <module>
main()
File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 41, in main
for feature in gtf:
File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.
Any other suggestions?
Old 02-07-2013, 08:45 PM   #6
Mahtab
Member
 
Location: University of Melbourne

Join Date: Aug 2011
Posts: 10
Default

I have the same problem with Arabidopsis and RNA-Seq in Galaxy, and I have tried different GTF files from Ensembl and arabidopsis.org.

Any ideas?


Thanks
Old 02-07-2013, 09:30 PM   #7
Artur Jaroszewicz
Member
 
Location: Los Angeles

Join Date: Sep 2011
Posts: 45
Default

Hi Mahtab,

Yes, I actually solved the problem. I thought I had posted the solution, but evidently not. I guess it was in another thread that I started. Anyway, there are maybe 100 lines or so that have semicolons in the gene_id of the attribute field, so I wrote a quick script to take care of it. If you'd like to use my modified GTF, you can download it at:
http://pellegrini.mcdb.ucla.edu/Artu...10.ensembl.gtf

Good luck in your analysis!

Artur
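
For anyone who would rather check their own file than download a modified one, here is a rough sketch of how such lines could be found and patched. It is not the script mentioned above (that one was never posted), and it is not HTSeq's own parsing logic. It assumes tab-separated GTF lines, and the function names and the choice of "," as the replacement character are made up for illustration.

```python
def find_suspect_lines(lines):
    """Yield (line_number, line) for GTF lines whose attribute column has
    a ';' inside a quoted value, or otherwise unbalanced quotes.

    Heuristic: after a naive split on ';', every piece of a well-formed
    attribute string contains an even number of double quotes.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        pieces = fields[8].split(";")
        if any(piece.count('"') % 2 for piece in pieces):
            yield number, line.rstrip("\n")


def replace_quoted_semicolons(attr, replacement=","):
    """Replace ';' characters that occur inside double-quoted values."""
    out = []
    in_quotes = False
    for ch in attr:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ";" and in_quotes:
            ch = replacement
        out.append(ch)
    return "".join(out)
```

Note that rewriting a gene_id changes the name that ends up in the count table, so it is worth looking at the lines reported by `find_suspect_lines` before patching them.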
Old 02-08-2013, 05:18 AM   #8
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62
Default

Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

I can't say I've spent a great deal of time actually LOOKING at GTF files (although I have spent a great deal of time struggling with getting programs to accept them), but since every data source's GTF format seems to be (eventually) convertible into any type of input, it should be doable, right?
Old 02-08-2013, 07:09 AM   #9
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169
Default

Quote:
Originally Posted by jparsons
Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

I can't say I've spent a great deal of time actually LOOKING at GTF files (although I have spent a great deal of time struggling with getting programs to accept them), but since every data source's GTF format seems to be (eventually) convertible into any type of input, it should be doable, right?
There is a standard defined for GTF files. The problem isn't the standard, it's when people create files that do not conform to that standard, e.g. including a semicolon in your gene_id.
Old 02-11-2013, 08:56 PM   #10
Mahtab
Member
 
Location: University of Melbourne

Join Date: Aug 2011
Posts: 10
Default

Hi Artur,

Thank you very much for your help. It worked!
I had seen the other thread and downloaded the GTF from there, but for some reason I was still getting the same error.

Thanks again
Mahtab
Old 09-22-2013, 10:33 AM   #11
mslider
Junior Member
 
Location: france

Join Date: Sep 2010
Posts: 24
Default

--Hi,

I have a similar problem with a GTF file using htseq-count (version 0.5.4p3):

samtools view BNV13.sorted.bam | htseq-count -m intersection-nonempty -s no - Rattus_norvegicus.gtf
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
525298 GFF lines processed.
Error: 'itertools.chain' object has no attribute 'get_line_number_string'
[Exception type: AttributeError, raised in count.py:201]

First lines of the GTF file:

AABR06112227.1 pseudogene exon 345 455 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000476932";
AABR06112227.1 pseudogene exon 157 342 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000024118";
AABR06112227.1 pseudogene exon 86 154 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "3"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000470172";
AABR06111321.1 miRNA exon 71 156 . + . gene_id "ENSRNOG00000045547"; transcript_id "ENSRNOT00000070977"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000464516";
AABR06111321.1 pseudogene exon 170 424 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000256162";
AABR06111321.1 pseudogene exon 429 434 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000472450";
AABR06111841.1 miRNA exon 87 210 . - . gene_id "ENSRNOG00000046613"; transcript_id "ENSRNOT00000072639"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000503423";
AABR06110665.1 protein_coding exon 343 613 . - . gene_id "ENSRNOG00000048972"; transcript_id "ENSRNOT00000061381"; exon_number "1"; gene_name "H2-

Is there anything I can do?

thank you --
Old 09-25-2013, 06:55 AM   #12
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

It's a problem with your BAM file.

There is a bug in the code that writes the error message which appears only when you read the SAM file from standard input. I'll fix this in the next release. For now, please convert your BAM file to a SAM file, and put the SAM file's name instead of the "-". Then, you should be able to see the actual error message.
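
The workaround can be sketched as two commands driven from Python. This is only an illustration under some assumptions: `samtools` and `htseq-count` are on the PATH, the htseq-count options are simply the ones from the command posted above, and `bam_to_sam_then_count` is a made-up helper name. `samtools view -h -o out.sam in.bam` writes the alignments with their header as SAM text.

```python
import subprocess


def bam_to_sam_then_count(bam_path, sam_path, gtf_path, run=False):
    """Build (and optionally run) the two-step workaround.

    Step 1 converts the BAM file to a SAM file on disk; step 2 passes the
    SAM file name to htseq-count instead of "-" (standard input).
    With run=False the commands are only returned, not executed.
    """
    convert = ["samtools", "view", "-h", "-o", sam_path, bam_path]
    count = ["htseq-count", "-m", "intersection-nonempty", "-s", "no",
             sam_path, gtf_path]
    if run:
        subprocess.check_call(convert)
        subprocess.check_call(count)
    return convert, count
```

Calling it with `run=True` executes both steps in order; htseq-count prints its table (or the real error message) to standard output.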
Old 09-25-2013, 10:19 AM   #13
mslider
Junior Member
 
Location: france

Join Date: Sep 2010
Posts: 24
Default Error with GTF file when using htseq-count

--

My problem is solved.
I fixed it using samtools view -f 0x2 input.bam | htseq-count .....
With the option -f 0x2, all reads that are not properly paired are discarded.
So in this case the problem was not due to the SAM file being read from standard input. This BAM file was produced by TopHat2, so maybe it is a TopHat bug?

Laurent --
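
For reference, `-f 0x2` keeps only records whose FLAG field (second SAM column) has bit 0x2 set, which the SAM specification defines as "each segment properly aligned according to the aligner". The same filter can be sketched in plain Python on SAM text lines. `keep_properly_paired` is an illustrative name, and unlike `samtools view` without `-h`, this version passes header lines through.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned


def keep_properly_paired(sam_lines):
    """Yield header lines and alignment lines with FLAG bit 0x2 set."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no FLAG; pass them through untouched.
            yield line
            continue
        flag = int(line.split("\t", 2)[1])
        if flag & PROPER_PAIR:
            yield line
```

This only mirrors what the samtools filter does. It does not explain why the unpaired records upset htseq-count in the first place.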
Old 01-13-2015, 08:29 AM   #14
jshaik
Junior Member
 
Location: st. louis

Join Date: Jun 2011
Posts: 6
Default

When I had this error, I removed the FASTA sequences from my GFF file (the sequences at the end of the GFF) and it worked!
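
A small sketch of that clean-up, assuming the embedded sequences start either at the `##FASTA` directive (as in the GFF3 specification) or at the first FASTA header line beginning with ">". `strip_fasta_section` is a made-up name.

```python
def strip_fasta_section(lines):
    """Yield GFF lines up to, but not including, the embedded FASTA part.

    Stops at the '##FASTA' directive or at the first line starting
    with '>', whichever comes first.
    """
    for line in lines:
        if line.startswith("##FASTA") or line.startswith(">"):
            break
        yield line
```

Write the yielded lines to a new file and point htseq-count at that copy.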