HTSeq: A Python framework to work with high-throughput sequencing data

yh253 replied

05-19-2010, 07:18 AM
Hi Simon,

I was trying to use htseq-qa to assess the technique quality of my aligned sam file, but I've encountered the following errors. While, when I used the command on my solexa-fastq file, I got the quality plot successfully. My sam file was generated by bwa-0.5.7.

$htseq-qa -t sam q -r 30 s_8.sam
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.6/bin/htseq-qa", line 5, in <module>
pkg_resources.run_script('HTSeq==0.4.3-p4', 'htseq-qa')
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 489, in run_script
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 1207, in run_script
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/HTSeq-0.4.3_p4-py2.6-macosx-10.3-fat.egg/EGG-INFO/scripts/htseq-qa", line 5, in <module>
HTSeq.scripts.qa.main()
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/HTSeq-0.4.3_p4-py2.6-macosx-10.3-fat.egg/HTSeq/scripts/qa.py", line 124, in main
r.add_qual_to_count_array( qual_arr_A )
File "_HTSeq.pyx", line 715, in _HTSeq.SequenceWithQualities.add_qual_to_count_array (src/_HTSeq.c:12251)
File "_HTSeq.pyx", line 734, in _HTSeq.SequenceWithQualities.add_qual_to_count_array (src/_HTSeq.c:12169)
ValueError: Too large quality value encountered.

$ htseq-qa -t solexa-fastq -r 30 s_8_sequence.txt (This time, it works with fastq file).

I am not sure is this a problem with my BWA alignment or with htseq-qa. It would be very much appreciated if you could put some of your input here!

Yuan
Leave a comment:
marcora replied

05-17-2010, 05:24 AM
Dear Simon,

thanx for the clear explanation. Is the lax part of GFF that makes going from GTF to GFF "difficult" sometimes, for example when a piece of software requires GFF with specific "comments" (tophat?).

Thanx again for your time and consideration,

Dado
Leave a comment:
Simon Anders replied

05-17-2010, 04:25 AM
Hi Marcora

Yes, there is a very robust GTF->GFF converter available: Just don't do anything, because every GTF file is a GFF file as well.

GTF is a tightening of the GFF specification. This means: If your file has tab-separated fields with the contents <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments], it is a GFF file. The GFF specs are a bit lax about how certain columns are to be filled. Should the ID in the attributes field be called "ID" or "gene_ID" or "gene"? Which words should be used in the feature column? If you want to have a general format, it is hard to give clear rules, but once you have agreed that you want to describe not any kind of feature, but specifically gene models, you can be more explicit. This is what the GTF specification does: it explains how precisely a GFF file should look like if it is used to describe gene models, and if a GFF file follows these rules, it is called a GTF file.

Specifically for htseq-count: If you want to count reads in genes and have a GTF file, you can use it out of the box. If you want to count reads in some other kind of feature, and your GFF file hence cannot follow the GTF specs, you have to tell htseq-count which feature types it should use and how the field with the ID is named. (By default, it takes the lines with feature type "exon" and looks for the ID in the attribute field "gene_id", which is what makes sense for GTF files.)

I hope that clarifies it.

Simon
Leave a comment:
marcora replied

05-17-2010, 03:18 AM
Thanx a lot Simon.

One more thing. It is unclear from your comments here and from the doc online whether HTseq handles both GTF and GFF interchangeably. I am new to this bioinformatics business, and already all these formats are giving me an headache, expecially when GTF files are easily available but no standard/robust GTF>GFF converter is readily available.

Cheers
Leave a comment:
joro replied

05-17-2010, 12:06 AM
Great - thanks for the quick reply.
Leave a comment:
Simon Anders replied

05-16-2010, 11:14 PM
Hi

Originally posted by marcora View Post

This is the error I am getting:

Code:

Error: invalid literal for int() with base 10: '0.000000' [Exception type: ValueError, raised in __init__.py:200]

I noticed this bug myself just yesterday and fixed it. Please try again with version 0.4.3-p4 and tell me whether this solves the issue.

Cheers
Simon
Leave a comment:
marcora replied

05-16-2010, 03:24 PM
Dear Simon,

thanx for this package.

So far everything works except when I try to use htseq-count using tophat output sam file as input and a refseq gff file that has worked just fine with tophat.

This is the error I am getting:

Code:

Error: invalid literal for int() with base 10: '0.000000' [Exception type: ValueError, raised in __init__.py:200]
Leave a comment:
Simon Anders replied

05-15-2010, 10:56 AM
Originally posted by joro View Post

\I am using htseq-count with the -q option. However I'm still getting warnings telling me that htseq-count encountered a read, which has been aligned to a chromosome that did not appear in the GFF file.

I've just fixed this. It now runs properly quiet with the '-q' option.

Simon
Leave a comment:
joro replied

05-13-2010, 12:43 AM
Hi Simon,

I am using htseq-count with the -q option. However I'm still getting warnings telling me that htseq-count encountered a read, which has been aligned to a chromosome that did not appear in the GFF file. Any ideas on how to resolve this?

The command I'm using is:
htseq-count -q <sam_file> <gff_file>
Leave a comment:
Thomas Doktor replied

04-29-2010, 04:19 PM
Hi Simon

Thanks for the updated source. I did sort my sam file prior to analysis and although most read pairs seem to be in adjacent lines, some reads are lacking a mate. I suspect this is because TopHat does not discard unmated reads. The question is if I should remove these unmated reads or if the script considers them in the read count and merely displays a warning?

As it turns out, my GFF file is lacking the mitochondrial encoded genes.

On another note, the script seems to read the GFF file before checking if the sam file exists and as my GFF file is quite large it takes a couple of minutes for the script to exit when I - as happens - sometimes forget to supply an existing sam file. I think it would be nice to have the script check that the files exist as the first thing and then exit immediately upon error.
Leave a comment:
Simon Anders replied

04-29-2010, 12:24 PM
Hi Thomas

Originally posted by Thomas Doktor View Post

I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
Is this an error in my sam file and if so how can I identify the read in question?

Line 100 in count.py, where the error occured, was supposed to tell you that htseq-count encountered a read, which has been aligned to a chromosome that did not appear in the GFF file, and that the read has hence been skipped. I've corrected the bug, in version 0.4.2-p4, you not get a proper warning, telling you the ID of the offending read and the name of the unknown chromosome. Please try again.

As for the warnings about improper pairs: Have you sorted your SAM file before calling htseq-count? This is necessary to make sure that the read pairs appear in adjacent lines (see man page).

Simon
Leave a comment:
spenthil replied

04-29-2010, 11:38 AM
Originally posted by Thomas Doktor View Post

I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
Is this an error in my sam file and if so how can I identify the read in question?

Could you post the relevant code?
Leave a comment:
Thomas Doktor replied

04-29-2010, 09:30 AM
I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:

Error: 'tuple' object has no attribute 'read'
[Exception type: AttributeError, raised in count.py:100]

Is this an error in my sam file and if so how can I identify the read in question?

Last edited by Thomas Doktor; 04-29-2010, 09:32 AM.
Leave a comment:
fennan replied

04-28-2010, 01:33 AM
Nice work. I will start using it. I'll report my experience!
Leave a comment:
Simon Anders replied

04-26-2010, 04:07 AM
Hi tcezard

thanks for the bug report and the nice code example. I've found and fixed the bug and uploaded a new release path, version now 0.4.2-p3.

If you find further bugs, just send me an e-mail.

Cheers
Simon (anders at embl dot de )
Leave a comment:

Previous 1 2 9 10 11 12 13 template Next

Advancing Precision Medicine for Rare Diseases in Children

by seqadmin

Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
- Channel: Articles
12-16-2024, 07:57 AM
Recent Advances in Sequencing Technologies

by seqadmin

Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

Long-Read Sequencing
Long-read sequencing has seen remarkable advancements,...
- Channel: Articles
12-02-2024, 01:49 PM

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 33 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 48 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 34 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 46 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News