SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and dan Literature Watch 1 11-09-2011 05:18 AM
ChIP-Seq: Systematic bias in high-throughput sequencing data and its correction by BE Newsbot! Literature Watch 0 06-08-2011 03:50 AM
ChIP-Seq: Data structures and compression algorithms for high-throughput sequencing t Newsbot! Literature Watch 0 10-16-2010 03:00 AM
ChIP-Seq: Savant: Genome Browser for High Throughput Sequencing Data. Newsbot! Literature Watch 0 06-22-2010 03:00 AM
Dave - interested in trial data sets and high-throughput sequencing tutorials Dave S. Introductions 4 02-02-2010 11:04 AM

Reply
 
Thread Tools
Old 04-22-2010, 06:10 AM   #1
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default HTSeq: A Python framework to work with high-throughput sequencing data

Hi

I would like to advertise the release of HTSeq, a Python framework to process and analyse high-throughput sequencing (HTS) data.

With the many short-read aligners available now, HTS analysis seems simple. In practice, however, data often needs to be converted, tweaked, filtered, or otherwise pre-processed before they can be given to the aligner, and the results require similar processing to do the statistical analysis one needs.

HTSeq is meant to render such tasks easy and convenient, and so act as a "glue" between aligners and other existing tools.

Some examples of typical use cases for HTSeq:
  • Quality assessment of reads: Check the dependence of the proportions of base calls and quality scores on the position in the reads, stratify by alignment status.
  • Counting: How many reads fall onto each exon, or each gene? For such tasks, you may want to design and implement rules on how to deal with overlapping features or ambiguous assignments.
  • Calculating coverage: HTSeq helps you not only to produce a Wiggle file for visualization in a genome browser, but also to do customized statistics on this.
  • Multiple alignments: Many aligners can output multiple alignments for each read, but what to do with this? HTSeq makes it easy to implement post-processing to choose the right alignment according to your criteria.
  • Adapter trimming: In miRNA-Seq, you often sequence into the adapter at the other end and need to cut this off before aligning. In multiplexed sequencing, you may need to cut off and sort by the mutiplex tag.

Have a look and give it a try: http://www-huber.embl.de/users/anders/HTSeq/

To use HTSeq you only need a basic understanding of Python, as can be obtained by reading the first few chapters of a Python book.
For users without programming knowledge, stand-alone scripts for common tasks are provided: htseq-count to count the overlap of reads with features (such as exons), htseq-qa to get a quick overview of the quality of your sequencing run, and htseq-bedgraph (coming soon) to convert an alignment file into a Bedgraph Wiggle file for visualization with a genome browser.

For programmers, HTSeq has been designed to keep thing simple:
  • All classes have extensive reference documentation, and a tutorial demonstrates their use.
  • All supported file formats (Fasta, Fastq, SAM, SolexaPipeline files, GFF, GTF, etc.) can be read in a loop, providing an object describing one record at a time to the loop body. This object describes the data in a convenient and consistent way.
  • The 'GenomicArray' class is the Swiss army knife of HTSeq. It is a container that can efficiently store anything that has a position on the genome: integer number to represent coverage, objects with feature data to represent exons, sets of objects to handle overlapping features, etc.

Please let me know what you think of it.

Cheers
Simon
Simon Anders is offline   Reply With Quote
Old 04-22-2010, 07:10 AM   #2
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Nice one, looks interesting! I like the look of the quality plots.

Do you have any plans to try and integrate with the Biopython project? There's some obvious overlap that I can see.
nickloman is offline   Reply With Quote
Old 04-22-2010, 07:55 AM   #3
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

I'm trying to run htseq-qa, but get the following error:
Quote:
Traceback (most recent call last):
File "/usr/local/bin/htseq-qa", line 5, in <module>
pkg_resources.run_script('HTSeq==0.4.2-p1', 'htseq-qa')
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
if dist.key not in keys2:
File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script

File "/usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/EGG-INFO/scripts/htseq-qa", line 3, in <module>
import HTSeq.scripts.qa
File "/usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 8, in <module>
from _HTSeq import *
ImportError: /usr/local/lib/python2.6/dist-packages/HTSeq-0.4.2_p1-py2.6-linux-x86_64.egg/HTSeq/_HTSeq.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
I've installed the dependencies and I'm running Python 2.6.4 - any idea what I could be doing wrong?
Thomas Doktor is offline   Reply With Quote
Old 04-22-2010, 08:00 AM   #4
dawe
Senior Member
 
Location: 4530'25.22"N / 915'53.00"E

Join Date: Apr 2009
Posts: 258
Default

Quote:
Originally Posted by Simon Anders View Post
Hi

I would like to advertise the release of HTSeq, a Python framework to process and analyse high-throughput sequencing (HTS) data.
Nice one! Trying it ASAP!

d
dawe is offline   Reply With Quote
Old 04-22-2010, 08:16 AM   #5
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

The error I encountered was resolved by compiling the package from source, thanks for the help and the package Simon.
Thomas Doktor is offline   Reply With Quote
Old 04-22-2010, 10:31 AM   #6
spenthil
Member
 
Location: San Francisco

Join Date: Sep 2009
Posts: 44
Default

Perfect!

Side note - pip is an awesome easy_install replacement that works great with virtualenv. To use it to install: `pip install HTseq`
__________________
--
Senthil Palanisami
spenthil is offline   Reply With Quote
Old 04-22-2010, 01:54 PM   #7
lvaruzza
Junior Member
 
Location: Brazil

Join Date: Feb 2008
Posts: 8
Default

Does this package supports reads in SOLiD Color Space?
lvaruzza is offline   Reply With Quote
Old 04-22-2010, 11:38 PM   #8
Siva
Member
 
Location: Ames, IA

Join Date: Nov 2008
Posts: 57
Default

Hi Simon
In HTSeq count is it ok to use Bowtie SAM output against the corresponding Cufflinks generated GTF file? I get an error message regarding the Cufflinks GTF file saying CUFF1.0 doesn't contain gene_id attribute. I feel I must supply a standard GFF/GTF annotation file from the database. I am not sure a comprehensive GTF file exists for maize.
Siva is offline   Reply With Quote
Old 04-23-2010, 02:31 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

Hi Siva

Quote:
Originally Posted by Siva View Post
In HTSeq count is it ok to use Bowtie SAM output against the corresponding Cufflinks generated GTF file? I get an error message regarding the Cufflinks GTF file saying CUFF1.0 doesn't contain gene_id attribute.
Thanks for the report, I've just investigated and corrected. The real issue is that in the cufflinks GTF file, the genes have no strand information, and hence, htseq-count has to be called with the '--stranded=no' information.

Sorry for the misleading error message. I've just uploaded a fix, which will now display the more helpful message Feature CUFF.1 at chr1:[1047,1108)/. does not have strand information but you are running htseq-count in stranded mode. Use '--stranded=no'..

Cheers
Simon
Simon Anders is offline   Reply With Quote
Old 04-23-2010, 02:37 AM   #10
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

Hi

Quote:
Originally Posted by lvaruzza View Post
Does this package supports reads in SOLiD Color Space?
Yes and no. ;-)

There are no specific facilities for color space yet, but HTSeq should nevertheless be helpful to work with SOLiD data. For example, you can use the FastaReader and FastqReader classes can also be used to read in color-space files, and the GFF_Reader class can deal with the GFF files output by the WTAP aligner.

I don't have much experience with SOLiD data, but I'd be interested to collaborate with a SOLiD-using bioinformatician to fill in the gaps in order to make HTSeq really useful for colour-space data. I think not much is missing for this.

Cheers
Simon
Simon Anders is offline   Reply With Quote
Old 04-25-2010, 07:59 PM   #11
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 116
Default

Cool. Thanks. Will be back with questions.
Melissa is offline   Reply With Quote
Old 04-26-2010, 04:39 AM   #12
tcezard
Member
 
Location: Edinburgh

Join Date: Dec 2008
Posts: 13
Default

I believe I found a bug or I did not understand the documentation.
maybe there is a better place for bug report but I couldn't find a mailing list of bug tracker
My uderstanding of the Sequence trimming is ti looks for a match (allowing mismatches) between the leftpart of the read and the right part of the adapter. but it seems to fail finding a match as long as the adapter.

Here are a few command that reproduce the issue:
Code:
>>> from HTSeq import Sequence
>>> read=Sequence('ACACGTTCGATATCCCGTATGCAACGGACCCGGCAGGAAACCGGCTGTGGG')
>>> adapter1=Sequence('ACACGT')
>>> adapter2=Sequence('AACACGT')
>>> print read.seq.startswith(adapter1.seq)
True
>>> print read.seq.startswith(adapter2.seq)
False
>>> print read.trim_left_end(adapter1)
ACACGTTCGATATCCCGTATGCAACGGACCCGGCAGGAAACCGGCTGTGGG
>>> print read.trim_left_end(adapter2)
TCGATATCCCGTATGCAACGGACCCGGCAGGAAACCGGCTGTGGG

Last edited by tcezard; 04-26-2010 at 04:41 AM. Reason: add code output
tcezard is offline   Reply With Quote
Old 04-26-2010, 05:07 AM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

Hi tcezard

thanks for the bug report and the nice code example. I've found and fixed the bug and uploaded a new release path, version now 0.4.2-p3.

If you find further bugs, just send me an e-mail.

Cheers
Simon (anders at embl dot de )
Simon Anders is offline   Reply With Quote
Old 04-28-2010, 02:33 AM   #14
fennan
Member
 
Location: Germany

Join Date: Apr 2010
Posts: 19
Default

Nice work. I will start using it. I'll report my experience!
fennan is offline   Reply With Quote
Old 04-29-2010, 10:30 AM   #15
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
Quote:
Error: 'tuple' object has no attribute 'read'
[Exception type: AttributeError, raised in count.py:100]
Is this an error in my sam file and if so how can I identify the read in question?

Last edited by Thomas Doktor; 04-29-2010 at 10:32 AM.
Thomas Doktor is offline   Reply With Quote
Old 04-29-2010, 12:38 PM   #16
spenthil
Member
 
Location: San Francisco

Join Date: Sep 2009
Posts: 44
Default

Quote:
Originally Posted by Thomas Doktor View Post
I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
Is this an error in my sam file and if so how can I identify the read in question?
Could you post the relevant code?
__________________
--
Senthil Palanisami
spenthil is offline   Reply With Quote
Old 04-29-2010, 01:24 PM   #17
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

Hi Thomas

Quote:
Originally Posted by Thomas Doktor View Post
I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
Is this an error in my sam file and if so how can I identify the read in question?
Line 100 in count.py, where the error occured, was supposed to tell you that htseq-count encountered a read, which has been aligned to a chromosome that did not appear in the GFF file, and that the read has hence been skipped. I've corrected the bug, in version 0.4.2-p4, you not get a proper warning, telling you the ID of the offending read and the name of the unknown chromosome. Please try again.

As for the warnings about improper pairs: Have you sorted your SAM file before calling htseq-count? This is necessary to make sure that the read pairs appear in adjacent lines (see man page).

Simon
Simon Anders is offline   Reply With Quote
Old 04-29-2010, 05:19 PM   #18
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

Hi Simon

Thanks for the updated source. I did sort my sam file prior to analysis and although most read pairs seem to be in adjacent lines, some reads are lacking a mate. I suspect this is because TopHat does not discard unmated reads. The question is if I should remove these unmated reads or if the script considers them in the read count and merely displays a warning?

As it turns out, my GFF file is lacking the mitochondrial encoded genes.

On another note, the script seems to read the GFF file before checking if the sam file exists and as my GFF file is quite large it takes a couple of minutes for the script to exit when I - as happens - sometimes forget to supply an existing sam file. I think it would be nice to have the script check that the files exist as the first thing and then exit immediately upon error.
Thomas Doktor is offline   Reply With Quote
Old 05-13-2010, 01:43 AM   #19
joro
Member
 
Location: UK

Join Date: Feb 2010
Posts: 28
Default

Hi Simon,

I am using htseq-count with the -q option. However I'm still getting warnings telling me that htseq-count encountered a read, which has been aligned to a chromosome that did not appear in the GFF file. Any ideas on how to resolve this?

The command I'm using is:
htseq-count -q <sam_file> <gff_file>
joro is offline   Reply With Quote
Old 05-15-2010, 11:56 AM   #20
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

Quote:
Originally Posted by joro View Post
\I am using htseq-count with the -q option. However I'm still getting warnings telling me that htseq-count encountered a read, which has been aligned to a chromosome that did not appear in the GFF file.
I've just fixed this. It now runs properly quiet with the '-q' option.

Simon
Simon Anders is offline   Reply With Quote
Reply

Tags
htseq, python

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:32 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO