Seqanswers Leaderboard Ad

**Simon Anders** · 09-29-2010, 05:39 AM

Originally posted by bbl View Post

HTSeq seems to be the right tool I need for my work. I am trying to count how many times each base in an interesting region of exon has been covered by a read. Can 'counting' function in HTSeq do that? If so, how to do it?

'Bbl' send me a private mail with additional information, but I answer here, in case somebody else is interested, too. He has a SAM file, and a BED file with positions of selcted exons. For these exons, he wants to know the coverage vector.

Here my answer to Bbl:

First of all, HTSeq is not a "program" but a framework that allows you to conveniently write your own scripts. Hence, I have to assume that you are a bioinformatician with at least a bit of experience in programming. If you are not, this might be difficult.

If you already know some Python, read the tutorial (Tour) in the HTSeq documentation. If you don't know Python yet but do know Perl or something similar, maybe look first through the first few pages of the Python Tutorial on the Python web site. You don't need to know much for HTSeq.

Given a SAM file (not BAM), you can calculate the coverage in a HTSeq.GenomicArray, as shown in the section "Calculating coverage vectors" of the Tour.

These here are the lines in the Tour that you need to use and adapt:

Code:

import HTSeq

yeast_chroms = [ "2-micron", "MT", "I", "II", "III", "IV", "V", "VI", 
   "VII", "VIII", "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI" ]

alignment_file = HTSeq.SAM_Reader( "yeast_RNASeq_excerpt.sam" )

cvg = HTSeq.GenomicArray( yeast_chroms, stranded=True, typecode='i' )

for alngt in alignment_file:
    if alngt.aligned:
       cvg.add_value( 1, alngt.iv )

Now, you can query the coverage at a selected interval, e.g.:

Code:

iv = HTSeq.GenomicInterval( "II", 120100, 120120, "+" )
print list( cvg[iv] )

If you do this with my example data, you will just get 20 zeroes, because no reads mapped there.

You can now go through your BED file and do this for each interval given there. I assume that your BED file has four columns, for chromosome, start, end, and name of the exon. (If you have another format, you have to change the line below with the 'split' method.)

Code:

for line in open( "mybedfile.bed" ):

   # Split the line into the four fields:
   chrom, start, end, name = line.rstrip().split( "\t" )

   # Make two intervals, one for '+', one for '-':
   ivPlus  = HTSeq.GenomicInterval( chrom, start, end, "+" )
   ivMinus = HTSeq.GenomicInterval( chrom, start, end, "-" )

   # Write out the results:
   print "Exon", name, ":"
   print "+:", list( cvg[ivPlus] )
   print "-:", list( cvg[ivMinus] )
   print

Is this roughly what you wanted?

In any case, I guess your work is not done here; you might want to calculate statistics or make plots from the output. This should be easy wth Python (incl numpy/scipy and matplotlib), too, so it's worth learning it.

Simon

**joro** · 10-08-2010, 02:56 AM

Hi Simon,

Firstly thanks for this tool we're finding it very useful.

I have a couple of questions:

1. Is there a way to find out which version of htseq-count we're using?

2. Should the sum of column 2 in the htseq-count output file (including no_feature and ambiguous) equal the number of non-header lines in the SAM file?

**Simon Anders** · 10-08-2010, 03:12 AM

Originally posted by joro View Post

1. Is there a way to find out which version of htseq-count we're using?

Yes. Look at the end of the output of 'htseq-count -h'.

2. Should the sum of column 2 in the htseq-count output file (including no_feature and ambiguous) equal the number of non-header lines in the SAM file?

Yes, but only if you use the latest version, which offers you two further counters, namely "unmapped" (for reads which are listed without alignment in the SAM file) and "lowqual" (for reads below the alignment quality threshold, in case you set one).

Simon

**joro** · 10-08-2010, 03:20 AM

Thanks for your quick reply.

I must be missing something because this is the output of 'htseq-count -h':

Code:

$ htseq-count -h
Usage: htseq-count [options] sam_file gff_file

This script takes an alignment file in SAM format and a feature file in GFF
format and calculates for each feature the number of reads mapping to it. See
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.

Options:
  -h, --help            show this help message and exit
  -m MODE, --mode=MODE  mode to handle reads overlapping more than one
                        feature(choices: union, intersection-strict,
                        intersection-nonempty; default: union)
  -t FEATURETYPE, --type=FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
  -i IDATTR, --idattr=IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
  -s STRANDED, --stranded=STRANDED
                        whether the data is from a strand-specific assay
                        (default: yes)
  -q, --quiet           suppress progress report

Written by Simon Anders ([email protected]), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework.

What is the best way to get the latest version?

**Simon Anders** · 10-08-2010, 03:49 AM

Hi joro,

this must be quite an old version. I've added the version information to the last sentence of the output of 'htseq-count -h' a few months ago.

Plese follow these instruction to get the newest version: http://www-huber.embl.de/users/ander...c/install.html

Avoid using easy_install. It tends to get outdated versions.

Simon

**joro** · 10-08-2010, 05:40 AM

Thanks, I've got the latest version now and 'htseq-count -h' provides the version information.

I wonder if you could help me with an error I'm now seeing?

Code:

Error: ('Inconsistent CIGAR operation.', 'line 15641 of file output.sam')

The CIGAR string for this line is

Code:

32M113N18M

Can you see why this is causing a problem?

**Simon Anders** · 10-08-2010, 05:46 AM

Yes, that's a bug I introduced in v0.4.5 and fixed, half an hour ago, in v0.4.5p1.

**joro** · 10-08-2010, 06:16 AM

It's running without errors now thanks. However, the sum of column 2 in the htseq-count output file doesn't equal the number of non-header lines in the SAM file. The difference is around 40,000. Do you know what is causing this difference?

**joro** · 10-14-2010, 02:07 AM

Hi Simon,

In htseq-count, if a read maps to an exon shared by two transcripts of the same gene, what will be the count for the gene. For example Exon2 in ENSECAT00000020941 and Exon1 in ENSECAT00000028962 (see below)?

Code:

15    protein_coding    exon    76927858    76928059    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    exon    76913463    76915467    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    CDS    76913932    76914786    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000017218";
15    protein_coding    start_codon    76914784    76914786    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    stop_codon    76913929    76913931    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    exon    76915073    76915408    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    CDS    76915073    76915408    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
15    protein_coding    start_codon    76915406    76915408    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    exon    76915024    76915071    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    CDS    76915024    76915071    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
15    protein_coding    exon    76915011    76915022    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "3"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    CDS    76915011    76915022    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "3"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
15    protein_coding    exon    76914863    76915009    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "4"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    CDS    76914863    76915009    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "4"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
15    protein_coding    exon    76913932    76914861    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "5"; gene_name "LOC100056640"; transcript_name "LOC100056640";
15    protein_coding    CDS    76913932    76914861    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "5"; gene_name "LOC100056640"; transcript_name "LOC100056640";

**Simon Anders** · 10-14-2010, 03:52 AM

Originally posted by joro View Post

In htseq-count, if a read maps to an exon shared by two transcripts of the same gene, what will be the count for the gene.

One. If a read overlaps with two exons of the same gene, it will be counted once for this gene. If the read overlaps with two exons of two different genes, it will not be counted for either gene but as "ambiguous".

Two exons are deemed to belong to the same gene, if their lines in the GFF file have the same value for the "gene_id" attribute. The purpose of the '--idattr' command line option is to tell htseq-count to look at another attribute than 'gene_id' for this purpose. The transcript ID is hence ignored unless you tell htseq-count otherwise.

Simon

**joro** · 10-14-2010, 04:02 AM

Thanks, that makes sense.

What about if you used option '--idattr' to look at transcript_id in the example I gave before? i.e. if the read overlaps with two exons of two different transcripts, would it be "ambiguous"?

**Simon Anders** · 10-14-2010, 05:01 AM

Originally posted by joro View Post

What about if you used option '--idattr' to look at transcript_id in the example I gave before? i.e. if the read overlaps with two exons of two different transcripts, would it be "ambiguous"?

I would even be ambiguous if the read overlaps with one exon that appear in two different transcripts and hence appears in two different lines of the GFF files with two transcript IDs.

This means that htseq-count is suitable for gene-level counting only. However, the point of HTSeq is, after all, that you cane easily write your own scripts with a bit of Python knowledge, so if you need to modify the counting logic, you can do so. (Use the Tour in the documentation as starting point.)

Simon

**joro** · 10-14-2010, 05:09 AM

Ok thanks Simon. I'll have a go at changing the counting logic if I need to at any point! I think I only need gene-level counting for now.

**joro** · 10-14-2010, 06:26 AM

Originally posted by joro View Post

It's running without errors now thanks. However, the sum of column 2 in the htseq-count output file doesn't equal the number of non-header lines in the SAM file. The difference is around 40,000. Do you know what is causing this difference?

Thanks for offering to look through my input files but I've resolved this now. There were reads mapping to sequences that weren't defined in the gtf. My mistake!

**agali** · 10-15-2010, 08:12 AM

Hi,

I have seen a post with the same issue before, but I didn't see any replay.. I am working with SOLiD data, aligned with SHRiMP to the reference mRNA library.
I am getting this error message with htseq-count:

Error occured in line 16241 of file F_mrna.sam.
Error: ("'seq' and 'qualstr' do not have the same length.", 'line 16241 of file F_mrna.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:626]

The error applies to the whole SAM file (every line will give this error).
The example line from SAM file:

2_512_865_F3 16 Esi0595_0002 conserved unknown protein [1335] f:2354-3688 613 255 3H47M * 0 0 * AS:i:347

Do you have any solutions?

Thanks.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News