SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and dan Literature Watch 1 11-09-2011 04:18 AM
ChIP-Seq: Systematic bias in high-throughput sequencing data and its correction by BE Newsbot! Literature Watch 0 06-08-2011 02:50 AM
ChIP-Seq: Data structures and compression algorithms for high-throughput sequencing t Newsbot! Literature Watch 0 10-16-2010 02:00 AM
ChIP-Seq: Savant: Genome Browser for High Throughput Sequencing Data. Newsbot! Literature Watch 0 06-22-2010 02:00 AM
Dave - interested in trial data sets and high-throughput sequencing tutorials Dave S. Introductions 4 02-02-2010 10:04 AM

Reply
 
Thread Tools
Old 07-15-2010, 07:33 AM   #41
townway
Member
 
Location: Rockville

Join Date: May 2009
Posts: 40
Default

I check the website of HTSeq
http://pypi.python.org/pypi/HTSeq

There is not binary version available, would you give a link to download ?

Thank you

Wei
townway is offline   Reply With Quote
Old 07-16-2010, 03:24 AM   #42
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by townway View Post
There is not binary version available, would you give a link to download ?
For which operating system?
Simon Anders is offline   Reply With Quote
Old 07-16-2010, 08:13 AM   #43
townway
Member
 
Location: Rockville

Join Date: May 2009
Posts: 40
Default

Linux 2.6.18-194.8.1.el5 #1 SMP Wed Jun 23 10:52:51 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux


Thank you
townway is offline   Reply With Quote
Old 07-16-2010, 08:14 PM   #44
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

Apologies if this is a series of dumb questions.

I see HTSeq can read SAM files & assume it can also read BAM files. Can it retrieve BAM files from a remote (FTP or HTTP) location as samtools can? Is it using the samtools code or have you reimplemented this in Python?

thanks

Keith R.
krobison is offline   Reply With Quote
Old 07-17-2010, 03:33 PM   #45
Wei-HD
Member
 
Location: Germany

Join Date: Oct 2009
Posts: 59
Default

Hi Simon,

I have a question about qval in DESeq, when I do not have biological replicate and then analyze two samples? How does DESeq calculate the pval. Sorry I have read your paper, but the formula is very complicated for me? Can you explain this to me?

Thanks a lot!
Wei
Wei-HD is offline   Reply With Quote
Old 07-20-2010, 09:17 AM   #46
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Keith

Quote:
Originally Posted by krobison View Post
I see HTSeq can read SAM files & assume it can also read BAM files. Can it retrieve BAM files from a remote (FTP or HTTP) location as samtools can? Is it using the samtools code or have you reimplemented this in Python?
At the moment, HTSeq can natively only work with SAM files. Adding BAM support is on my to-do list, and of course, I would do it by simply wrapping the samtools.

For now, however, you can simply call the 'samtools' binary from Python and pipe its output to HTSeq's SAM_Reader class. The following function does that for you:

Code:
import subprocess
import HTSeq

def BAM_Region_Reader( bamfile, region=None, samtools_exec="samtools" ):

   """Get Reader object for part of a BAM file
   
   bamfile -- a string, either a file name or an URL to a BAM file
   region -- a GenomicInterval
   samtools_exec -- file name (with path, if necessary) 
      of the 'samtools' executable
   
   Returns a SAM_Reader object.
   """

   cmd_line = ( samtools_exec , "view", bamfile )
   
   if region is not None:
      cmd_line += ( "%s:%d-%d" % 
         ( region.chrom, region.start, region.end ), )

   samtools_view = subprocess.Popen( cmd_line, 
      stdout = subprocess.PIPE )

   return HTSeq.SAM_Reader( samtools_view.stdout )
With this, you can now do, e.g.:
Code:
bamfile = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00155/alignment/HG00155.chrom12.ILLUMINA.bwa.GBR.low_coverage.20100517.bam"
region = HTSeq.GenomicInterval( "12", 90000, 91000, "." )

for alignment in BAM_Region_Reader( bamfile, region ):
   print alignment
Maybe not the optimum in performance but should do the job.

Cheers
Simon
Simon Anders is offline   Reply With Quote
Old 07-20-2010, 09:21 AM   #47
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi

Quote:
Originally Posted by townway View Post
Linux 2.6.18-194.8.1.el5 #1 SMP Wed Jun 23 10:52:51 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
For Linux, please use the source version, this makes it easier for both of us. Please see here for instructions. (Do not use easy_install, rather, download the source package and install it as described.)

Simon
Simon Anders is offline   Reply With Quote
Old 07-21-2010, 03:39 PM   #48
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

Hi Simon,

Many thanks for DESeq and HTSeq. I have been using BedTools to get the count data for DESeq analyses but would like to try HTSeq as well. However, after building it from source, I get the following error:
Code:
[sensh@saturn ~]$ python
Python 2.6 (r26:66714, Oct  2 2008, 16:03:09) 
[GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> htseq-count --help
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'htseq' is not defined
It seems to me that I am missing something at a fundamental level. Any help from your part would be much appreciated.

Thanks,

Shurjo
shurjo is offline   Reply With Quote
Old 07-22-2010, 01:43 AM   #49
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Shurjo

htseq-count is a command-line tool. You start it from your shell, without first calling python.

Simon
Simon Anders is offline   Reply With Quote
Old 07-25-2010, 03:50 PM   #50
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

Thanks, Simon. I guess that is about as fundamental as an error can get on my part.
shurjo is offline   Reply With Quote
Old 07-27-2010, 03:52 PM   #51
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

Hi Simon,

I am getting the error below when running HTSeq-count with a SAM file produced by TopHatv1.0.14. The script works fine when tested with a smaller subset (head -1000000) from the same SAM file but fails on the bigger file, which has ~75 million reads. Maybe the last line in the error hints at a malformed SAM file?

Any advice is much appreciated,

Thanks,

Shurjo

Code:
[sensh@saturn htseq_counts]$ python -m HTSeq.scripts.count -s no rep_C_08010264.SE.accepted_hits.sam ~/GTF_files/Homo_sapiens.withchr.NCBI36.50.gtf > rep_C_08010264.SE.htseq.counts
Traceback (most recent call last):
  File "/home/pchines/lib/python2.6/runpy.py", line 121, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/pchines/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/home/sensh/.local/lib/python2.6/site-packages/HTSeq-0.4.4p5-py2.6-linux-x86_64.egg/HTSeq/scripts/count.py", line 202, in <module>
    main()
  File "/home/sensh/.local/lib/python2.6/site-packages/HTSeq-0.4.4p5-py2.6-linux-x86_64.egg/HTSeq/scripts/count.py", line 191, in main
    sys.stderr.write( "Error: %s\n" % "; ".join( sys.exc_info()[1] ) )
TypeError: sequence item 0: expected string, int found
shurjo is offline   Reply With Quote
Old 07-28-2010, 12:44 PM   #52
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi shurjo

Could you please try again with the newest version, 0.4.4p6? The previous version, p5, had a bug in the error reporting which occludes the actual error message. The p6 version should give you a more informative error message, most likely pointing to a malformed line in your SAM file.

Simon
Simon Anders is offline   Reply With Quote
Old 08-16-2010, 03:09 AM   #53
Anna Esteve
Junior Member
 
Location: Spain

Join Date: Jan 2010
Posts: 7
Default

Quote:
Originally Posted by Thomas Doktor View Post
I'm trying to use htseq-count version 0.4.2-p3 on a sam file produced by TopHat and a hg19 Ensembl GTF file. I'm analysing the reads in non-stranded mode and looking for exons in the gene_id features. The script runs for a while and outputs several warnings about reads incorrectly flagged as proper pairs, but then exits with the following error:
Is this an error in my sam file and if so how can I identify the read in question?
I am having the same problem as you, when i run HTseq-count I have warnings about improper mate pairs. Tophat sam output contains improper mate pairs (the mate is unmapped) as well as proper pairs. My question is if HTseq-count discards counts comming from improper pairs, if not how to deal with that?I have a great amount of improper pairs.
thanks.
Anna Esteve is offline   Reply With Quote
Old 08-26-2010, 09:07 AM   #54
haomself
Junior Member
 
Location: tennessee

Join Date: Aug 2009
Posts: 2
Default

Quote:
Originally Posted by Simon Anders View Post
Hi


I don't have much experience with SOLiD data, but I'd be interested to collaborate with a SOLiD-using bioinformatician to fill in the gaps in order to make HTSeq really useful for colour-space data. I think not much is missing for this.

Cheers
Simon
Simon,

I am testing htseq-count on SOLiD data. The sam file was generated using SHRiMP. It reported the following error message:

Error occured in line 1 of file rat1.saet.shrimp.sam.
Error: ("'seq' and 'qualstr' do not have the same length.", 'line 1 of file rat1.saet.shrimp.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:621]

And here is the first few lines of the SAM file.
1241_771_1428_F3 0 chr1 12208 255 12H38M * 0 0 ATTGCCTCAGGTTCCCTCCTCAGTGTGTAGTGTGAGAT * AS:i:732 CS:Z:T11221133111133013022120102002202212111113211112113 XX:Z:ATTGCCTCAGGTTCCCTCCTCAGTGTGTAGTGTGAgaT
25_503_796_F3 0 chr1 12290 255 2H48M * 0 0 TGCTGCTTTGATTGTGTCCAGTGCCAAGAAAATGAGATTGCCAATGAT * AS:i:862 CS:Z:T22013213200123011122012113010220030120330130103033 XX:Z:TGCTGCTTTGATTGTGtCCAGTGCCAAGAAAatGAgaTTGCCAATgaT
25_503_796_F3 0 chr1 14756 255 2H48M * 0 0 TGCTGCTTTGATTGTGTCCAGTGCCAAGAAAATGAGATTGCCAATGAT * AS:i:862 CS:Z:T22013213200123011122012113010220030120330130103033 XX:Z:TGCTGCTTTGATTGTGtCCAGTGCCAAGAAAatGAgaTTGCCAATgaT

Do you have any suggestions on how to get around this?

Thanks.
haomself is offline   Reply With Quote
Old 09-07-2010, 05:57 AM   #55
bbl
Member
 
Location: London

Join Date: Jul 2010
Posts: 16
Default

Hello Simon,

HTSeq seems to be the right tool I need for my work. I am trying to count how many times each base in an interesting region of exon has been covered by a read. Can 'counting' function in HTSeq do that? If so, how to do it?

Thanks
bbl is offline   Reply With Quote
Old 09-16-2010, 03:29 AM   #56
yh253
Member
 
Location: Ireland

Join Date: Jul 2009
Posts: 16
Default

Hi Simon,

I tried to use 'htseq-count' to count reads on UCSC hg19 known genes. My data is single end Illumina reads aligned to UCSC hg19 by bowtie with upto 10 multiple alignment allowed.

Quote:
$head -n 30 test.sam
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571
@PG ID:Bowtie VN:0.12.5 CL:"/bowtie -t -p 4 -v 2 -k 11 -m 10 --strata --best --sam hg19 test.fq test.sam"
HWI-EAS283_0007:1:1:1029:14030#NGCTAT/1 0 chrM 9647 255 30M * 0 0 CTCACCATAGTCTAATAGAAAACANNCGAA CCCCCCCCCACCCCCCCCCC########## XA:i:2 MD:Z:24A0C4 NM:i:2
HWI-EAS283_0007:1:1:1029:14030#NGCTAT/1 0 chr1 570195 255 30M * 0 0 CTCACCATAGTCTAATAGAAAACANNCGAA CCCCCCCCCACCCCCCCCCC########## XA:i:2 MD:Z:24A0C4 NM:i:2
HWI-EAS283_0007:1:1:1029:3318#NGCTAT/1 16 chr6 43491059 255 30M * 0 0 GAACNNACCAAGTTCTCTTCAGCACTTAGC ##########CCC;CCCBCCCCCCC@CCCC XA:i:2 MD:Z:4T0G24 NM:i:2
The GTF file was downloaded from UCSC known gene table:
Quote:
$ head knownGene_hg19.gtf
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene exon 12613 12721 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 12595 12721 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 13403 13636 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene stop_codon 13637 13639 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
I issued the following command to do the count:
Quote:
htseq-count --stranded=no test.sam knownGene_hg19.gtf > test.count
However, only 30% of my reads successfully mapped to known genes, and over 50% of them were taken as 'ambiguous'. This happened to all my samples ( 2 samples with 4 biological replicates for each) by using any of the three modes: union/intersection-strict/intersection-nonempty. Do you think this is a problem with my gtf file or 'htseq-count' doesn't work well with my data? I appreciate any input!

Quote:
$tail test.count
uc011nbv.1 0
uc011nbw.1 0
uc011nbx.1 0
uc011nby.1 0
uc011nbz.1 0
uc011nca.1 0
uc011ncb.1 0
uc011ncc.1 10
no_feature 1886981
ambiguous 3142708
yh253 is offline   Reply With Quote
Old 09-16-2010, 04:16 AM   #57
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi yh253

Your GTF file does look strange. It is important that all exons from the same gene have the same gene_id. However, in your GTF file, the gene_id and the transcript_id seems to be the same. So, if a read maps to an exon shared by two transcripts of the same gene, htseq-count will see two different gene IDs and think that it is two different genes although it is only two different transcripts of the same gene.

It seems that UCSC and Ensembl disagree about what the point of the gene_id attribute in GTF files is. Maybe, try again with the GTF file from Ensembl. (But make sure the coordinates are the same.)

And make sure that strand option is set correctly.

Simon
Simon Anders is offline   Reply With Quote
Old 09-16-2010, 04:18 AM   #58
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi

Quote:
Originally Posted by bbl View Post
HTSeq seems to be the right tool I need for my work. I am trying to count how many times each base in an interesting region of exon has been covered by a read. Can 'counting' function in HTSeq do that? If so, how to do it?
Sure, this should be possible with HTSeq. Do you have some basic knowledge of Python? If so, I can help you if you explain a bit more what exactly you want to do.

Simon
Simon Anders is offline   Reply With Quote
Old 09-20-2010, 06:45 AM   #59
yh253
Member
 
Location: Ireland

Join Date: Jul 2009
Posts: 16
Default

Hi Simon,

Is it true that if a same read appears more than once in the alignment.sam, HTSeq-count will take it as ambiguous? I have an alignment files generated from bowtie which allows upto 10 multiple alignment for a single read (Multi-reads), i.e. a read may have more than one alignment records in the alignment.sam file. I got an impression that HTSeq-count didn't assign it to any gene.

Yuan
yh253 is offline   Reply With Quote
Old 09-20-2010, 08:04 AM   #60
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by yh253 View Post
Is it true that if a same read appears more than once in the alignment.sam, HTSeq-count will take it as ambiguous? I have an alignment files generated from bowtie which allows upto 10 multiple alignment for a single read (Multi-reads), i.e. a read may have more than one alignment records in the alignment.sam file. I got an impression that HTSeq-count didn't assign it to any gene.
HTSeq-count considers each line in the SAM file independently from all the others (except that paired reads are taken together). Hence, if a read appears 10 times in the file, it will be counted 10 times.

In your case, each of the 10 times, the read ended up to be counted as ambiguous, but the reason for this is probably not related to it being a multiply matching read.

S
Simon Anders is offline   Reply With Quote
Reply

Tags
htseq, python

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:16 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO