SEQanswers (
-   RNA Sequencing (
-   -   Issue with Htseq-count on BAM files from Tophat2 using Galaxy (

Enriquez 09-05-2017 04:23 AM

Issue with Htseq-count on BAM files from Tophat2 using Galaxy

I'm currently facing troubles using galaxy. I want to compare differentially expressed genes between two treatment groups. I already map my reads on my reference genome (70% remaping) and now I'm trying to obtain the differential expression matrix using Htseq count. (For information, my data are Illumina Hiseq 2500, pair end, 125pb).

I already map my reads on my reference genome thanks to Tophat2 (70%remaping), but when I tried to run Htseq on the Bam files from Htseq send me this error message:

Fatal error: Unknown error occured Error occured when processing GFF file (line 40 of file /opt/galaxy-dist/database/files/002/052/dataset_2052791.dat): Feature DS10_00012179-RA:exon:1059 does not contain a 'gene_id' attribute [Exception type: ValueError, raised in]

I though that maybe it could an issue due to my gff3 file, and I tried to convert it into a gtf file using the GFF to GTF converter. But I obtain the following error message:

Traceback (most recent call last): File "/opt/shed_tools/", line 17, in <module> import GFFParser File "/opt/shed_tools/", line 20, in <module> import as sio ImportError: No module named

I read that it could be because my Bam files were not sorted by the gene id. So, I tried to sort my Bam files using the tool sort from the SAMtool suite, and obtain an error message again:

Tool execution generated the following error message: Error running samtools sort. mv: cannot stat `foo.bam': No such file or directory The tool produced the following additional output: [bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -m INT Set maximum memory per thread; suffix K/M/G recognized [768M] -n Sort by read name -o FILE Write final output to FILE rather than standard output -T PREFIX Write temporary files to PREFIX.nnnn.bam -@, --threads INT Set number of sorting and compression threads [1] --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null]

I do not understand why I received as much error messages. Does anyone face up a similar issue? Or knows where this problems come from?

Thank you in advance

neavemj 09-17-2017 08:21 PM

Hi Enriquez,

You should not need to sort the bam file, so I don't think that's the problem.

Do you know if your gff file contains the 'gene_id' attribute? You can open the file in a text editor and check that this is listed. Otherwise, you can change the gene id variable using '--idattr'. This option should also be available in galaxy.

I think converting your file from gff3 to gtf is also a pretty good idea. I think I've done this in the past and it worked. The error you are getting suggests that the python library 'scipy' is not installed in your galaxy configuration. Perhaps you can get the system administrators to install it for you?



Enriquez 10-04-2017 04:36 AM

Hello ,
Sorry for this long time withtout answer.
So i set up Htseq count with the term "transcript_id" instead of gene_id and I convert my gff3 into gtf using gffread. Htseq has finished his job without any error report, but the output matrix contains only 0 :
Geneid TopHat on data 69 data 16 and data 15: accepted_hits
DS10_00000001 0
DS10_00000002 0
DS10_00000003 0
DS10_00000004 0
DS10_00000005 0
DS10_00000006 0
DS10_00000007 0
DS10_00000008 0
DS10_00000009 0

Did you face up a similar problem?
Thanks a lot for your help,

neavemj 10-04-2017 02:49 PM

Hi Thomas, nice work figuring out those other problems!

I'm not sure why HTSeq is outputting only 0s. You might just have to look very carefully at all of the input files. Make sure that gtf is in the right format, and nothing weird has happened in the conversion. Also make sure that the names in your reference genome (which you used to make the bam files) match the names in your new gff.

Also, by default in HTseq, multi-mapping reads will not be counted. Perhaps by using 'transcript_id' this produces what appears to be multi-mapping reads across the same gene?

These are my guesses anyway!

Good luck,


All times are GMT -8. The time now is 01:11 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.