#1
Member
Location: St. Louis, MO Join Date: May 2010
Posts: 14
Hi,
I just finished installing HTSeq on Mac OS X with Python 2.6.6 and the latest version of NumPy. I can execute the first few commands of the HTSeq tour using the yeast example sequence file, so the install seems to be working.
I invoked the htseq-count script using the following:
Code:
python -m HTSeq.scripts.count 45minCt_1.sam cneoh99.gtf
and I get the following error:
Code:
Error occured in line 1 of file cneoh99.gtf.
Error: The attribute string seems to contain mismatched quotes.
[Exception type: ValueError, raised in __init__.py:167]
The first few lines of my GTF file look like:
Code:
Chr1 CNA2_FINAL_CALLGENES_1 start_codon 11499 11501 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
Chr1 CNA2_FINAL_CALLGENES_1 stop_codon 11060 11062 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
Chr1 CNA2_FINAL_CALLGENES_1 exon 11430 11501 . - . "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
I've attached an excerpt of the file. Do I need headers in this file?
Thanks for any help.
Regards,
Maureen
#2
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
Well, there obviously are mismatched quotes in your attribute strings. In a proper GTF file, the first line should look like this:
Code:
Chr1 CNA2_FINAL_CALLGENES_1 start_codon 11499 11501 . - 0 gene_id "CNAG_00001"; transcript_id "CNAG_00001T0"
Where did you get the GTF file from?
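The attribute column in the question looks as if it was wrapped in an extra pair of quotes with the inner quotes doubled, which is what spreadsheet or CSV exports typically do. A minimal Python sketch that undoes this, assuming the file is tab-separated with nine columns and that the wrapping is the only problem (the function name is hypothetical and not part of HTSeq):

```python
def unwrap_attribute_field(line):
    """Undo CSV-style quoting of the 9th (attribute) column of a GTF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Comment, header, or malformed line: return it unchanged.
        return line
    attr = fields[8]
    # A wrapped field looks like: "gene_id ""X""; transcript_id ""Y"";"
    wrapped = len(attr) >= 2 and attr.startswith('"') and attr.endswith('"')
    if wrapped and '""' in attr:
        # Drop the outer quotes, then collapse each doubled quote to one.
        attr = attr[1:-1].replace('""', '"')
    fields[8] = attr
    return "\t".join(fields) + "\n"
```

Applying the function to every line of the GTF and writing the results to a new file should give attribute strings of the form shown in the reply above. Lines that are already well formed pass through unchanged.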
#3
Member
Location: Los Angeles Join Date: Sep 2011
Posts: 45
Hi Simon,
I was wondering if you could help me with my problem. I downloaded the Arabidopsis thaliana Ensembl GTF from plants.ensembl.org. Here's a sample:
Quote:
Quote:
Best Regards,
Artur Jaroszewicz
#4
Member
Location: Freiburg Join Date: Oct 2012
Posts: 56
If you download the GTF from iGenomes, it should work:
http://tophat.cbcb.umd.edu/igenomes.html
#5
Member
Location: Los Angeles Join Date: Sep 2011
Posts: 45
Still getting the same error:
Quote:
#6
Member
Location: University of Melbourne Join Date: Aug 2011
Posts: 10
I have the same problem with Arabidopsis and RNA-seq in Galaxy, and I have tried different GTF files from Ensembl and arabidopsis.org.
Any ideas? Thanks
#7
Member
Location: Los Angeles Join Date: Sep 2011
Posts: 45
Hi Mahtab,
Yes, I actually solved the problem. I thought I had posted the solution, but evidently not; I guess it was in another thread that I started.
Anyway, there are maybe 100 lines or so that have semicolons in the gene ID of the attribute field, so I wrote a quick script to take care of them. If you'd like to use my modified GTF, you can download it at:
http://pellegrini.mcdb.ucla.edu/Artu...10.ensembl.gtf
Good luck in your analysis!
Artur
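The script itself was not posted, so the following is only a sketch of one way to handle such lines, not Artur's actual code. It assumes the offending semicolons sit inside the double-quoted attribute values and that replacing them with an underscore is acceptable; the function names and the replacement character are hypothetical choices.

```python
import re

# Matches one double-quoted attribute value, e.g. "AT1G01010".
QUOTED_VALUE = re.compile(r'"[^"]*"')


def mask_semicolons_in_values(attr, replacement="_"):
    """Replace semicolons that occur inside quoted values only."""
    return QUOTED_VALUE.sub(
        lambda m: m.group(0).replace(";", replacement), attr
    )


def fix_line(line):
    """Clean the attribute column of one tab-separated GTF line."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9:
        # Comment or malformed line: return it unchanged.
        return line
    fields[8] = mask_semicolons_in_values(fields[8])
    return "\t".join(fields) + "\n"
```

The semicolons that separate attributes lie outside the quotes, so they are left alone. Note that changing an identifier this way means it no longer matches the original annotation, which matters if the counts are later joined to other tables by gene ID.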
#8
Member
Location: SF Bay Area Join Date: Feb 2012
Posts: 62
Is it possible to come up with a reasonable standard for the GTF format, so that programs expecting one very specific format get only that format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely shaped holes?
I can't say I've spent a great deal of time actually LOOKING at GTF files (although I have spent a great deal of time struggling to get programs to accept them), but since every data source's GTF format seems to be (eventually) convertible into any type of input, it should be doable, right?
#9
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
Quote:
#10
Member
Location: University of Melbourne Join Date: Aug 2011
Posts: 10
Hi Artur,
Thank you very much for your help. It worked! I had seen the other thread and downloaded the GTF from there, but for some reason I was still getting the same error.
Thanks again
Mahtab
#11
Junior Member
Location: france Join Date: Sep 2010
Posts: 24
Hi,
I have a similar problem with a GTF file using htseq-count (version 0.5.4p3):
Code:
samtools view BNV13.sorted.bam | htseq-count -m intersection-nonempty -s no - Rattus_norvegicus.gtf
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
525298 GFF lines processed.
Error: 'itertools.chain' object has no attribute 'get_line_number_string'
[Exception type: AttributeError, raised in count.py:201]
First lines of the GTF file:
Code:
AABR06112227.1 pseudogene exon 345 455 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "1"; gene_biotype " pseudogene"; exon_id "ENSRNOE00000476932";
AABR06112227.1 pseudogene exon 157 342 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "2"; gene_biotype " pseudogene"; exon_id "ENSRNOE00000024118";
AABR06112227.1 pseudogene exon 86 154 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "3"; gene_biotype " pseudogene"; exon_id "ENSRNOE00000470172";
AABR06111321.1 miRNA exon 71 156 . + . gene_id "ENSRNOG00000045547"; transcript_id "ENSRNOT00000070977"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000464516";
AABR06111321.1 pseudogene exon 170 424 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "1"; gene_biotype " pseudogene"; exon_id "ENSRNOE00000256162";
AABR06111321.1 pseudogene exon 429 434 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "2"; gene_biotype " pseudogene"; exon_id "ENSRNOE00000472450";
AABR06111841.1 miRNA exon 87 210 . - . gene_id "ENSRNOG00000046613"; transcript_id "ENSRNOT00000072639"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000503423";
AABR06110665.1 protein_coding exon 343 613 . - . gene_id "ENSRNOG00000048972"; transcript_id "ENSRNOT00000061381"; exon_number "1"; gene_name "H2-
Is there anything I can do?
Thank you
#12
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
It's a problem with your BAM file.
There is a bug in the code that writes the error message, and it appears only when you read the SAM file from standard input. I'll fix this in the next release. For now, please convert your BAM file to a SAM file and pass the SAM file's name instead of the "-". You should then be able to see the actual error message.
#13
Junior Member
Location: france Join Date: Sep 2010
Posts: 24
My problem is solved. I fixed it using
Code:
samtools view -f 0x2 input.bam | htseq-count .....
With the option -f 0x2, all reads that are not properly paired are discarded. So in this case the problem was not due to reading the SAM file from standard input. This BAM file was produced by TopHat2; maybe a bug in TopHat?
Laurent
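For readers unfamiliar with the flag, 0x2 is the bit in the SAM FLAG field (the second column) that the aligner sets when it considers a read properly paired. The sketch below only illustrates the bit test that `-f 0x2` applies to each record; the function names are hypothetical, and it is not a replacement for samtools.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit set by the aligner for properly paired reads


def is_properly_paired(sam_line):
    """Return True if the FLAG (2nd column) of a SAM record has bit 0x2 set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & PROPER_PAIR)


def keep_proper_pairs(lines):
    """Yield header lines and properly paired alignment records only."""
    for line in lines:
        if line.startswith("@") or is_properly_paired(line):
            yield line
```

A FLAG of 99 (1 + 2 + 32 + 64) passes the test, while 73 (1 + 8 + 64), a paired read whose mate is unmapped, does not.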
#14
Junior Member
Location: st. louis Join Date: Jun 2011
Posts: 6
When I had this error, I removed the FASTA sequences from my GFF file (the sequences at the end of the GFF) and it worked!
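In GFF3, an embedded sequence section is introduced by a `##FASTA` directive, so one way to drop it is to stop copying lines at that point. A small sketch, with a hypothetical function name; stopping at a bare `>` line as well is an assumption made here for files that omit the directive:

```python
def strip_fasta_section(lines):
    """Yield GFF lines up to, but not including, the embedded FASTA section."""
    for line in lines:
        # "##FASTA" marks the start of the sequence section in GFF3.
        # Assumption: a line starting with ">" also starts sequence data.
        if line.startswith("##FASTA") or line.startswith(">"):
            break
        yield line
```

For example, opening the original file for reading and a new file for writing, then calling `writelines(strip_fasta_section(infile))` on the output handle, leaves only the annotation lines in the new file.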