  • Error with GTF file when using htseq-count

    Hi,

    Just finished installing HTSeq on Mac OS X with Python 2.6.6 and the latest version of NumPy.

    I can execute the first few commands of the HTSeq tour using the yeast example sequence file, so the install seems to be working.

    I invoked the htseq-count script using the following:
    >python -m HTSeq.scripts.count 45minCt_1.sam cneoh99.gtf

    and I get the following error:
    Error occured in line 1 of file cneoh99.gtf.
    Error: The attribute string seems to contain mismatched quotes.
    [Exception type: ValueError, raised in __init__.py:167]

    The first few lines of my GTF file look like this:
    Chr1 CNA2_FINAL_CALLGENES_1 start_codon 11499 11501 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
    Chr1 CNA2_FINAL_CALLGENES_1 stop_codon 11060 11062 . - 0 "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"
    Chr1 CNA2_FINAL_CALLGENES_1 exon 11430 11501 . - . "gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"

    I've attached an excerpt of the file.
    Do I need headers in this file?

    Thanks for any help.

    Regards,
    Maureen

  • #2
    Well, there obviously are mismatched quotes in your attribute strings. In a proper GTF file, the first line should look like this:

    Code:
    Chr1   CNA2_FINAL_CALLGENES_1   start_codon   11499   11501   .   -   0   gene_id "CNAG_00001"; transcript_id "CNAG_00001T0"
    All these extra quotes make little sense and are confusing to HTSeq. It actually looks a bit as if you loaded the file with a spreadsheet program and saved it again. Doing something like this might introduce extra quotes.

    Where did you get the GTF file from?
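If the spreadsheet theory is right, the damage follows the usual CSV quoting rule: the whole attribute column is wrapped in quotes and every inner quote is doubled. A minimal sketch of undoing that, assuming Python 3, a tab-delimited file, and a hypothetical helper name `unquote_gtf` (none of this is part of HTSeq):

```python
import csv
import io


def unquote_gtf(text):
    """Undo CSV-style quoting that a spreadsheet export added to GTF text.

    Assumption: columns are separated by tabs and only the quoting is wrong.
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quotechar='"',
        doublequote=True,  # a doubled "" inside a quoted field means one "
    )
    # Re-join each parsed row with tabs and skip empty rows.
    return "".join("\t".join(fields) + "\n" for fields in reader if fields)


damaged = (
    "Chr1\tCNA2_FINAL_CALLGENES_1\tstart_codon\t11499\t11501\t.\t-\t0\t"
    '"gene_id ""CNAG_00001""; transcript_id ""CNAG_00001T0"";"\n'
)
print(unquote_gtf(damaged), end="")
```

Going back to the original, unedited download is still the safer fix if it is available.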


    • #3
      Same problem, different GTF

      Hi Simon,

      I was wondering if you could possibly help me with my problem. I downloaded the Arabidopsis thaliana Ensembl GTF from plants.ensembl.org. Here's a sample:

      1 protein_coding CDS 30424421 30424675 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1"; protein_id "AT1G80990.1";
      1 protein_coding start_codon 30424421 30424423 . + 0 gene_id "AT1G80990"; transcript_id "AT1G80990.1"; exon_number "1"; gene_name "AT1G80990"; transcript_name "AT1G80990.1";
      When I try to run HTSeq, it gives me the same error as above:

      Traceback (most recent call last):
      File "python_scripts/sam_to_gene_array_2.py", line 80, in <module>
      main()
      File "python_scripts/sam_to_gene_array_2.py", line 41, in main
      for feature in gtf:
      File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
      ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
      File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
      raise ValueError, "The attribute string seems to contain mismatched quotes."
      ValueError: The attribute string seems to contain mismatched quotes.
      Any ideas why this could be happening? Thank you in advance, and thank you for all your help in the past.

      Best Regards,
      Artur Jaroszewicz


      • #4
        If you download the GTF from the iGenomes, it should work:


        • #5
          Still getting the same error:
          Traceback (most recent call last):
          File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 80, in <module>
          main()
          File "/u/home/mcdb/arturj/python_scripts/sam_to_gene_array_2.py", line 41, in main
          for feature in gtf:
          File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
          ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
          File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
          raise ValueError, "The attribute string seems to contain mismatched quotes."
          ValueError: The attribute string seems to contain mismatched quotes.
          Any other suggestions?


          • #6
            I have the same problem with Arabidopsis RNA-Seq data in Galaxy, and I have tried different GTF files from Ensembl and arabidopsis.org.

            Any ideas?


            Thanks


            • #7
              Hi Mahtab,

              Yes, I actually solved the problem. I thought I had posted the solution, but evidently not; I guess that was in another thread that I started. Anyway, there are maybe 100 lines or so that have semicolons in the gene ID within the attribute field, so I wrote a quick script to take care of it. If you'd like to use my modified GTF, you can download it at:
              http://pellegrini.mcdb.ucla.edu/Artu...10.ensembl.gtf

              Good luck in your analysis!

              Artur
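The script itself was not posted, so what follows is only a guess at the idea: a minimal sketch, assuming Python 3, tab-separated columns, and balanced quotes. It replaces any semicolon found inside a double-quoted attribute value. The helper name `clean_attributes`, the `_` replacement character, and the example IDs are all made up for illustration.

```python
import re

# Matches one double-quoted attribute value, e.g. "AT1G80990".
QUOTED_VALUE = re.compile(r'"[^"]*"')


def clean_attributes(line, replacement="_"):
    """Replace semicolons that sit inside quoted attribute values.

    Semicolons between attributes are left alone. Comment lines and lines
    that do not have exactly nine columns are returned unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9:
        return line
    fields[8] = QUOTED_VALUE.sub(
        lambda match: match.group(0).replace(";", replacement), fields[8]
    )
    return "\t".join(fields) + "\n"


# Made-up line with a semicolon inside the gene_id value.
bad = (
    "1\tprotein_coding\tCDS\t100\t200\t.\t+\t0\t"
    'gene_id "AT1G00000;x"; transcript_id "AT1G00000.1";\n'
)
print(clean_attributes(bad), end="")
```

Whether replacing or dropping the semicolon is appropriate depends on how the IDs are used downstream, so check a few of the affected lines by hand first.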


              • #8
                Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

                I can't say I've spent a great deal of time actually LOOKING at GTF files (although I have spent a great deal of time struggling with getting programs to accept them), but since every data source's GTF format seems to be (eventually) convertible into any type of input, it should be doable, right?


                • #9
                  Originally posted by jparsons
                  Is it possible to come up with a reasonable standard for the gtf format so programs that only expect one very specific format only get that one specific kind of format, instead of making us spend so much time reformatting files to fit square, triangular, or round pegs into uniquely-shaped holes?

                  I can't say I've spent a great deal of time actually LOOKING at GTF files (although I have spent a great deal of time struggling with getting programs to accept them), but since every data source's GTF format seems to be (eventually) convertible into any type of input, it should be doable, right?
                  There is a standard defined for GTF files. The problem isn't the standard; it's that people create files that do not conform to it, e.g. by including a semicolon in a gene_id.


                  • #10
                    Hi Artur,

                    Thank you very much for your help. It worked!
                    I had seen the other thread and downloaded the GTF from there, but for some reason I was still getting the same error.

                    Thanks again
                    Mahtab


                    • #11
                      --Hi,

                      I have a similar problem with a GTF file using htseq-count (version 0.5.4p3):

                      samtools view BNV13.sorted.bam | htseq-count -m intersection-nonempty -s no - Rattus_norvegicus.gtf
                      100000 GFF lines processed.
                      200000 GFF lines processed.
                      300000 GFF lines processed.
                      400000 GFF lines processed.
                      500000 GFF lines processed.
                      525298 GFF lines processed.
                      Error: 'itertools.chain' object has no attribute 'get_line_number_string'
                      [Exception type: AttributeError, raised in count.py:201]

                      first lines of gtf file:

                      AABR06112227.1 pseudogene exon 345 455 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000476932";
                      AABR06112227.1 pseudogene exon 157 342 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000024118";
                      AABR06112227.1 pseudogene exon 86 154 . - . gene_id "ENSRNOG00000002531"; transcript_id "ENSRNOT00000003418"; exon_number "3"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000470172";
                      AABR06111321.1 miRNA exon 71 156 . + . gene_id "ENSRNOG00000045547"; transcript_id "ENSRNOT00000070977"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000464516";
                      AABR06111321.1 pseudogene exon 170 424 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "1"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000256162";
                      AABR06111321.1 pseudogene exon 429 434 . + . gene_id "ENSRNOG00000047372"; transcript_id "ENSRNOT00000071624"; exon_number "2"; gene_biotype "pseudogene"; exon_id "ENSRNOE00000472450";
                      AABR06111841.1 miRNA exon 87 210 . - . gene_id "ENSRNOG00000046613"; transcript_id "ENSRNOT00000072639"; exon_number "1"; gene_biotype "miRNA"; exon_id "ENSRNOE00000503423";
                      AABR06110665.1 protein_coding exon 343 613 . - . gene_id "ENSRNOG00000048972"; transcript_id "ENSRNOT00000061381"; exon_number "1"; gene_name "H2-

                      Is there anything I can do?

                      thank you --


                      • #12
                        It's a problem with your BAM file.

                        There is a bug in the code that writes the error message which appears only when you read the SAM file from standard input. I'll fix this in the next release. For now, please convert your BAM file to a SAM file, and put the SAM file's name instead of the "-". Then, you should be able to see the actual error message.
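The workaround described here (convert the BAM to SAM, then pass the SAM file name instead of "-") could be scripted roughly as follows. This is only a sketch, assuming Python 3 with `samtools` and `htseq-count` on the PATH. The function names and file paths are placeholders, the `-h` flag (keep the SAM header) is an assumption rather than something requested in the thread, and the htseq-count options are copied from post #11.

```python
import subprocess


def bam_to_sam_command(bam_path, sam_path):
    """Build the samtools call that writes a SAM file (with header) to disk."""
    return ["samtools", "view", "-h", "-o", sam_path, bam_path]


def htseq_count_command(sam_path, gtf_path):
    """Build the htseq-count call with a real file name in place of '-'."""
    return [
        "htseq-count", "-m", "intersection-nonempty", "-s", "no",
        sam_path, gtf_path,
    ]


def run_count(bam_path, sam_path, gtf_path):
    """Run both steps in order and stop at the first failing command."""
    subprocess.run(bam_to_sam_command(bam_path, sam_path), check=True)
    subprocess.run(htseq_count_command(sam_path, gtf_path), check=True)


# Example call (placeholder file names, not executed here):
# run_count("BNV13.sorted.bam", "BNV13.sorted.sam", "Rattus_norvegicus.gtf")
```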


                        • #13
                          Error with GTF file when using htseq-count

                          --

                          My problem is solved.
                          I fixed it using samtools view -f 0x2 input.bam | htseq-count .....
                          With the option -f 0x2, all reads that are not properly paired are discarded.
                          So, in this case, the problem was not due to reading the SAM file from standard input. This BAM file was produced by TopHat2, so maybe it is a TopHat bug?

                          Laurent --
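For reference, `-f 0x2` keeps only records whose FLAG field has the "properly paired" bit (0x2) set. A minimal Python 3 sketch of the same test on SAM text lines; the function name and the example records are made up for illustration, and unlike plain `samtools view` this version passes header lines through:

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned


def keep_properly_paired(sam_lines):
    """Yield header lines and alignments whose FLAG has bit 0x2 set."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header line, kept as-is
            continue
        flag = int(line.split("\t", 2)[1])  # FLAG is the second column
        if flag & PROPER_PAIR:
            yield line


# Made-up records: FLAG 99 has the proper-pair bit set, FLAG 73 does not.
records = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "read1\t99\tchr1\t100\t50\t4M\t=\t200\t104\tACGT\tIIII\n",
    "read2\t73\tchr1\t300\t50\t4M\t=\t300\t0\tACGT\tIIII\n",
]
kept = list(keep_properly_paired(records))
```

Note that this throws away every read that is not part of a proper pair, so the resulting counts only reflect properly paired alignments.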


                          • #14
                            When I had this error, I removed the FASTA sequences from my GFF file (the sequences at the end of the GFF) and it worked!
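A minimal sketch of that cleanup, assuming Python 3 and that the embedded sequences start either at a `##FASTA` directive (the GFF3 convention) or at the first `>` record header; the function name and example lines are hypothetical:

```python
def strip_fasta_section(gff_lines):
    """Yield GFF lines up to, but not including, the embedded FASTA block."""
    for line in gff_lines:
        if line.strip() == "##FASTA" or line.startswith(">"):
            break  # everything from here on is sequence data
        yield line


# Made-up example: one feature line followed by an embedded sequence.
example = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t8\t.\t+\t.\tID=gene1\n",
    "##FASTA\n",
    ">chr1\n",
    "ACGTACGT\n",
]
annotation_only = list(strip_fasta_section(example))

# To clean a real file (placeholder names):
# with open("in.gff") as src, open("out.gff", "w") as dst:
#     dst.writelines(strip_fasta_section(src))
```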
