Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Artur Jaroszewicz
    Member
    • Sep 2011
    • 45

    HTSeq: mismatched quotes issue?

    Hi all,

    I have been using HTSeq and DESeq for my RNA Seq pipeline at UCLA, and everything is great except for one thing: I am having trouble with counting reads with HTSeq for Arabidopsis Thaliana. I have so far used two different GTFs, and they both return the same error:

    Traceback (most recent call last):
    File "python_scripts/sam_to_gene_array_2.py", line 80, in <module>
    main()
    File "python_scripts/sam_to_gene_array_2.py", line 41, in main
    for feature in gtf:
    File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
    File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
    raise ValueError, "The attribute string seems to contain mismatched quotes."
    ValueError: The attribute string seems to contain mismatched quotes.
    I've looked through the gtf, and I haven't been able to find any mismatched quotes or anything out of the ordinary, except for the attribute line, which has slightly different fields (does not have gene_biotype, has p_id and tss_id). Here's a sample of the gtf file I'm using:

    1 protein_coding exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; seqedit "false"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    1 protein_coding CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    1 protein_coding start_codon 3760 3762 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    1 protein_coding CDS 3996 4276 . + 2 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "ANAC001"; p_id "P21642"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    Has anyone run into the same problem? I've used two different GTFs based on someone else's suggestion, one from Ensembl plants and another from iGenomes, and they both have the same error.

    Any input or suggestions would be greatly appreciated. I've posted this on a different thread, but it was a bit of a different issue, so I've started a new one.
  • Jeremy
    Senior Member
    • Nov 2009
    • 190

    #2
    Might help if you list the commands you used for HT-Seq.

    Comment

    • chadn737
      Senior Member
      • Jan 2009
      • 392

      #3
      I would use the gff files from TAIR as those will be the most up to date and original files.

      TAIR10 can be found here, it is the most recent genome version: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/

      TAIR9 is still often used, it is here:ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/

      The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.

      Comment

      • Jeremy
        Senior Member
        • Nov 2009
        • 190

        #4
        I mean what commands
        Code:
        htseq-count [options] <sam_file> <gff_file>
        for example
        Code:
        htseq-count -t gene -i gene_id <sam_file> <gff_file>

        Comment

        • Artur Jaroszewicz
          Member
          • Sep 2011
          • 45

          #5
          Originally posted by chadn737 View Post
          I would use the gff files from TAIR as those will be the most up to date and original files.

          TAIR10 can be found here, it is the most recent genome version: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/

          TAIR9 is still often used, it is here:ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/

          The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.
          Thanks for the input, I'm trying it now. To the previous response, I pretty much followed Anders' example on this:
          #!/usr/bin/python2.7 -tt

          import HTSeq
          import itertools
          import sys

          def main():

          sys.stderr.write("Initializing program.\n")

          #human
          if sys.argv[1] == 'hg19':
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/hg19.ensembl.gtf")
          #mouse
          elif sys.argv[1] == 'mm9':
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/mm9.ensembl.gtf")
          elif sys.argv[1] == 'TAIR10':
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/TAIR10_GFF3_genes.gff")
          else:
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/" + sys.argv[1] + ".ensembl.gtf")
          ####
          '''
          path = sys.argv[2]
          if not path[-1] == '/':
          path += '/'
          '''
          filelist = []
          for file in sys.argv[2:]:
          name_split = file.split('/')
          filelist.append(name_split[-1])
          if len(name_split) > 1:
          path = '/'.join(name_split[:-1]) + '/'
          if not path[0] == '/':
          path = '/' + path
          else:
          path = './'
          samplelist = []
          for entry in filelist:
          samplelist.append(entry[:-18])
          ####

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True)
          for feature in gtf:
          if feature.type == "exon":
          exons[feature.iv] += feature.name

          sys.stderr.write("Created exon array.\n")

          counts = {}
          for feature in gtf:
          if feature.type == "exon":
          counts[feature.name] = [0] * len(filelist)

          sys.stderr.write("Initialized count dict.\n")

          for filenum in range(len(filelist)):
          sam_file = HTSeq.SAM_Reader( path + filelist[filenum] )
          for alnmt in sam_file:
          if alnmt.aligned:
          intersection_set = None
          for iv2, step_set in exons[alnmt.iv].steps():
          if intersection_set is None:
          intersection_set = step_set.copy()
          else:
          intersection_set.intersection_update(step_set)
          if len(intersection_set) == 1:
          counts[list(intersection_set)[0]][filenum] += 1
          sys.stderr.write("Counted hits per gene for " + samplelist[filenum] + ".\n")

          sys.stderr.write("Printing output.\n")

          sys.stdout.write('Gene\t' + '\t'.join(samplelist) + '\n')
          for gene in sorted(counts.keys()):
          sys.stdout.write(gene)
          for filenum in range(len(filelist)):
          sys.stdout.write('\t' + str(counts[gene][filenum]))
          sys.stdout.write('\n')

          sys.stderr.write("Done.\n")

          if __name__ == "__main__":
          main()
          Artur

          Comment

          • Artur Jaroszewicz
            Member
            • Sep 2011
            • 45

            #6
            I seem to have found my problem -- one of the lines in the GTF file has a semicolon in the gene_name: gene_name "PIP1;3"

            I hope this resolves my problem!

            Comment

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              Ah, that might explain it. On the other hand, maybe not: a semicolon is not a quote. However, I have a dim recollection that I have once seen a prime (a.k.a. single quote: ' ) in an Arabidopsis gene name. So, maybe check for this, too.

              (Who puts special characters into gene names?! Some biologists just seem to want to make life miserable for us bioinformaticians. :-| )

              Comment

              • areyes
                Senior Member
                • Aug 2010
                • 165

                #8
                Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

                perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

                And the HTSeq gtf reader does not get confused anymore!
                Last edited by areyes; 01-24-2013, 02:32 AM.

                Comment

                • Artur Jaroszewicz
                  Member
                  • Sep 2011
                  • 45

                  #9
                  Originally posted by areyes View Post
                  Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

                  perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

                  And the HTSeq gtf reader does not get confused anymore!
                  I'll use this script for next time I run into a similar problem. For now, I fixed it manually (realized I should use Perl for this, and learning Perl will take much longer than changing the approximately 20 genes that were giving me problems by hand). For now, if anyone runs into a similar issue, you can use my fixed version: https://docs.google.com/file/d/0B4Em...dvc0kycU0/edit

                  Thanks!

                  Comment

                  • Jeremy
                    Senior Member
                    • Nov 2009
                    • 190

                    #10
                    I guess it's no longer important, but the reason I was asking which command you used is that I was going to suggest switching the attribute using the -i command. Changing it to gene_id would also have fixed the problem, but then you would need to convert if you wanted gene_name.

                    Comment

                    • JustinH
                      Junior Member
                      • Aug 2014
                      • 7

                      #11
                      Originally posted by areyes View Post
                      Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

                      perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

                      And the HTSeq gtf reader does not get confused anymore!
                      Areyes,

                      When I used your perl script on my GTF, the gene names with semi-colons were completely replaced by a dash. Do you know why this could have occured? I am using the TAIR10.22 A. thaliana annotation file. Any help is appreciated.

                      Justin

                      Comment

                      • Simon Anders
                        Senior Member
                        • Feb 2010
                        • 995

                        #12
                        That was the point of Alejandro's script: to replace the apostrophes and semicolons that confused the parser with something else. I have fixed the parser a while ago, however, so there is no need any more for this.

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                          by SEQadmin2


                          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                          Here are nine questions we think about, in roughly the order they matter, before...
                          06-18-2026, 07:11 AM
                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, 06-17-2026, 06:09 AM
                        0 responses
                        31 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-09-2026, 11:58 AM
                        0 responses
                        96 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-05-2026, 10:09 AM
                        0 responses
                        117 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-04-2026, 08:59 AM
                        0 responses
                        109 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...