Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq: mismatched quotes issue?

    Hi all,

    I have been using HTSeq and DESeq for my RNA Seq pipeline at UCLA, and everything is great except for one thing: I am having trouble with counting reads with HTSeq for Arabidopsis Thaliana. I have so far used two different GTFs, and they both return the same error:

    Traceback (most recent call last):
    File "python_scripts/sam_to_gene_array_2.py", line 80, in <module>
    main()
    File "python_scripts/sam_to_gene_array_2.py", line 41, in main
    for feature in gtf:
    File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 215, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
    File "/u/home/mcdb/arturj/.local/lib/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_64.egg/HTSeq/__init__.py", line 168, in parse_GFF_attribute_string
    raise ValueError, "The attribute string seems to contain mismatched quotes."
    ValueError: The attribute string seems to contain mismatched quotes.
    I've looked through the gtf, and I haven't been able to find any mismatched quotes or anything out of the ordinary, except for the attribute line, which has slightly different fields (does not have gene_biotype, has p_id and tss_id). Here's a sample of the gtf file I'm using:

    1 protein_coding exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; seqedit "false"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    1 protein_coding CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    1 protein_coding start_codon 3760 3762 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "ANAC001"; p_id "P21642"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    1 protein_coding CDS 3996 4276 . + 2 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "ANAC001"; p_id "P21642"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22540";
    Has anyone run into the same problem? I've used two different GTFs based on someone else's suggestion, one from Ensembl plants and another from iGenomes, and they both have the same error.

    Any input or suggestions would be greatly appreciated. I've posted this on a different thread, but it was a bit of a different issue, so I've started a new one.

  • #2
    Might help if you list the commands you used for HT-Seq.

    Comment


    • #3
      I would use the gff files from TAIR as those will be the most up to date and original files.

      TAIR10 can be found here, it is the most recent genome version: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/

      TAIR9 is still often used, it is here:ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/

      The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.

      Comment


      • #4
        I mean what commands
        Code:
        htseq-count [options] <sam_file> <gff_file>
        for example
        Code:
        htseq-count -t gene -i gene_id <sam_file> <gff_file>

        Comment


        • #5
          Originally posted by chadn737 View Post
          I would use the gff files from TAIR as those will be the most up to date and original files.

          TAIR10 can be found here, it is the most recent genome version: ftp://ftp.arabidopsis.org/../../Gene...e/TAIR10_gff3/

          TAIR9 is still often used, it is here:ftp://ftp.arabidopsis.org/../../Gene...se/TAIR9_gff3/

          The TAIR gff files have some slight differences requiring you to give HTSeq some additional details, but they work great.
          Thanks for the input, I'm trying it now. To the previous response, I pretty much followed Anders' example on this:
          #!/usr/bin/python2.7 -tt

          import HTSeq
          import itertools
          import sys

          def main():

          sys.stderr.write("Initializing program.\n")

          #human
          if sys.argv[1] == 'hg19':
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/hg19.ensembl.gtf")
          #mouse
          elif sys.argv[1] == 'mm9':
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/mm9.ensembl.gtf")
          elif sys.argv[1] == 'TAIR10':
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/TAIR10_GFF3_genes.gff")
          else:
          gtf = HTSeq.GFF_Reader("/u/home/mcdb/arturj/" + sys.argv[1] + ".ensembl.gtf")
          ####
          '''
          path = sys.argv[2]
          if not path[-1] == '/':
          path += '/'
          '''
          filelist = []
          for file in sys.argv[2:]:
          name_split = file.split('/')
          filelist.append(name_split[-1])
          if len(name_split) > 1:
          path = '/'.join(name_split[:-1]) + '/'
          if not path[0] == '/':
          path = '/' + path
          else:
          path = './'
          samplelist = []
          for entry in filelist:
          samplelist.append(entry[:-18])
          ####

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True)
          for feature in gtf:
          if feature.type == "exon":
          exons[feature.iv] += feature.name

          sys.stderr.write("Created exon array.\n")

          counts = {}
          for feature in gtf:
          if feature.type == "exon":
          counts[feature.name] = [0] * len(filelist)

          sys.stderr.write("Initialized count dict.\n")

          for filenum in range(len(filelist)):
          sam_file = HTSeq.SAM_Reader( path + filelist[filenum] )
          for alnmt in sam_file:
          if alnmt.aligned:
          intersection_set = None
          for iv2, step_set in exons[alnmt.iv].steps():
          if intersection_set is None:
          intersection_set = step_set.copy()
          else:
          intersection_set.intersection_update(step_set)
          if len(intersection_set) == 1:
          counts[list(intersection_set)[0]][filenum] += 1
          sys.stderr.write("Counted hits per gene for " + samplelist[filenum] + ".\n")

          sys.stderr.write("Printing output.\n")

          sys.stdout.write('Gene\t' + '\t'.join(samplelist) + '\n')
          for gene in sorted(counts.keys()):
          sys.stdout.write(gene)
          for filenum in range(len(filelist)):
          sys.stdout.write('\t' + str(counts[gene][filenum]))
          sys.stdout.write('\n')

          sys.stderr.write("Done.\n")

          if __name__ == "__main__":
          main()
          Artur

          Comment


          • #6
            I seem to have found my problem -- one of the lines in the GTF file has a semicolon in the gene_name: gene_name "PIP1;3"

            I hope this resolves my problem!

            Comment


            • #7
              Ah, that might explain it. On the other hand, maybe not: a semicolon is not a quote. However, I have a dim recollection that I have once seen a prime (a.k.a. single quote: ' ) in an Arabidopsis gene name. So, maybe check for this, too.

              (Who puts special characters into gene names?! Some biologists just seem to want to make life miserable for us bioinformaticians. :-| )

              Comment


              • #8
                Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

                perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

                And the HTSeq gtf reader does not get confused anymore!
                Last edited by areyes; 01-24-2013, 02:32 AM.

                Comment


                • #9
                  Originally posted by areyes View Post
                  Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

                  perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

                  And the HTSeq gtf reader does not get confused anymore!
                  I'll use this script for next time I run into a similar problem. For now, I fixed it manually (realized I should use Perl for this, and learning Perl will take much longer than changing the approximately 20 genes that were giving me problems by hand). For now, if anyone runs into a similar issue, you can use my fixed version: https://docs.google.com/file/d/0B4Em...dvc0kycU0/edit

                  Thanks!

                  Comment


                  • #10
                    I guess it's no longer important, but the reason I was asking which command you used is that I was going to suggest switching the attribute using the -i command. Changing it to gene_id would also have fixed the problem, but then you would need to convert if you wanted gene_name.

                    Comment


                    • #11
                      Originally posted by areyes View Post
                      Yes! this semicolons in gene names also made me suffer at some point. If it is useful for anyone, this perl one liner removes these annoying semicolons and turns then into a "-":

                      perl -e 'open(FILE, "annotationFile.gtf"); while(<FILE>){$_ =~ s/\"\S+;\S+\"/\1\-\2/g; print $_;}' > annotationFile.modified.gtf

                      And the HTSeq gtf reader does not get confused anymore!
                      Areyes,

                      When I used your perl script on my GTF, the gene names with semi-colons were completely replaced by a dash. Do you know why this could have occured? I am using the TAIR10.22 A. thaliana annotation file. Any help is appreciated.

                      Justin

                      Comment


                      • #12
                        That was the point of Alejandro's script: to replace the apostrophes and semicolons that confused the parser with something else. I have fixed the parser a while ago, however, so there is no need any more for this.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        18 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        22 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        17 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        48 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X