Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq 'gene_id' attribute error

    I have just started using HTSeq and found problems using GFF file. I have checked some of the threads in seqanswers but couldn't fix.

    Here is the command and error:

    htseq-count sorted.sam ../ref_CanFam3.1_top_level.gff3
    Error occured in line 9 of file ../ref_CanFam3.1_top_level.gff3.
    Error: Feature id1 does not contain a 'gene_id' attribute
    [Exception type: SystemExit, raised in count.py:55]


    Below are a few lines of my GFF file:


    #gff-version 3
    #!gff-spec-version 1.20
    #!processor NCBI annotwriter
    ##sequence-region chr1 1 122678785
    ##species http://www.ncbi.nlm.nih.gov/Taxonomy...ax.cgi?id=9615
    chr1 RefSeq region 1 122678785 . + . ID=id0;Name=1;Dbxref=taxon:9615;breed=boxer;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;sub-species=familiaris
    chr1 Gnomon gene 251990 322081 . - . ID=gene0;Name=ENPP1;Dbxref=GeneID:100856150;gbkey=Gene;gene=ENPP1;part=1%2F1
    chr1 Gnomon mRNA 251990 322081 . - . ID=rna0;Name=XM_003638785.2;Parent=gene0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 289674 289746 . - . ID=id2;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 287787 287903 . - . ID=id3;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 286842 286967 . - . ID=id4;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
    chr1 Gnomon exon 285904 285964 . - . ID=id5;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2



    Could someone help to fix this?

  • #2
    Either specify a new value for "-i " or fix the GFF so it has a gene_id field.

    Comment


    • #3
      Originally posted by dpryan View Post
      Either specify a new value for "-i " or fix the GFF so it has a gene_id field.
      Could you give a sample output line of the gff file showing how it should be modified?

      Thanks.

      Comment


      • #4
        Code:
         chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150
        Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.

        Comment


        • #5
          Originally posted by dpryan View Post
          Code:
           chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150
          Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.
          I have tried using:
          htseq-count -i gene file.sam file.gff > count.txt

          It processed without any error but the counts are zero.The output looks like:

          A1BG 0
          A1CF 0
          A2M 0
          A2ML1 0
          .
          .
          .
          no_feature 38443217
          ambiguous 0
          too_low_aQual 0
          not_aligned 0
          alignment_not_unique 9202815

          I also tried by changing the gff file to the format suggested by you:

          ##gff-version 3
          #!gff-spec-version 1.20
          #!processor NCBI annotwriter
          ##sequence-region chr1 1 122678785
          ##species http://www.ncbi.nlm.nih.gov/Taxonomy...ax.cgi?id=9615
          chr1 RefSeq region 1 122678785 . + . ID=id0;Name=1;Dbxref=taxon:9615;breed=boxer;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;sub-species=familiaris,gene_id=9615
          chr1 Gnomon gene 251990 322081 . - . ID=gene0;Name=ENPP1;Dbxref=GeneID:100856150;gbkey=Gene;gene=ENPP1;part=1%2F1,gene_id=100856150
          chr1 Gnomon mRNA 251990 322081 . - . ID=rna0;Name=XM_003638785.2;Parent=gene0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2,gene_id=100856150

          The output and error:
          htseq-count sorted.sam ../refGene_canFam3.gff > ref_counts.txt
          Error occured in line 6 of file ../refGene_canFam3.gff.
          Error: need more than 1 value to unpack
          [Exception type: ValueError, raised in __init__.py:220]



          Any further tips to fix this?

          Comment


          • #6
            Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).

            Comment


            • #7
              Originally posted by dpryan View Post
              Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).
              I have placed semicolon before gene_id and tried htseq-count.

              Error occured in line 6 of file ../refGene_canFam3.gff.
              Error: need more than 1 value to unpack
              [Exception type: ValueError, raised in __init__.py:220


              And infact it gave zerou counts with -i gene. so it means that there should be something else wrong. Could you let me know the commands or steps which you mean to check?
              Last edited by meher; 11-14-2013, 07:01 AM.

              Comment


              • #8
                Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

                The chromosome names are the same between your reference and the annotation file, I hope.

                Comment


                • #9
                  Originally posted by dpryan View Post
                  Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

                  The chromosome names are the same between your reference and the annotation file, I hope.

                  Do you mean "NH:i:1"? because "NH:i:0" seems to be absent when it tried

                  samtools view -S accepted_hits.sam | grep "NH:i:0".

                  Comment


                  • #10
                    Mea culpa, yeah NH:i:1 would be correct :P

                    Comment


                    • #11
                      Originally posted by dpryan View Post
                      Mea culpa, yeah NH:i:1 would be correct :P
                      I have done what u have suggested:

                      htseq-count -i gene mini.sam gff3 -o mini_annot.sam

                      So what should be observed from mini_annot.sam?

                      $ wc -l mini.sam
                      10000 mini.sam
                      $ wc -l mini_annot.sam
                      145 mini_annot.sam

                      It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.

                      Any clue out of this?

                      Comment


                      • #12
                        I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those to ensure that they actually do map to something in the gff file.
                        Last edited by dpryan; 11-14-2013, 10:51 AM.

                        Comment


                        • #13
                          Originally posted by dpryan View Post
                          I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those in the reference GTF to ensure that they actually do map to something in the gff file.
                          There are a few mapping to the features in gff file. It is weird and no clue.

                          Comment


                          • #14
                            Maybe take a subset of the gff file where some of those should map and test that. In general, htseq-count works well, so there's likely something off in the annotation somewhere. So start with a small chunk of the annotation such that things work and then just increase the size until it breaks. You will have then found the source of the problem.

                            Alternatively you can always try summarizeOverlaps() in GenomicRanges in R or featureCounts from subread. Perhaps one of those will ignore whatever is breaking htseq-count.

                            Comment


                            • #15
                              Originally posted by meher View Post
                              It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.
                              Just post the SAM file, so we can see what's wrong.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              23 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X