Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTseq count parsing headache

    Hi,

    I've searched the forum for a problem similar to mine, and there is a thread dealing with, but it's form two years ago and sometimes it's hard to get help on resurrected threads.

    So, I have my RNAseq data back, I've aligned it with Bowtie2. I've sorted and coverted the sam files into sorted bam files with samtools. I've now trying to use HTseq for couting reads, but I get a parsing error.

    Specifically, I've running this command

    Code:
     samtools view 11A_align_sort.bam | htseq-count -s no  -i ID - ~/aedes_genomic/Aedes-aegypti-Liverpool_BASEFEATURES.gff3 > 11A_count
    and get the error

    Code:
    Error occured in line 15 of file /home/emiliano/aedes_genomic/Aedes-aegypti-Liverpool_BASEFEATURES.gff3.
    Error: Failure parsing GFF attribute line
    [Exception type: ValueError, raised in __init__.py:171]
    For reference, these are the first 25 lines from the gff3 file:

    Code:
    ##gff-version 3
    ##feature-ontology so.obo
    ##attribute-ontology gff3_attributes.obo
    #
    # Dumped from  database.
    # 
    # using SO:0000704 for VectorBase Gene
    # using SO:0000234 for VectorBase Transcript
    # using SO:0000147 for VectorBase Exon
    # using SO:0000316 for VectorBase CDS
    # using SO:0000204 for VectorBase 5'UTR 
    # using SO:0000205 for VectorBase 3'UTR
    # 
    ##sequence-region supercontig:AaegL1:supercont1.1:1:5856339:1
    supercont1.1	VectorBase	contig	1	5856339	.	.	.	ID=supercont1.1;molecule_type=dsDNA;GenBank:supercontig:AaegL1:supercont1.1:1:5856339:1;translation_table=1;topology=linear;localization=chromosomal;
    supercont1.1	VectorBase	gene	35414	53420	.	+	.	ID=AAEL000064;
    supercont1.1	VectorBase	mRNA	35414	53420	.	+	.	ID=AAEL000064-RA;Parent=AAEL000064;Dbxref=UniProtKB:Q8T4S1,UniProtKB:Q8T4S2,UniProtKB:Q8T4S3,UniProtKB:Q8T4S4,UniProtKB:Q8T4S5,UniProtKB:Q8T4S6,UniProtKB:Q9GSZ4,GenBank:AF288384,GenBank:AY064094,GenBank:AY064095,GenBank:AY064096,GenBank:AY064097,GenBank:AY064098,GenBank:AY064099,GenBank:CH477186,protein_id:AAG01014,protein_id:AAL85595,protein_id:AAL85596,protein_id:AAL85597,protein_id:AAL85598,protein_id:AAL85599,protein_id:AAL85600,protein_id:EAT48898,UniParc:UPI000007F997;description=hypothetical protein;
    supercont1.1	VectorBase	exon	35414	35644	.	+	.	ID=E036654A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	35699	35901	.	+	.	ID=E036655A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	35993	36098	.	+	.	ID=E036656A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	52193	52851	.	+	.	ID=E036657A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	exon	52973	53420	.	+	.	ID=E036658A;Parent=AAEL000064-RA;
    supercont1.1	VectorBase	five_prime_utr	35414	35523	.	+	.	Parent=AAEL000064-RA
    supercont1.1	VectorBase	CDS	35524	35644	.	+	0	Parent=AAEL000064-RA;
    supercont1.1	VectorBase	CDS	35699	35901	.	+	2	Parent=AAEL000064-RA;
    I've not entirely sure what part of the line is causing the parsing to fail. It says it's on line 15, which is the first non-commented line, but it's a contig. Wouldn't HTseq skip this and go for the lines with exons?
    Last edited by pecanton; 07-08-2013, 09:00 AM.

  • #2
    Also, I saw an error in my code to run HTseq. the -i ID option I was using only works if I also put -t gene, since that's the type of feature that has that ID. However, I'm not sure if I should count by exon or by gene. I suppose it depends on the downstream analysis I want to do, but I'm not very clear on the benefits of using either couting procedure.

    Comment


    • #3
      For some reason htseq-count does not like everything in line 15 after the second semicolon.

      Code:
      supercont1.1	VectorBase	contig	1	5856339	.	.	.	ID=supercont1.1;molecule_type=dsDNA;[COLOR="Red"]GenBank:supercontig:AaegL1:supercont1.1:1:5856339:1;translation_table=1;topology=linear;localization=chromosomal;[/COLOR]
      If I delete everything highlighted in red, it proceeds normally.

      Comment


      • #4
        Originally posted by pecanton View Post
        Also, I saw an error in my code to run HTseq. the -i ID option I was using only works if I also put -t gene, since that's the type of feature that has that ID. However, I'm not sure if I should count by exon or by gene. I suppose it depends on the downstream analysis I want to do, but I'm not very clear on the benefits of using either couting procedure.
        Thats because your ID's for your exons are all different and would need to be modified to match the ID for the gene. That would be a pain in the ***.

        Comment


        • #5
          Well, since the Aedes aegypti genome is not fully assembled, and exists as supercotings (more than 2000), it would still be a pain to go and edit every one of the "contig" lines from the gff3. I'll report back if I find an easier way.

          Comment


          • #6
            If it's just choking on the "contig" entries, which will be later ignored anyway, then just write a small script to edit those and leave the others unchanged. Alternatively, you should just be able to use grep to remove them (something like "grep -v -w 'contig' ...").

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Essential Discoveries and Tools in Epitranscriptomics
              by seqadmin




              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
              04-22-2024, 07:01 AM
            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 08:47 AM
            0 responses
            14 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            60 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            60 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            54 views
            0 likes
            Last Post seqadmin  
            Working...
            X