Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • super0925
    Senior Member
    • Feb 2014
    • 206

    how to add a pseduo gene into a GTF file?

    Hi all

    When do RNA-Seq analysis, after I have finished the alignment by Tophat and get the accpeted_hits.bam. Now I want to looked at if it is has differential expression at the corresponding area (We called it 'pseduo gene' which haven't been annotated by reference genome ). How could I do it ?

    I think the best way is to add this pseduo gene to reference genome GTF file. And then I could play HTSeq-count or Cuffdiff to look at the differential expression of this gene .

    Here is the effort I made so far . I have add the Chr21:100000-200000 as 'exon' (a pseudo gene) to the genes.gtf of Ensembl, rename the GTF file.
    Code:
    $  head genes_hello.gtf
    [B]29       protein_coding  exon    28820825   28834944   .       +       .       exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";[/B]
    1       protein_coding  UTR     11193   11209   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
    1       protein_coding  exon    11193   11261   .       +       .       exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
    1       protein_coding  transcript      11193   15975   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
    However, if I run the Cuffdiff directly, the output ( i.e. gene_exp.diff) didn't contain any my pseudo gene information. And if I ran HTSeq-count , it gave me the error:And if I ran HTSeq-count , it gave me the error:

    [CODE]$ htseq-count -s yes aligned_sorted.sam genes_hello.gtf>gene_count.txt

    Error occured when processing GFF file (line 1 of file genes.gtf):
    need more than 1 value to unpack
    [Exception type: ValueError, raised in __init__.py:207]


    It seems that I didn't add the pseduo gene properly. Does anyone know how to do it?

    Thank you very much!
    Last edited by super0925; 01-12-2016, 09:59 AM.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    For reference cross-posted at: https://www.biostars.org/p/172385/

    Comment

    • westerman
      Rick Westerman
      • Jun 2008
      • 1104

      #3
      I'm not sure I agree with Aminm from Biostars. He said that you need corresponding lines (exon, UTR, etc) in the GTF file but I do not believe this is true. Rather I am going with the oh-so-basic idea of bad formatting, in particular the lack of tabs due to the complaint that htseq-count gave you. Can you do a:

      Code:
      head --lines=2  genes_hello.gtf | od -c
      And send that to us? Yes, I know that the chances of a simple formatting mistake are low but I have had such errors all too many times thus always wish to check formatting first.

      Comment

      • super0925
        Senior Member
        • Feb 2014
        • 206

        #4
        Originally posted by westerman View Post
        I'm not sure I agree with Aminm from Biostars. He said that you need corresponding lines (exon, UTR, etc) in the GTF file but I do not believe this is true. Rather I am going with the oh-so-basic idea of bad formatting, in particular the lack of tabs due to the complaint that htseq-count gave you. Can you do a:

        Code:
        head --lines=2  genes_hello.gtf | od -c
        And send that to us? Yes, I know that the chances of a simple formatting mistake are low but I have had such errors all too many times thus always wish to check formatting first.
        I remember I have added not only exon but also UTR, but I failed. So I only keep the exon and post this thread.
        I tried your command

        Code:
        $ head --lines=2  henan.gtf | od -c
        0000000   2   9                               p   r   o   t   e   i   n
        0000020   _   c   o   d   i   n   g           e   x   o   n
        0000040       2   8   8   2   0   8   2   5               2   8   8   3
        0000060   4   9   4   4               .                               +
        0000100                               .                               e
        0000120   x   o   n   _   i   d       "   E   N   S   E   C   A   E   0
        0000140   0   0   0   0   0   0   0   0   0   1   "   ;       e   x   o
        0000160   n   _   n   u   m   b   e   r       "   1   "   ;       g   e
        0000200   n   e   _   b   i   o   t   y   p   e       "   p   r   o   t
        0000220   e   i   n   _   c   o   d   i   n   g   "   ;       g   e   n
        0000240   e   _   i   d       "   E   N   S   E   C   A   G   0   0   0
        0000260   0   0   0   9   9   9   9   9   "   ;       g   e   n   e   _
        0000300   n   a   m   e       "   H   E   L   L   O   "   ;       g   e
        0000320   n   e   _   s   o   u   r   c   e       "   e   n   s   e   m
        0000340   b   l   "   ;       p   _   i   d       "   P   9   9   9   9
        0000360   9   "   ;       t   r   a   n   s   c   r   i   p   t   _   i
        0000400   d       "   E   N   S   E   C   A   T   0   0   0   0   0   0
        0000420   9   9   9   9   9   "   ;       t   r   a   n   s   c   r   i
        0000440   p   t   _   n   a   m   e       "   H   E   L   L   O   -   0
        0000460   0   1   "   ;       t   r   a   n   s   c   r   i   p   t   _
        0000500   s   o   u   r   c   e       "   e   n   s   e   m   b   l   "
        0000520   ;       t   s   s   _   i   d       "   T   S   S   9   9   9
        0000540   9   "   ;  \n   1  \t   p   r   o   t   e   i   n   _   c   o
        0000560   d   i   n   g           U   T   R  \t   1   1   1   9   3  \t
        0000600   1   1   2   0   9  \t   .  \t   +  \t   .  \t   g   e   n   e
        0000620   _   b   i   o   t   y   p   e       "   p   r   o   t   e   i
        0000640   n   _   c   o   d   i   n   g   "   ;       g   e   n   e   _
        0000660   i   d       "   E   N   S   E   C   A   G   0   0   0   0   0
        0000700   0   1   2   4   2   1   "   ;       g   e   n   e   _   n   a
        0000720   m   e       "   S   Y   C   E   1   "   ;       g   e   n   e
        0000740   _   s   o   u   r   c   e       "   e   n   s   e   m   b   l
        0000760   "   ;       p   _   i   d       "   P   2   0   9   7   5   "
        0001000   ;       t   r   a   n   s   c   r   i   p   t   _   i   d
        0001020   "   E   N   S   E   C   A   T   0   0   0   0   0   0   1   3
        0001040   0   0   4   "   ;       t   r   a   n   s   c   r   i   p   t
        0001060   _   n   a   m   e       "   S   Y   C   E   1   -   2   0   1
        0001100   "   ;       t   r   a   n   s   c   r   i   p   t   _   s   o
        0001120   u   r   c   e       "   e   n   s   e   m   b   l   "   ;
        0001140   t   s   s   _   i   d       "   T   S   S   1   0   1   3   "
        0001160   ;  \n
        0001162
        Do you have idea?
        Last edited by super0925; 01-12-2016, 09:48 AM.

        Comment

        • westerman
          Rick Westerman
          • Jun 2008
          • 1104

          #5
          Exactly what I suspected. You did not separate the fields by tabs but rather by spaces. Compare the first line (no tabs) to the second line (tabs).

          Comment

          • super0925
            Senior Member
            • Feb 2014
            • 206

            #6
            Originally posted by westerman View Post
            Exactly what I suspected. You did not separate the fields by tabs but rather by spaces. Compare the first line (no tabs) to the second line (tabs).
            Brilliant you are! I will try it !

            PS: So do I need to add those UTR, CDS, transcript, etc? I don't think so. I only want to look at differential gene expression of this pseudo gene (I am sure there are reads mapped in this pseudo gene) but not CDS, isoform etc. What is your opinion?

            Comment

            • westerman
              Rick Westerman
              • Jun 2008
              • 1104

              #7
              Originally posted by super0925 View Post
              PS: So do I need to add those UTR, CDS, transcript, etc? I don't think so. I only want to look at differential gene expression of this pseudo gene (I am sure there are reads mapped in this pseudo gene) but not CDS, isoform etc. What is your opinion?
              I agree. Only add the pseudo gene. Just double check your tabs!

              Comment

              • super0925
                Senior Member
                • Feb 2014
                • 206

                #8
                Originally posted by westerman View Post
                I agree. Only add the pseudo gene. Just double check your tabs!
                Done! Thx!

                Comment

                Latest Articles

                Collapse

                • GATTACAT
                  Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                  by GATTACAT
                  Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                  07-01-2026, 11:43 AM
                • SEQadmin2
                  Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                  by SEQadmin2


                  I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                  Here are nine questions we think about, in roughly the order they matter, before...
                  06-18-2026, 07:11 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 07-02-2026, 11:08 AM
                0 responses
                7 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-30-2026, 05:37 AM
                0 responses
                12 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-26-2026, 11:10 AM
                0 responses
                20 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-17-2026, 06:09 AM
                0 responses
                54 views
                0 reactions
                Last Post SEQadmin2  
                Working...