Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to add a pseduo gene into a GTF file?

    Hi all

    When do RNA-Seq analysis, after I have finished the alignment by Tophat and get the accpeted_hits.bam. Now I want to looked at if it is has differential expression at the corresponding area (We called it 'pseduo gene' which haven't been annotated by reference genome ). How could I do it ?

    I think the best way is to add this pseduo gene to reference genome GTF file. And then I could play HTSeq-count or Cuffdiff to look at the differential expression of this gene .

    Here is the effort I made so far . I have add the Chr21:100000-200000 as 'exon' (a pseudo gene) to the genes.gtf of Ensembl, rename the GTF file.
    Code:
    $  head genes_hello.gtf
    [B]29       protein_coding  exon    28820825   28834944   .       +       .       exon_id "ENSECAE00000000001"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000099999"; gene_name "HELLO"; gene_source "ensembl"; p_id "P99999"; transcript_id "ENSECAT00000099999"; transcript_name "HELLO-001"; transcript_source "ensembl"; tss_id "TSS9999";[/B]
    1       protein_coding  UTR     11193   11209   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
    1       protein_coding  exon    11193   11261   .       +       .       exon_id "ENSECAE00000079002"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
    1       protein_coding  transcript      11193   15975   .       +       .       gene_biotype "protein_coding"; gene_id "ENSECAG00000012421"; gene_name "SYCE1"; gene_source "ensembl"; p_id "P20975"; transcript_id "ENSECAT00000013004"; transcript_name "SYCE1-201"; transcript_source "ensembl"; tss_id "TSS1013";
    However, if I run the Cuffdiff directly, the output ( i.e. gene_exp.diff) didn't contain any my pseudo gene information. And if I ran HTSeq-count , it gave me the error:And if I ran HTSeq-count , it gave me the error:

    [CODE]$ htseq-count -s yes aligned_sorted.sam genes_hello.gtf>gene_count.txt

    Error occured when processing GFF file (line 1 of file genes.gtf):
    need more than 1 value to unpack
    [Exception type: ValueError, raised in __init__.py:207]


    It seems that I didn't add the pseduo gene properly. Does anyone know how to do it?

    Thank you very much!
    Last edited by super0925; 01-12-2016, 09:59 AM.

  • #2
    For reference cross-posted at: https://www.biostars.org/p/172385/

    Comment


    • #3
      I'm not sure I agree with Aminm from Biostars. He said that you need corresponding lines (exon, UTR, etc) in the GTF file but I do not believe this is true. Rather I am going with the oh-so-basic idea of bad formatting, in particular the lack of tabs due to the complaint that htseq-count gave you. Can you do a:

      Code:
      head --lines=2  genes_hello.gtf | od -c
      And send that to us? Yes, I know that the chances of a simple formatting mistake are low but I have had such errors all too many times thus always wish to check formatting first.

      Comment


      • #4
        Originally posted by westerman View Post
        I'm not sure I agree with Aminm from Biostars. He said that you need corresponding lines (exon, UTR, etc) in the GTF file but I do not believe this is true. Rather I am going with the oh-so-basic idea of bad formatting, in particular the lack of tabs due to the complaint that htseq-count gave you. Can you do a:

        Code:
        head --lines=2  genes_hello.gtf | od -c
        And send that to us? Yes, I know that the chances of a simple formatting mistake are low but I have had such errors all too many times thus always wish to check formatting first.
        I remember I have added not only exon but also UTR, but I failed. So I only keep the exon and post this thread.
        I tried your command

        Code:
        $ head --lines=2  henan.gtf | od -c
        0000000   2   9                               p   r   o   t   e   i   n
        0000020   _   c   o   d   i   n   g           e   x   o   n
        0000040       2   8   8   2   0   8   2   5               2   8   8   3
        0000060   4   9   4   4               .                               +
        0000100                               .                               e
        0000120   x   o   n   _   i   d       "   E   N   S   E   C   A   E   0
        0000140   0   0   0   0   0   0   0   0   0   1   "   ;       e   x   o
        0000160   n   _   n   u   m   b   e   r       "   1   "   ;       g   e
        0000200   n   e   _   b   i   o   t   y   p   e       "   p   r   o   t
        0000220   e   i   n   _   c   o   d   i   n   g   "   ;       g   e   n
        0000240   e   _   i   d       "   E   N   S   E   C   A   G   0   0   0
        0000260   0   0   0   9   9   9   9   9   "   ;       g   e   n   e   _
        0000300   n   a   m   e       "   H   E   L   L   O   "   ;       g   e
        0000320   n   e   _   s   o   u   r   c   e       "   e   n   s   e   m
        0000340   b   l   "   ;       p   _   i   d       "   P   9   9   9   9
        0000360   9   "   ;       t   r   a   n   s   c   r   i   p   t   _   i
        0000400   d       "   E   N   S   E   C   A   T   0   0   0   0   0   0
        0000420   9   9   9   9   9   "   ;       t   r   a   n   s   c   r   i
        0000440   p   t   _   n   a   m   e       "   H   E   L   L   O   -   0
        0000460   0   1   "   ;       t   r   a   n   s   c   r   i   p   t   _
        0000500   s   o   u   r   c   e       "   e   n   s   e   m   b   l   "
        0000520   ;       t   s   s   _   i   d       "   T   S   S   9   9   9
        0000540   9   "   ;  \n   1  \t   p   r   o   t   e   i   n   _   c   o
        0000560   d   i   n   g           U   T   R  \t   1   1   1   9   3  \t
        0000600   1   1   2   0   9  \t   .  \t   +  \t   .  \t   g   e   n   e
        0000620   _   b   i   o   t   y   p   e       "   p   r   o   t   e   i
        0000640   n   _   c   o   d   i   n   g   "   ;       g   e   n   e   _
        0000660   i   d       "   E   N   S   E   C   A   G   0   0   0   0   0
        0000700   0   1   2   4   2   1   "   ;       g   e   n   e   _   n   a
        0000720   m   e       "   S   Y   C   E   1   "   ;       g   e   n   e
        0000740   _   s   o   u   r   c   e       "   e   n   s   e   m   b   l
        0000760   "   ;       p   _   i   d       "   P   2   0   9   7   5   "
        0001000   ;       t   r   a   n   s   c   r   i   p   t   _   i   d
        0001020   "   E   N   S   E   C   A   T   0   0   0   0   0   0   1   3
        0001040   0   0   4   "   ;       t   r   a   n   s   c   r   i   p   t
        0001060   _   n   a   m   e       "   S   Y   C   E   1   -   2   0   1
        0001100   "   ;       t   r   a   n   s   c   r   i   p   t   _   s   o
        0001120   u   r   c   e       "   e   n   s   e   m   b   l   "   ;
        0001140   t   s   s   _   i   d       "   T   S   S   1   0   1   3   "
        0001160   ;  \n
        0001162
        Do you have idea?
        Last edited by super0925; 01-12-2016, 09:48 AM.

        Comment


        • #5
          Exactly what I suspected. You did not separate the fields by tabs but rather by spaces. Compare the first line (no tabs) to the second line (tabs).

          Comment


          • #6
            Originally posted by westerman View Post
            Exactly what I suspected. You did not separate the fields by tabs but rather by spaces. Compare the first line (no tabs) to the second line (tabs).
            Brilliant you are! I will try it !

            PS: So do I need to add those UTR, CDS, transcript, etc? I don't think so. I only want to look at differential gene expression of this pseudo gene (I am sure there are reads mapped in this pseudo gene) but not CDS, isoform etc. What is your opinion?

            Comment


            • #7
              Originally posted by super0925 View Post
              PS: So do I need to add those UTR, CDS, transcript, etc? I don't think so. I only want to look at differential gene expression of this pseudo gene (I am sure there are reads mapped in this pseudo gene) but not CDS, isoform etc. What is your opinion?
              I agree. Only add the pseudo gene. Just double check your tabs!

              Comment


              • #8
                Originally posted by westerman View Post
                I agree. Only add the pseudo gene. Just double check your tabs!
                Done! Thx!

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                27 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                30 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                26 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                52 views
                0 likes
                Last Post seqadmin  
                Working...
                X