  • Predicted Ensembl genes, but not in RefSeq

    I'm curious: I find a lot of reads mapped to "Gm"-annotated genes from Ensembl, which are predicted genes.

    When I map to the UCSC genome (with novel discovery), I don't find anything.

    Could someone shed light on this? And are the "Gm" genes something to pursue?

    I'm using the Cufflinks pipeline for this.

  • #2
    FYI, Ensembl tends to have many more annotated genes/transcripts than UCSC/RefSeq, so I'd say it's quite normal if you can't find anything in UCSC.

    I'm not familiar with "Gm" genes though.

    • #3
      The International Nucleotide Sequence Database Collaboration convention is that the accession prefix "GM" is used for EMBL nucleotide patent entries, so it is not clear to me which annotation you actually mapped to.

      Where did you actually get your reference genome and annotation you used for the mapping run?

      I've never seen any predicted genes with that accession prefix in the Ensembl builds I've mapped to (Rat, not human in my case). I've always downloaded my mapping reference and annotation directly from Ensembl. Predicted genes use the standard "ENSRNOGxxx..." and transcripts use the standard "ENSRNOTxxx..." form and it is only in the annotation description that one can determine if it was a predicted entry or not. Those entries will show up in UCSC as predicted entries with their respective RefSeq predicted entry.
      Last edited by mbblack; 04-25-2014, 05:36 AM.
      Michael Black, Ph.D.
      ScitoVation LLC. RTP, N.C.

      • #4
        I'm using Ensembl for mouse, but downloaded from iGenomes (made for TopHat2, via Illumina).

        • #5
          I had similar "problems" using the human hg19 assembly from different sources, until I found the paper "Assessing the impact of human genome annotation choice on RNA-seq expression estimates", which supports yueluo's statement.

          • #6
            Looking in the actual "Mus_musculus.GRCm38.75.gtf" file from Ensembl, yes in the descriptors there are Gmxxxxx accessions (but those are NOT Ensembl accessions).

            E.G. gene_id "ENSMUSG00000088333"; transcript_id "ENSMUST00000157708"; exon_number "1"; gene_name "Gm22848"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm22848-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000846843";

            So, Gm22848 is actually a Flybase accession, and those entries in Ensembl and RefSeq will be handled on a case-by-case basis and manually curated, so some will not be in RefSeq at all, and those that are there are likely to be provisional entries. Odds are any of those are pseudogenes in any mammal.

            Regardless, if you want to track those, I would not use the Flybase or any other associated meta-data with those entries. Use the actual Ensembl gene or transcript IDs and they should track through UCSC and NCBI data just fine. The match to Gmxxxxx is just the best available homology match, which happens to be Drosophila genes.
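To make the "track by Ensembl ID, not by gene_name" advice concrete, here is a minimal sketch in Python that splits the attribute column of the example line above into a dictionary. The helper name `parse_gtf_attributes` and the regular expression are my own assumptions for illustration, not part of Cufflinks or any Ensembl tool, and the pattern only covers the simple `key "value";` style seen in this release 75 line.

```python
import re

# Matches key "value"; pairs as they appear in the example attribute column.
# Assumption: every value is double-quoted, as in the line quoted above.
_ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(attr_field):
    """Return the GTF attribute column (9th field) as a dict (hypothetical helper)."""
    return dict(_ATTR_RE.findall(attr_field))


example = (
    'gene_id "ENSMUSG00000088333"; transcript_id "ENSMUST00000157708"; '
    'exon_number "1"; gene_name "Gm22848"; gene_source "ensembl"; '
    'gene_biotype "snRNA"; transcript_name "Gm22848-201"; '
    'transcript_source "ensembl"; exon_id "ENSMUSE00000846843";'
)

attrs = parse_gtf_attributes(example)

# Key downstream tables on the stable Ensembl ID; keep gene_name only as a label.
stable_key = attrs["gene_id"]
label = attrs.get("gene_name", "")
print(stable_key, label, attrs["gene_biotype"])
```

Keying on `gene_id` means that a later change to the `gene_name` label does not break joins against UCSC or NCBI tables.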

            A couple of others I quickly checked do have MGI entries, but they come up as not in the current assembly. But these are all from the HAVANA project (i.e. the Human and Vertebrate Analysis and Annotation team) so these entries are going to be problematic as they will be changing as evidence for those ORFs changes.

            P.S. Bear in mind that the current Ensembl mouse build has 5935 pseudogenes (or putative pseudogenes) in it, and for many of those the annotation may be in flux and thus not necessarily synchronized across different databases. The same thing goes for the readthrough transcripts, which are also manually curated by the HAVANA team.
            Last edited by mbblack; 04-25-2014, 06:18 AM.
            Michael Black, Ph.D.
            ScitoVation LLC. RTP, N.C.

            • #7
              Great answer! Thank you!

              But, excuse my ignorance, what biologically relevant questions might be answered by analysing the Ensembl Gm genes? As you mentioned:

              E.G. gene_id "ENSMUSG00000088333"; transcript_id "ENSMUST00000157708"; exon_number "1"; gene_name "Gm22848"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm22848-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000846843";

              • #8
                Oh, sorry. I should have added I would not waste time pursuing them. They mostly, if not exclusively, appear to be pseudogenes, so unless you are specifically interested in something about pseudogenes, I'd ignore them.

                That line was just a random one I pulled from the GTF file as an example - GTF file from here: http://uswest.ensembl.org/info/data/ftp/index.html
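If you want to check for yourself how the Gm entries break down by biotype before deciding to ignore them, a rough sketch like the following could tally `gene_biotype` per distinct `gene_id`. The function name `gm_biotype_counts`, the `Gm<digits>` name pattern, and the toy records are all assumptions for illustration: only the first record reuses IDs from the example line in this thread, its coordinates are placeholders, and the `TOY_GENE_*` records are made up purely so the block runs.

```python
import re
from collections import Counter

_ATTR_RE = re.compile(r'(\S+) "([^"]*)";')
_GM_NAME = re.compile(r"^Gm\d+$")  # assumption: Gm names are "Gm" plus digits


def gm_biotype_counts(gtf_lines):
    """Count gene_biotype over distinct genes whose gene_name looks like Gm<digits>."""
    biotype_by_gene = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(_ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id and _GM_NAME.match(attrs.get("gene_name", "")):
            # One entry per gene, however many exon lines it has.
            biotype_by_gene[gene_id] = attrs.get("gene_biotype", "unknown")
    return Counter(biotype_by_gene.values())


# Toy input: placeholder coordinates; TOY_GENE_* records are invented.
toy_lines = [
    "# toy header line",
    '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "ENSMUSG00000088333"; gene_name "Gm22848"; gene_biotype "snRNA";',
    '1\tensembl\texon\t300\t400\t.\t+\t.\tgene_id "TOY_GENE_A"; gene_name "Gm00001"; gene_biotype "pseudogene";',
    '1\tensembl\texon\t500\t600\t.\t+\t.\tgene_id "TOY_GENE_A"; gene_name "Gm00001"; gene_biotype "pseudogene";',
    '1\tensembl\texon\t700\t800\t.\t+\t.\tgene_id "TOY_GENE_B"; gene_name "Gapdh"; gene_biotype "protein_coding";',
]

counts = gm_biotype_counts(toy_lines)
print(counts)
```

For the real annotation you would pass an open file handle for the downloaded GTF instead of `toy_lines`; the counts you get will depend on the release you use.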
                Last edited by mbblack; 04-25-2014, 08:27 AM.
                Michael Black, Ph.D.
                ScitoVation LLC. RTP, N.C.
