SEQanswers

04-24-2014, 06:39 AM   #1
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Predicted Ensembl genes, but not in RefSeq

I'm curious: I find a lot of reads mapped to "Gm"-annotated genes from Ensembl, which are predicted genes.

When I map to the UCSC genome (with novel discovery), I don't find anything.

Could someone shed light on this? And are the "Gm" genes something to pursue?

I'm using the Cufflinks pipeline for this.
04-24-2014, 05:22 PM   #2
yueluo
Member
 
Location: Guangzhou China

Join Date: Aug 2013
Posts: 82

FYI, Ensembl tends to have many more annotated genes/transcripts than UCSC/RefSeq, so I'd say it's quite normal if you can't find anything in UCSC.

I'm not familiar with "Gm" genes though.
04-25-2014, 05:22 AM   #3
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

The convention by the International Nucleotide Sequence Database Collaboration is that the accession prefix "GM" is supposed to be used for EMBL nucleotide patent entries, so I am not clear as to just what annotation you used to map to.

Where did you actually get your reference genome and annotation you used for the mapping run?

I've never seen any predicted genes with that accession prefix in the Ensembl builds I've mapped to (Rat, not human in my case). I've always downloaded my mapping reference and annotation directly from Ensembl. Predicted genes use the standard "ENSRNOGxxx..." and transcripts use the standard "ENSRNOTxxx..." form and it is only in the annotation description that one can determine if it was a predicted entry or not. Those entries will show up in UCSC as predicted entries with their respective RefSeq predicted entry.
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.

Last edited by mbblack; 04-25-2014 at 05:36 AM.
04-25-2014, 05:28 AM   #4
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266

I'm using Ensembl for mouse, but downloaded from iGenomes (made for TopHat2, via Illumina).
04-25-2014, 05:45 AM   #5
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215

I had similar "problems" using the human hg19 assembly from different sources, until I found the paper "Assessing the impact of human genome annotation choice on RNA-seq expression estimates", which scientifically supports yueluo's statement.
04-25-2014, 06:07 AM   #6
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

Looking in the actual "Mus_musculus.GRCm38.75.gtf" file from Ensembl: yes, in the descriptors there are Gmxxxxx accessions (but those are NOT Ensembl accessions).

E.G. gene_id "ENSMUSG00000088333"; transcript_id "ENSMUST00000157708"; exon_number "1"; gene_name "Gm22848"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm22848-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000846843";
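
For anyone who wants to pull these fields out programmatically, here is a minimal Python sketch. It assumes the attribute column uses the `key "value";` layout shown in the line above; the function name `parse_gtf_attributes` is made up for illustration and is not part of any Ensembl or Cufflinks tool.

```python
import re

# Matches one `key "value";` pair in the GTF attribute column.
# Assumption: every value is double-quoted, as in the line quoted above.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(attr_field):
    """Return the key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(attr_field))


example = (
    'gene_id "ENSMUSG00000088333"; transcript_id "ENSMUST00000157708"; '
    'exon_number "1"; gene_name "Gm22848"; gene_source "ensembl"; '
    'gene_biotype "snRNA"; transcript_name "Gm22848-201"; '
    'transcript_source "ensembl"; exon_id "ENSMUSE00000846843";'
)

attrs = parse_gtf_attributes(example)
print(attrs["gene_id"], attrs["gene_name"], attrs["gene_biotype"])
```

The point of the dict is that the Ensembl ID (`gene_id`) and the Gm symbol (`gene_name`) are separate fields of the same record, so either one can be looked up from the other.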

So, Gm22848 is actually a FlyBase accession, and those entries in Ensembl and RefSeq will be handled on a case-by-case basis and manually curated, so some will not be in RefSeq at all, and those that are are likely to be provisional entries. Odds are any of those are pseudogenes in any mammal.

Regardless, if you want to track those, I would not use the FlyBase or any other associated metadata with those entries. Use the actual Ensembl gene or transcript IDs and they should track through UCSC and NCBI data just fine. The match to Gmxxxxx is just the best available homology match, which happens to be Drosophila genes.
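
One way to follow the advice above (track by Ensembl ID rather than by Gm name) is to build a lookup table from the GTF. The sketch below is only an illustration: `gm_gene_ids` is a hypothetical helper, the two demo records use invented coordinates, and the second record's ID and name are placeholders rather than real genes.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)";')   # one `key "value";` pair
GM_NAME_RE = re.compile(r"^Gm\d+$")         # names such as Gm22848


def gm_gene_ids(gtf_lines):
    """Map each Gm-style gene_name to the Ensembl gene_id it is listed under."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        name = attrs.get("gene_name", "")
        if GM_NAME_RE.match(name) and "gene_id" in attrs:
            mapping[name] = attrs["gene_id"]
    return mapping


# Demo records: coordinates are invented; the second gene is a placeholder.
demo = [
    '1\tsnRNA\texon\t100\t200\t.\t+\t.\t'
    'gene_id "ENSMUSG00000088333"; gene_name "Gm22848"; gene_biotype "snRNA";',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\t'
    'gene_id "ENSMUSG99999999999"; gene_name "Example1"; gene_biotype "protein_coding";',
]

gm_map = gm_gene_ids(demo)
print(gm_map)
```

On a real annotation the same function can be fed an open file handle instead of the demo list, since it only iterates over lines.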

A couple of others I quickly checked do have MGI entries, but they come up as not in the current assembly. But these are all from the HAVANA project (i.e. the Human and Vertebrate Analysis and Annotation team) so these entries are going to be problematic as they will be changing as evidence for those ORFs changes.

P.S. Bear in mind that the current Ensembl mouse build has 5935 pseudogenes (or putative pseudogenes) in it, and for many of those the annotation may be in flux and thus not necessarily synchronized across different databases. The same thing goes for the readthrough transcripts, which are also manually curated by the HAVANA team.
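
A rough way to check such per-biotype counts on whatever GTF you have is to tally distinct gene IDs by their `gene_biotype` attribute. This is a sketch under assumptions: `genes_per_biotype` is a hypothetical helper, the demo records are invented, and which biotype labels should count as "pseudogene" depends on the Ensembl release, so it is not claimed to reproduce the figure quoted above.

```python
import re
from collections import Counter

ATTR_RE = re.compile(r'(\S+) "([^"]*)";')   # one `key "value";` pair


def genes_per_biotype(gtf_lines):
    """Count distinct gene_ids for each gene_biotype value in a GTF."""
    biotype_of = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        biotype = attrs.get("gene_biotype")
        if gene_id and biotype:
            # A gene appears on many lines (one per exon); keep it once.
            biotype_of[gene_id] = biotype
    return Counter(biotype_of.values())


# Invented demo records: two exons of one gene plus two placeholder genes.
demo = [
    '1\tsnRNA\texon\t100\t150\t.\t+\t.\tgene_id "ENSMUSG00000088333"; gene_biotype "snRNA";',
    '1\tsnRNA\texon\t160\t200\t.\t+\t.\tgene_id "ENSMUSG00000088333"; gene_biotype "snRNA";',
    '2\tpseudogene\texon\t500\t900\t.\t-\t.\tgene_id "ENSMUSG99999999998"; gene_biotype "pseudogene";',
    '3\tprotein_coding\texon\t10\t90\t.\t+\t.\tgene_id "ENSMUSG99999999999"; gene_biotype "protein_coding";',
]

counts = genes_per_biotype(demo)
for biotype, n in sorted(counts.items()):
    print(biotype, n)
```

Counting distinct gene IDs rather than lines matters here, because a GTF lists each gene once per exon or other feature.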

Last edited by mbblack; 04-25-2014 at 06:18 AM.
04-25-2014, 07:49 AM   #7
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266

Great answer! Thank you!

But, excuse my ignorance, what biologically relevant questions might be answered by analysing the Ensembl Gm genes? As you mentioned:

E.G. gene_id "ENSMUSG00000088333"; transcript_id "ENSMUST00000157708"; exon_number "1"; gene_name "Gm22848"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm22848-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000846843";
04-25-2014, 08:24 AM   #8
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

Oh, sorry. I should have added I would not waste time pursuing them. They mostly, if not exclusively, appear to be pseudogenes, so unless you are specifically interested in something about pseudogenes, I'd ignore them.

That line was just a random one I pulled from the GTF file as an example - GTF file from here: http://uswest.ensembl.org/info/data/ftp/index.html

Last edited by mbblack; 04-25-2014 at 08:27 AM.