SEQanswers

11-30-2013, 04:15 PM   #1
sp144
Junior Member
Location: Boston, MA
Join Date: Oct 2009
Posts: 7

Ensembl/NCBI/UCSC mouse gene annotations for cufflinks

Dear seqA community,

I'm assembling transcripts on the mouse reference annotations (*.gtf files) provided by Ensembl, NCBI and UCSC. Ideally, I would like to use Ensembl, because they annotate genes as protein-coding, non-coding, pseudogenes, etc. But I have a problem with Ensembl: some important transcripts are not in the database, for example Kcnq1ot1 and Ipw.

Question #1: Why is that? Should I expect these and other similar genes to be included in a future version of Ensembl?

Both RefSeq and UCSC have entries for these genes, but they lack the convenient categorization provided by Ensembl (protein-coding, non-coding, pseudogenes, etc.).

Question #2: I have been unable to find an equivalent categorization file matching UCSC or NCBI identifiers. Can someone point me in the right direction?

Thank you for any advice you can give!

Last edited by sp144; 11-30-2013 at 04:46 PM.
12-01-2013, 03:05 PM   #2
jeppepeppe
Junior Member
Location: Sydney
Join Date: Jun 2013
Posts: 1

I would try the new Gencode annotation file for mouse. It should be the most comprehensive annotation out there.
http://www.gencodegenes.org/mouse_stats.html

Sorry, misunderstood.
Not sure why it's not in there.

Last edited by jeppepeppe; 12-01-2013 at 03:12 PM.
12-01-2013, 03:50 PM   #3
sp144
Junior Member
Location: Boston, MA
Join Date: Oct 2009
Posts: 7

Thank you jeppepeppe,

That was a good suggestion, but I checked and both examples are indeed missing. From what I can tell right now, my only options are to:
a.) exclude genes not listed in Ensembl
b.) use the UCSC annotation and ID converter tools to retrieve the Ensembl annotation matching UCSC IDs. But in this process I'll lose ~10% of my data, so it's not ideal:

http://www.scribd.com/doc/18966500/Id-Converters-Test
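
The lookup behind options (a) and (b) could be sketched roughly as follows in Python. This is only an illustration: the GTF records and gene symbols are made up, and the attribute keys `gene_name` and `gene_biotype` are assumed to be present, which depends on the annotation source and release. Matching on lower-cased gene symbols is a simplification and not a substitute for a proper ID converter.

```python
import re

# Made-up, tab-separated GTF-style records for illustration only (not real annotation).
SAMPLE_GTF = "\n".join([
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "GeneA"; gene_biotype "protein_coding";',
    'chr2\texample\texon\t300\t400\t.\t-\t.\tgene_id "G2"; gene_name "GeneB"; gene_biotype "pseudogene";',
])

# Matches key "value" pairs in the 9th (attributes) column.
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def biotype_by_symbol(gtf_text):
    """Map lower-cased gene symbol -> gene biotype from GTF text."""
    table = {}
    for line in gtf_text.splitlines():
        columns = line.split("\t")
        if line.startswith("#") or len(columns) < 9:
            continue  # skip header comments and malformed lines
        attributes = dict(ATTRIBUTE.findall(columns[8]))
        symbol = attributes.get("gene_name")      # assumed attribute key
        biotype = attributes.get("gene_biotype")  # assumed attribute key
        if symbol and biotype:
            table[symbol.lower()] = biotype
    return table


def annotate(symbols, table):
    """Split symbols into those with a biotype and those left unmatched."""
    matched, missing = {}, []
    for symbol in symbols:
        biotype = table.get(symbol.lower())
        if biotype is None:
            missing.append(symbol)  # option (a) would drop these genes
        else:
            matched[symbol] = biotype
    return matched, missing


table = biotype_by_symbol(SAMPLE_GTF)
matched, missing = annotate(["GeneA", "geneb", "GeneC"], table)
print(matched)
print(missing)
```

The `missing` list is the part of the data that would be lost under either option, so its size gives a quick estimate of the cost before committing to one annotation.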
12-02-2013, 01:51 AM   #4
Giulietta EnsemblHelpdesk
Junior Member
Location: UK
Join Date: Dec 2010
Posts: 4

Hi,

It's good to hear the biotype categorizations are useful to you.

Ensembl will have a more up-to-date mouse gene set than what's on the GENCODE page, as the GENCODE set has been taken from a previous release of Ensembl. (GENCODE uses Ensembl genes, i.e. the merged set between Ensembl automatic annotation and Vega/Havana manual annotation.)

We will have an update to the mouse genes in the next release (e74, i.e. release 74), due out this week. This will include updated Vega/Havana manual annotation. I have checked the first gene you mention (KCNQ1OT1) on our test site, and it will be present in the next release.

I hope that helps.
12-02-2013, 04:44 PM   #5
sp144
Junior Member
Location: Boston, MA
Join Date: Oct 2009
Posts: 7

Thank you, Giulietta!
That is indeed very helpful news and very lucky for me! However, my data is aligned to mouse mm9 (build 37). Will the e74 annotation only be available for mm10/GRCm38 coordinates? Will it be possible to perform a simple liftover back to mm9 coordinates?

I could of course re-align to mm10 (build 38), but for reasons relating to my custom-built pipeline, I'd prefer to stay in mm9 (build 37) if at all possible.
Thank you and Best wishes!

P.S. On a related note, I'm a bit unclear as to why the transcript biotypes and gene biotypes differ. Is it because some transcripts of protein-coding genes are not translated, etc.?

Last edited by sp144; 12-02-2013 at 04:46 PM.
12-03-2013, 12:36 AM   #6
Emily_Ensembl
Member
Location: Cambridge UK
Join Date: Dec 2013
Posts: 12

Hi sp144

We don't update old assemblies with the new annotation, so for NCBIm37 you will only see the release 67 annotation from May 2012, as that was the last release with the old assembly.

Gene and transcript biotypes differ because a gene will have multiple transcripts, which will each have their own biotypes. For example, this gene has some coding and some non-coding transcripts.
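
To see this in a GTF file, one could group the transcript biotypes under each gene, roughly as in the Python sketch below. The records and IDs are invented for illustration, and the attribute keys `gene_biotype` and `transcript_biotype` are an assumption: some releases and sources name them differently or store the biotype elsewhere in the line.

```python
import re
from collections import defaultdict

# Made-up GTF-style transcript records; IDs and coordinates are illustrative only.
SAMPLE_GTF = "\n".join([
    'chr1\texample\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding"; transcript_biotype "protein_coding";',
    'chr1\texample\ttranscript\t100\t700\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_biotype "protein_coding"; transcript_biotype "retained_intron";',
    'chr2\texample\ttranscript\t50\t400\t.\t-\t.\tgene_id "G2"; transcript_id "T3"; gene_biotype "lincRNA"; transcript_biotype "lincRNA";',
])

ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def biotypes_per_gene(gtf_text):
    """Return gene_id -> (gene biotype, sorted distinct transcript biotypes)."""
    gene_biotype = {}
    transcript_biotypes = defaultdict(set)
    for line in gtf_text.splitlines():
        columns = line.split("\t")
        if line.startswith("#") or len(columns) < 9 or columns[2] != "transcript":
            continue  # only transcript records carry one biotype per transcript
        attributes = dict(ATTRIBUTE.findall(columns[8]))
        gene_id = attributes.get("gene_id")
        if gene_id is None:
            continue
        gene_biotype[gene_id] = attributes.get("gene_biotype")
        biotype = attributes.get("transcript_biotype")
        if biotype is not None:
            transcript_biotypes[gene_id].add(biotype)
    return {
        gene_id: (gene_biotype[gene_id], sorted(transcript_biotypes[gene_id]))
        for gene_id in gene_biotype
    }


for gene_id, (gene_type, transcript_types) in biotypes_per_gene(SAMPLE_GTF).items():
    print(gene_id, gene_type, transcript_types)
```

In the sample, gene `G1` is protein-coding at the gene level but has one coding and one non-coding transcript, which is the situation described above.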

Emily
12-03-2013, 02:04 AM   #7
Giulietta EnsemblHelpdesk
Junior Member
Location: UK
Join Date: Dec 2010
Posts: 4

Hello sp144,

To add to Emily's message, yes you can lift over coordinates of the new annotation to the older assembly. Ensembl provides an assembly converter tool for this:

http://www.ensembl.org/info/docs/tools/index.html
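
For readers unfamiliar with what a coordinate lift does, here is a toy Python sketch of the idea. It is not the Ensembl assembly converter and makes strong simplifying assumptions: the aligned blocks are invented numbers, everything is on one chromosome and one strand, and a feature is lifted only if it fits entirely inside one ungapped block. Real tools that work from chain-style alignments (for example UCSC liftOver) also handle strand changes and partially mapped features.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    src_start: int  # 0-based start on the source assembly
    dst_start: int  # 0-based start on the target assembly
    size: int       # length of the ungapped aligned block


# Hypothetical aligned blocks for a single chromosome (made-up numbers).
BLOCKS = [Block(0, 0, 1000), Block(1000, 1050, 500), Block(1600, 1550, 400)]


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map one source position to the target assembly, or None if unaligned."""
    for block in blocks:
        if block.src_start <= pos < block.src_start + block.size:
            return block.dst_start + (pos - block.src_start)
    return None  # position falls in a gap between blocks


def lift_interval(start: int, end: int, blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Map a half-open interval [start, end) if it lies within a single block."""
    for block in blocks:
        if block.src_start <= start and end <= block.src_start + block.size:
            offset = block.dst_start - block.src_start
            return (start + offset, end + offset)
    return None  # spans a gap or a block boundary; this toy version gives up


print(lift_position(1200, BLOCKS))
print(lift_interval(1100, 1300, BLOCKS))
print(lift_interval(900, 1100, BLOCKS))
```

Features that return `None` here are the ones a real converter would report as unmapped or split, so some loss of annotation when lifting between assemblies should be expected.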

By the way, if you have a list of genes which are not in the most current Ensembl database, we'd like you to send those along to Vega/Havana. They manually annotate genes, which we then merge into our geneset generated by automatic annotation. The contact email is in the link:

http://www.sanger.ac.uk/resources/databases/vega/

Best wishes,
Giulietta

Last edited by Giulietta EnsemblHelpdesk; 12-03-2013 at 02:04 AM. Reason: forgot link
12-04-2013, 09:47 AM   #8
sp144
Junior Member
Location: Boston, MA
Join Date: Oct 2009
Posts: 7

Thank you Giulietta and Emily,

I took a look at the new assembly, but sadly Ipw is not annotated at all and Kcnq1ot1 is incorrectly annotated as being on the forward strand and consisting of 5 exons. It's actually on the reverse strand and consists of a single exon. I also don't understand why Kcnq1ot1 is capitalized in the GTF.

I'm surprised, given that these genes have long been in RefSeq and UCSC. I'll contact the Vega/Havana people, but I'm guessing these won't be updated until the next Ensembl release. When do you think it will come out? Thank you!
12-05-2013, 01:42 AM   #9
Giulietta
Junior Member
Location: UK
Join Date: Nov 2010
Posts: 8

I find Kcnq1ot1 in mouse on the forward strand, consisting of a single exon:

http://www.ensembl.org/Mus_musculus/...UST00000183938

Are we looking at the same gene?
12-05-2013, 01:44 AM   #10
Emily_Ensembl
Member
Location: Cambridge UK
Join Date: Dec 2013
Posts: 12

I find four of them, KCNQ1OT1_1, KCNQ1OT1_2, KCNQ1OT1_3 and KCNQ1OT1_5, all neighbours on the forward strand.

http://www.ensembl.org/Mus_musculus/...UST00000183763
12-05-2013, 09:03 AM   #11
sp144
Junior Member
Location: Boston, MA
Join Date: Oct 2009
Posts: 7

Yes, thank you Emily, in the GTF there are 4 entries, neighbors on the forward strand. But in RefSeq and UCSC there is a single 1-exon transcript on the reverse strand, hence the name: Kcnq1 "opposite transcript" 1 = Kcnq1ot1.
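
A quick way to make this kind of comparison reproducible is to summarise strand and exon count per transcript directly from each GTF. The Python sketch below is only an illustration: the records use placeholder gene names and invented coordinates, and the `gene_name` and `transcript_id` attribute keys are assumed to be present. Matching is case-insensitive and by prefix, so a query would pick up both a mixed-case symbol and upper-case entries with numeric suffixes.

```python
import re
from collections import defaultdict

# Made-up GTF-style exon records with placeholder gene names (not real annotation).
SAMPLE_GTF = "\n".join([
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "EXAMPLE_1";',
    'chr1\texample\texon\t500\t600\t.\t+\t.\tgene_id "G2"; transcript_id "T2"; gene_name "EXAMPLE_2";',
    'chr2\texample\texon\t100\t200\t.\t-\t.\tgene_id "G3"; transcript_id "T3"; gene_name "Other";',
    'chr2\texample\texon\t300\t400\t.\t-\t.\tgene_id "G3"; transcript_id "T3"; gene_name "Other";',
])

ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def strand_and_exons(gtf_text, query):
    """Return {(gene_name, transcript_id): (strand, exon_count)} for matching names."""
    strands = {}
    exon_counts = defaultdict(int)
    for line in gtf_text.splitlines():
        columns = line.split("\t")
        if line.startswith("#") or len(columns) < 9 or columns[2] != "exon":
            continue  # count exon records only
        attributes = dict(ATTRIBUTE.findall(columns[8]))
        name = attributes.get("gene_name", "")
        if not name.lower().startswith(query.lower()):
            continue  # case-insensitive prefix match on the gene name
        key = (name, attributes.get("transcript_id"))
        strands[key] = columns[6]  # 7th column holds the strand
        exon_counts[key] += 1
    return {key: (strands[key], exon_counts[key]) for key in strands}


for key, value in strand_and_exons(SAMPLE_GTF, "example").items():
    print(key, value)
```

Running the same function on the Ensembl, RefSeq and UCSC GTFs with the same query would show side by side whether they disagree on strand or exon structure.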

I emailed the VEGA group, but no response yet.
Thank you.
12-09-2013, 08:37 AM   #12
afrankish
Junior Member
Location: UK
Join Date: Dec 2013
Posts: 1

Hi

I'm from the HAVANA group at Sanger and although I haven't yet received your email via Vega, I was alerted to this thread via the Ensembl team.

Neither Ipw nor Kcnq1ot1 had been manually annotated, but this is not entirely surprising as we have only just started genome-wide manual annotation of non-coding loci in mouse.

I have had the annotation for these loci updated and both will appear in future releases of GENCODE/Ensembl. Just to clarify, the GENCODE and Ensembl genesets are identical and released in sync: for human and mouse, Ensembl displays the GENCODE geneset, which is created by merging manual gene annotation with Ensembl gene predictions. This is well established for human. The Ensembl geneset for mouse has been created in the same way for several years, but the separate release of GENCODE gene annotation for mouse is more limited (GENCODE M1 = Ensembl 65 and GENCODE M2 = Ensembl 74).

Updates to annotation can take some time to appear in new releases of GENCODE/Ensembl. However, it is possible to see updated manual annotation (which will be included in future releases) via the Vega browser: click through 'Configure this page' and then click on the 'Havana update' box in the 'Genes and transcripts' section. This track is updated approximately fortnightly.

I hope this is useful

Last edited by afrankish; 12-09-2013 at 08:50 AM.
12-09-2013, 02:16 PM   #13
sp144
Junior Member
Location: Boston, MA
Join Date: Oct 2009
Posts: 7

Thank you, afrankish. I'm mostly looking for a GTF annotation that includes the very useful Ensembl biotype categories yet captures RefSeq and UCSC gene entries missing from Ensembl. I'm sure updating these transcript annotations is challenging, as they represent a moving target with increasing sequencing depth. I just wish there were a mechanism to "fast-track" entries from other major databases for annotation. Both of these genes have been in RefSeq and UCSC for quite some time.

I will keep an eye out for the next Ensembl release. Thank you to everyone for contributing to this post. It was my first on SEQanswers and I'm impressed that you VEGA and Ensembl folks responded so quickly. Thank you!
12-10-2013, 12:53 AM   #14
Giulietta
Junior Member
Location: UK
Join Date: Nov 2010
Posts: 8

Hi sp144,

Just to clarify, in the Ensembl pipeline Kcnq1ot1 has been annotated from RFAM, which has four separate entries:

RF01946 KCNQ1OT1_1 KCNQ1 overlapping transcript 1 conserved region 1
RF01947 KCNQ1OT1_2 KCNQ1 overlapping transcript 1 conserved region 2
RF01948 KCNQ1OT1_3 KCNQ1 overlapping transcript 1 conserved region 3
RF01950 KCNQ1OT1_5 KCNQ1 overlapping transcript 1 conserved region 5

The Ensembl pipeline's strength is very much on coding sequences, and we prefer to receive annotation on ncRNAs from Havana (who manually annotate the genome). As afrankish points out, we merge the Havana manual annotation into the transcript set from the Ensembl automatic pipelines to create the GENCODE set.

We hope to have this annotation for you in the future.

Tags
annotation, cufflinks, ensembl, refseq, ucsc

All times are GMT -8.