  • entrez ID conversion

    Hello,

    does anyone know how to convert Entrez Gene IDs to either RefSeq IDs or gene symbols?
    I have found resources on RefSeq-to-gene-symbol conversion, but I can't find anything on Entrez IDs.
    The genome I work with is C. elegans.
    Thanks in advance for any suggestions.

  • #2
    Try UniProt's online conversion service: http://www.uniprot.org -> "ID Mapping" tab



    • #3
      NCBI maintains flat files of gene annotations that contain the information you're after:
      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
      [ There are other interesting files in that directory ]


      The tax_id (taxonomy ID) for C. elegans is 6239 [ from the Taxonomy browser http://www.ncbi.nlm.nih.gov/taxonomy ]

      You can type : "wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz" from the command line, or download via a browser.

      Example using this data :
      bash-3.00$ cat gene2refseq | awk '{if ($1==6239) print $0}' | head
      6239 171590 REVIEWED NM_058260.3 193203640 NP_490660.1 17510631 NC_003279.6 193203938 4123 10231 - -
      6239 171591 REVIEWED NM_058259.3 193203639 NP_490661.1 17510629 NC_003279.6 193203938 11498 16830 + -
      6239 171592 REVIEWED NM_058261.3 133902001 NP_490662.1 17510633 NC_003279.6 193203938 17496 26780 - -
      6239 171592 REVIEWED NM_058262.3 86561628 NP_490663.1 17510635 NC_003279.6 193203938 17496 26780 - -
      6239 171593 REVIEWED NM_058263.3 115533565 NP_490664.2 115533566 NC_003279.6 193203938 27594 32481 - -
      6239 171594 REVIEWED NM_058265.3 71995026 NP_490666.2 25143331 NC_003279.6 193203938 49918 54359 + -
      6239 171595 REVIEWED NM_058267.4 115533567 NP_490668.4 115533568 NC_003279.6 193203938 55315 64020 - -
      6239 171597 REVIEWED NM_058269.2 71995034 NP_490670.1 17510145 NC_003279.6 193203938 85044 86283 - -
      6239 171599 REVIEWED NM_058271.6 212645149 NP_490672.2 25143337 NC_003279.6 193203938 93030 94880 + -
      6239 171600 REVIEWED NM_058272.4 212645150 NP_490673.1 17510147 NC_003279.6 193203938 96478 100612 - -
      -bash-3.00$ cat gene_info | grep 171590 | awk '{if ($1==6239) print $0}'
      6239 171590 Y74C9A.3 Y74C9A.3 - WormBase:WBGene00022277 I - hypothetical protein protein-coding - - - - 20101017
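The two lookups above can be combined into a single pass. The following is only a minimal sketch: the function name `entrez_to_refseq` is hypothetical, and it assumes uncompressed, tab-separated copies of gene_info and gene2refseq with the column order shown in the example output (tax_id, GeneID, then Symbol in column 3 of gene_info; RefSeq mRNA in column 4 and RefSeq protein in column 6 of gene2refseq).

```shell
# Print "GeneID <TAB> Symbol <TAB> RefSeq mRNA <TAB> RefSeq protein" for one taxon.
# Arguments: tax_id, path to gene_info, path to gene2refseq (both uncompressed).
# Assumption: column order matches the example output shown above.
entrez_to_refseq() {
    tax_id="$1"; gene_info="$2"; gene2refseq="$3"
    awk -F'\t' -v OFS='\t' -v tax="$tax_id" '
        # First file (gene_info): remember the symbol for each GeneID of this taxon.
        NR == FNR { if ($1 == tax) symbol[$2] = $3; next }
        # Second file (gene2refseq): emit one row per RefSeq record of this taxon.
        $1 == tax { print $2, (($2 in symbol) ? symbol[$2] : "-"), $4, $6 }
    ' "$gene_info" "$gene2refseq"
}
```

Usage would be along the lines of `entrez_to_refseq 6239 gene_info gene2refseq | head`. Splitting on tabs rather than on whitespace matters for gene_info, because its description column contains spaces.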



      • #4
        DAVID has a Gene ID Conversion tool:



        Fuad



        • #5
          The Bioconductor package "biomaRt" can also do it.



          • #6
            In Bioconductor, just use the following code:

            > library(org.Hs.eg.db)
            > library(annotate)
            > lookUp('3815', 'org.Hs.eg', 'SYMBOL')
            $`3815`
            [1] "KIT"

            > lookUp('3815', 'org.Hs.eg', 'REFSEQ')
            $`3815`
            [1] "NM_000222" "NM_001093772" "NP_000213" "NP_001087241"



            • #7
              You can also do ID conversion using BioMart at EBI.



              • #8
                Always a fan of the Linux one-liner, here is an example for the human ACTB gene using hg18:

                mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e "select k2ll.value as entrezGeneId, kx.refseq as refseqMrna, kx.geneSymbol as entrezGeneSymbol, kx.description as entrezGeneDesc from kgXref kx, knownToLocusLink k2ll where k2ll.name=kx.kgID and kx.geneSymbol='ACTB';"
                UCSC's C. elegans tables don't include the knownGene and kg% tables, but some poking around ( using "show tables like '%locus%';" ) led me to formulate this MySQL query that takes a locusLinkId as input and prints the gene symbol, RefSeq mRNA, description, etc.

                mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D ce6 -e "select rl.locusLinkId, rl.name as geneName, rl.product as geneDescription, rl.mrnaAcc as refseqMrna, rl.protAcc as refseqProt from refLink rl where rl.locusLinkId=174288;"
                The bummer is that you have to tell it to use "ce6": it isn't generic enough to sniff out which organism and assembly version to use a priori. But you'll know which one to use, right? :-) And you can of course change the "=174288" to "IN (174288, 174289, 174290)" for more of a bulk-input experience, depending upon what you need. If you end up batch-scripting some gene ID conversions, I'd definitely use the "IN" clause instead of querying them one by one. Markedly faster.
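For batch use, the "IN" clause can be assembled from a list of IDs. This is a rough sketch only: `build_reflink_query` is a hypothetical helper name, and it does nothing more than print the same refLink query as above with the IDs filled in.

```shell
# Build one refLink query for a whole batch of Entrez Gene IDs (locusLinkId).
# Arguments: one or more numeric gene IDs. Prints the SQL text on stdout.
build_reflink_query() {
    ids=""
    for id in "$@"; do
        # Accept digits only, so nothing unexpected ends up in the SQL text.
        case "$id" in
            ''|*[!0-9]*) echo "not a numeric gene ID: $id" >&2; return 1 ;;
        esac
        # Append with a ", " separator, except before the first ID.
        ids="${ids:+$ids, }$id"
    done
    [ -n "$ids" ] || { echo "no gene IDs given" >&2; return 1; }
    printf 'select rl.locusLinkId, rl.name as geneName, rl.product as geneDescription, rl.mrnaAcc as refseqMrna, rl.protAcc as refseqProt from refLink rl where rl.locusLinkId IN (%s);\n' "$ids"
}
```

It would then be passed to the same command as before, e.g. `mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D ce6 -e "$(build_reflink_query 174288 174289 174290)"`, so the whole batch goes out as one query.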

                DAVID is in theory a great resource, but could be opened up to increase the API limits, or to allow direct data downloads.



                • #9
                  Thank you all guys



                  • #10
                    How to do the opposite?

                    Originally posted by peachgil
                    In Bioconductor, just use the following code:

                    > library(org.Hs.eg.db)
                    > library(annotate)
                    > lookUp('3815', 'org.Hs.eg', 'SYMBOL')
                    $`3815`
                    [1] "KIT"

                    > lookUp('3815', 'org.Hs.eg', 'REFSEQ')
                    $`3815`
                    [1] "NM_000222" "NM_001093772" "NP_000213" "NP_001087241"
                    I have a set of HGNC gene symbols, and I want to convert them to Entrez Gene IDs.

                    Thanks much!
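One possible way to go in the reverse direction, sketched with the NCBI gene_info file mentioned earlier in the thread rather than with Bioconductor. The function name `symbols_to_entrez` is hypothetical. The sketch assumes an uncompressed, tab-separated gene_info file, matches only the official Symbol column (column 3) and not the Synonyms column, is case-sensitive, and assumes the symbols you have agree with the ones NCBI lists. The human tax_id is 9606.

```shell
# Print "Symbol <TAB> GeneID" for each requested symbol found in one taxon.
# Arguments: tax_id, path to an uncompressed gene_info file, then the symbols.
# Assumption: Symbol is column 3 and GeneID is column 2, as in the thread's example.
symbols_to_entrez() {
    tax_id="$1"; gene_info="$2"; shift 2
    # Feed the symbols to awk on stdin (one per line) as its first input.
    printf '%s\n' "$@" | awk -F'\t' -v OFS='\t' -v tax="$tax_id" '
        # First input (stdin): collect the symbols we are looking for.
        NR == FNR { wanted[$0] = 1; next }
        # Second input (gene_info): report matching symbols for this taxon.
        $1 == tax && ($3 in wanted) { print $3, $2 }
    ' - "$gene_info"
}
```

For example, `symbols_to_entrez 9606 gene_info KIT ACTB`. Rows come out in the order they appear in gene_info, not in the order the symbols were given, and symbols with no match are silently absent from the output.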
