
  • mapping pathways to refseq IDs

    Hi,

    I have several annotated bacterial genomes and would like to map pathway information to the coding sequences in each genome. In the past I used blast2go to query KEGG, but I no longer have access to it. So I've been looking at free command-line programs, mostly R-based: ReactomePA and KEGGREST. The MetaCyc tools look good, but I don't think they have a command-line option. However, KEGGREST and ReactomePA require specific accessions as input (usually an Entrez Gene ID), and the only accessions present in my PGAAP-annotated file are RefSeq IDs (and a few SwissProt IDs). I've used several programs (e.g., MyGene.Info in Bioconductor) to convert the RefSeq IDs to Gene IDs, and found that most of the RefSeq IDs do not map to any Gene ID. So, how can I get pathway information for these sequences?

    Thanks!

  • #2
    There is an interesting file at NCBI that provides cross-mapping of various IDs: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

    Take a look to see if that has mappings you can use. BTW: What exactly do you mean by Gene IDs?



    • #3
      Hi,

      Thanks for the reply. I had actually looked at the gene2refseq file from the same FTP site earlier. I searched it for a couple of the RefSeq IDs of interest and did not find them. I just tried the gene2accession file and got the same result. Entrez Gene IDs are the IDs associated with the NCBI (Entrez) Gene database. I'm not exactly sure how the database is built, but it's highly curated in some way, so I think the issue here is that there simply isn't a match for every RefSeq ID in it. I also thought about BLASTing my CDSs against the Gene database to get a Gene ID where possible, but it seems like I should somehow be able to get pathway information from my NCBI-annotated file?

      Thanks,
      Cary



      • #4
        The gene2accession file is regenerated each day and should cover all the data in GenBank. Can you share the accession numbers you are looking at?



        • #5
          Dear All,

          The information here sounds interesting. I would also like to investigate pathways in different species. I have some candidate genes and their corresponding sequences in different species. I have heard that people use blast2go to get the GO information for candidate genes and then build the pathway using KEGG, which has already collected information about which gene is involved in which pathway. Does that mean the pathway comparison among the candidate genes can be done in blast2go with the KEGG database? Or is there other software that can do this?

          Also, is KEGG the best database for doing this? Is there any other database that also includes broad information such as Gene Ontology, functional experiment results, etc.?

          Thanks in advance!

          Best,

          Sadiexiaoyu
          Last edited by sadiexiaoyu; 07-07-2015, 01:10 AM.



          • #6
            Hi,

            Here are 2 of the accession numbers I looked for using grep:
            WP_014091756.1
            WP_010958694.1
            It would be great if you could double-check me; maybe I am missing something here.

            Carp



            • #7
              Those accession numbers refer to "RefSeq non-redundant proteins", a new record type introduced in 2013 (http://www.ncbi.nlm.nih.gov/refseq/a...ndantproteins/). These records don't point to a specific gene; the closest you are going to get is the protein clusters record.

              The only way I see of pulling information for those WP_* IDs is to use the blastdbcmd utility with the nr BLAST database (adjust -outfmt as appropriate).

              Code:
              $ blastdbcmd -entry WP_014091756.1 -db /path_to/nr -outfmt '%a,%t'

              There are a couple of other free options for KEGG in this thread: http://seqanswers.com/forums/showthread.php?p=158023
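
For more than a handful of WP_* IDs, the same blastdbcmd lookup can be run once over a file of IDs with the `-entry_batch` option instead of one `-entry` call per ID. A rough sketch follows; the wrapper name `fetch_titles` and the file names are placeholders, and it assumes blastdbcmd is on your PATH and a local nr database is available.

```shell
# Sketch: fetch accession and title for many IDs in one blastdbcmd call.
# fetch_titles is a hypothetical helper name, not part of BLAST+.
fetch_titles() {
  db=$1    # path to the BLAST database, e.g. /path_to/nr
  ids=$2   # plain-text file, one accession per line
  clean=$(mktemp) || return 1
  # Drop blank lines and duplicate IDs before querying the database.
  awk 'NF && !seen[$1]++ { print $1 }' "$ids" > "$clean"
  blastdbcmd -db "$db" -entry_batch "$clean" -outfmt '%a,%t'
  status=$?
  rm -f "$clean"
  return "$status"
}
```

For example, `fetch_titles /path_to/nr wp_ids.txt > wp_titles.csv` would write one accession,title line per sequence record that blastdbcmd finds for the listed IDs.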
              Last edited by GenoMax; 07-07-2015, 09:24 AM.



              • #8
                Hi GenoMax,

                Thanks very much for the info. Helpful. I haven't used blastdbcmd before, and was just reading about it in the user manual. Could you explain a little bit about the output you might expect from this search?

                Thanks,
                Carp



                • #9
                  Code:
                  $ blastdbcmd -entry WP_014091756.1 -db /real_path_to/nr -outfmt '%a,%t'
                  Gives you:

                  WP_014091756.1,hypothetical protein[Listeria ivanovii]
                  CBW84678.1,Putative transcription repressor of class III stress genes (CtsR)[Listeria ivanovii subsp. ivanovii PAM 55]
                  AHI54813.1,CtsR family transcriptional regulator[Listeria ivanovii WSLC3009]
                  AIS64276.1,CtsR family transcriptional regulator[Listeria ivanovii subsp. ivanovii]
                  Not something you can use directly (at least that is my guess), but it at least tells you that this is the CtsR gene.
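
If you want to pull the informative name out of that output automatically, one option is to drop the organism tag and the "hypothetical protein" entries, then count the titles that remain. This is just a sketch under the assumption that each line looks like the output above (accession, a comma, then the title with a trailing [organism]); `summarize_titles` is a made-up helper name.

```shell
# Sketch: tally the informative titles in blastdbcmd '%a,%t' output
# for a single entry. Reads "accession,title[organism]" lines on stdin.
summarize_titles() {
  awk '
    {
      i = index($0, ",")                 # split on the first comma only
      if (i == 0) next
      title = substr($0, i + 1)
      sub(/ *\[[^]]*\]$/, "", title)     # drop the trailing [organism] tag
      if (tolower(title) ~ /hypothetical protein/) next
      count[title]++
    }
    END { for (t in count) print count[t] "\t" t }
  ' | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Piping the four lines above through `summarize_titles` should leave two rows, with "CtsR family transcriptional regulator" (seen twice) ranked ahead of the "Putative transcription repressor ..." title (seen once).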



                  • #10
                    Thank you so much, that helps!

