  • NCBI Reference Sequence ID to refseq accession

    Hi,
    Given an NCBI Reference Sequence protein ID (starting with YP_), how can I automatically get the RefSeq genome accession (starting with NC_)? I need to do the matching for many sequences, so going through the NCBI website is not an option.

    Regards,

    Carol

  • #2
    There may be another way of doing this. One solution:

    Do YP accessions refer to bacterial sequences? You can get corresponding "gi" ID's from the "faa" files here: ftp://ftp.ncbi.nih.gov/refseq/release/bacteria/

    The gi ID's can then be mapped to the NC* from *genomic* files in the same directory.


    • #3
      The grande flat text file "gene2accession" from NCBI has this information.

      There are many other interesting files in the directory of this file ( ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ . ) and they are updated frequently.
      There is a README file which helps explain the data thereabouts ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/README

      The URL for gene2accession is ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

      Command to get it is : wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

      or use a browser.

      Be sure to "gzip -d filename" to ungzip the file

      _____


      The "YP" is protein_accession.version in column 6 and the "NC" is genomic_nucleotide_accession.version in column 8 (counting from 1, as in the header shown next).


      the gory details ...

      The header is this ...

      -bash-4.1$ head -1 gene2accession
      #Format: tax_id GeneID status RNA_nucleotide_accession.version RNA_nucleotide_gi protein_accession.version protein_gi genomic_nucleotide_accession.version genomic_nucleotide_gi start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation assembly mature_peptide_accession.version mature_peptide_gi Symbol (tab is used as a separator, pound sign - start of a comment)

      "YPs" look like this ...
      -bash-4.1$ grep YP_ gene2accession | head
      9 8655732 PROVISIONAL - - YP_003329478.1 270208711 NC_013549.1 270208709 1111 2502 + - - - leuC
      9 8655733 PROVISIONAL - - YP_003329479.1 270208712 NC_013549.1 270208709 2560 3162 + - - - leuD
      9 8655734 PROVISIONAL - - YP_003329480.1 270208713 NC_013549.1 270208709 3488 5035 + - - - leuA
      9 8655735 PROVISIONAL - - YP_003329481.1 270208714 NC_013549.1 270208709 5466 6209 + - - - repA
      9 8655736 PROVISIONAL - - YP_003329477.1 270208710 NC_013549.1 270208709 14 1111 + - - - leuB
      9 20468915 PROVISIONAL - - YP_009062868.1 690387890 NC_025017.1 690387888 2298 2882 + - - - trpG
      9 20468916 PROVISIONAL - - YP_009062867.1 690387889 NC_025017.1 690387888 0 1580 + - - - trpE
      33 5961931 PROVISIONAL - - YP_001691218.1 169302958 NC_010372.1 169302939 15822 16589 - - - - pMF1.19c
      33 5961932 PROVISIONAL - - YP_001691211.1 169302951 NC_010372.1 169302939 10004 11044 + - - - pMF1.12
      33 5961933 PROVISIONAL - - YP_001691221.1 169302961 NC_010372.1 169302939 17650 18333 + - - - pMF1.22
      Last edited by Richard Finney; 12-08-2014, 02:12 PM.
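A minimal Python sketch of the column 6 to column 8 lookup described in this post. The function name and the `genomic_prefix` parameter are hypothetical; the only things taken from the file are the tab separator, the `#` comment marker and the column positions from the header. If a protein appears on several rows, the first NC_ accession seen is kept, which is an assumption and may not suit every use.

```python
def protein_to_genome(lines, genomic_prefix="NC_"):
    """Map protein accession (column 6) to genomic accession (column 8).

    `lines` is any iterable of gene2accession-style text lines.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#"):          # header / comment line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:               # malformed or truncated row
            continue
        protein, genomic = fields[5], fields[7]
        if protein == "-" or not genomic.startswith(genomic_prefix):
            continue
        # Keep the first genomic accession seen for each protein.
        mapping.setdefault(protein, genomic)
    return mapping


# Two rows copied from the grep output above, re-joined with tabs.
thread_rows = [
    "9 8655732 PROVISIONAL - - YP_003329478.1 270208711 NC_013549.1 270208709 1111 2502 + - - - leuC",
    "33 5961931 PROVISIONAL - - YP_001691218.1 169302958 NC_010372.1 169302939 15822 16589 - - - - pMF1.19c",
]
sample = ["# header line\n"] + ["\t".join(r.split()) + "\n" for r in thread_rows]

mapping = protein_to_genome(sample)
print(mapping["YP_003329478.1"])
```

With the real download, the standard library's `gzip.open("gene2accession.gz", "rt")` returns a text file object that can be passed in directly as `lines`, so the file does not have to be ungzipped first.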


      • #4
        Thanks for sharing that Richard. Learned something new.

        Is this file continually updated?


        • #5
          Theoretically these files are regenerated daily, though sometimes the actual contents don't change.

          Using a little script-fu you can do things like create a GO term counts file for a set of gene inputs, just to get some bearings. There are ENSEMBL to gene Ref/HUGO lookups too, which come in handy when dealing with "European oriented" software like DESeq2.
          Not that there's anything wrong with using the default DESeq annotation files.
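The GO term counting mentioned above could look roughly like this in Python. This is only a sketch: it assumes a tab-separated annotation file (such as gene2go from the same directory) with the GeneID in column 2 and the GO term text in column 6. Those column positions are an assumption here, so check them against the README. The function name and the toy rows are made up for illustration.

```python
from collections import Counter


def go_term_counts(lines, gene_ids, gene_col=1, term_col=5):
    """Count how many of the given genes carry each GO term.

    gene_col and term_col are 0-based column indexes (assumed layout).
    A gene is counted once per term, even if it is listed several times.
    """
    wanted = {str(g) for g in gene_ids}
    seen = set()
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gene_col, term_col):
            continue
        gene, term = fields[gene_col], fields[term_col]
        if gene in wanted and (gene, term) not in seen:
            seen.add((gene, term))
            counts[term] += 1
    return counts


# Toy rows, NOT real NCBI records.
toy_go = [
    "# toy header\n",
    "0\t101\tGO:toyA\tIEA\t-\tterm A\t-\tProcess\n",
    "0\t101\tGO:toyA\tIDA\t-\tterm A\t-\tProcess\n",
    "0\t102\tGO:toyA\tIEA\t-\tterm A\t-\tProcess\n",
    "0\t102\tGO:toyB\tIEA\t-\tterm B\t-\tFunction\n",
    "0\t103\tGO:toyB\tIEA\t-\tterm B\t-\tFunction\n",
]

go_counts = go_term_counts(toy_go, [101, 102])
for term, n in go_counts.most_common():
    print(term, n)
```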


          • #6
            very nice and practical.

            Can I grep a protein ID against this gene2accession file? Won't grep pull out more than one protein ID if several share the same pattern, e.g. IDs ending in 1, 10, 100, etc.?

            Many thx


            • #7
              Correct. Grepping is a problem unless the desired string match is unique.

              Rolling your own "match lines with items in this string set with items in that column" is a rite of passage in the business.

              Whether you can most easily do this in python/perl/java/c or a bash script using standard utils is an open question.
              Last edited by Richard Finney; 12-09-2014, 08:36 AM.
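One possible Python version of "match lines with items in this string set with items in that column". It compares whole fields rather than substrings, so an ID ending in 1 cannot pull in IDs ending in 10 or 100. The function name and the toy rows are hypothetical; only the tab separator and the position of the protein accession (column 6, index 5) come from the file header shown earlier.

```python
def match_rows(lines, wanted, column=5):
    """Yield the split fields of rows whose `column` value is exactly in `wanted`.

    `column` is 0-based; 5 is the protein accession in gene2accession.
    """
    wanted = set(wanted)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column] in wanted:
            yield fields


# Toy rows, NOT real NCBI records: three IDs that share a prefix.
toy = [
    "# toy header\n",
    "0\t1\t-\t-\t-\tYP_1\t-\tNC_A\n",
    "0\t2\t-\t-\t-\tYP_10\t-\tNC_B\n",
    "0\t3\t-\t-\t-\tYP_100\t-\tNC_C\n",
]

exact_hits = [(f[5], f[7]) for f in match_rows(toy, {"YP_1"})]
substring_hits = [line for line in toy if "YP_1" in line]

print(exact_hits)            # whole-field comparison
print(len(substring_hits))   # naive substring search, grep-style
```

Because the wanted IDs sit in a set, each row costs one lookup regardless of how many IDs are being searched for, which matters on a file of this size.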


              • #8
                DESeq is database agnostic. Although I like "European oriented"

                e.g. in our demo data package, airway,



                ...just replace this line:

                Code:
                txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
                with

                Code:
                library(TxDb.Hsapiens.UCSC.hg19.knownGene)
                txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene


                • #9
                  Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?


                  • #10
                    Originally posted by carolW
                    Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?
                    Not sure, but you can get this information with Entrez Direct, e.g. for the two proteins with IDs 195954015 and 553836951, the query would be:


                    Code:
                    efetch -db protein -id 195954015,553836951 -format docsum | xtract -element Slen | tr "\t" "\n" 
                    225
                    74
                    With nucleotides, db would be "nuccore"..
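For a longer list of IDs, the same efetch call can be generated in batches. The Python sketch below only builds the argument lists and does not contact NCBI. It uses just the flags shown in the command above; the function name and the batch size of 200 are arbitrary assumptions, not an NCBI recommendation.

```python
def efetch_commands(ids, db="protein", batch_size=200):
    """Yield one efetch argument list per batch of IDs."""
    ids = [str(i) for i in ids]
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        yield ["efetch", "-db", db, "-id", ",".join(batch), "-format", "docsum"]


commands = list(efetch_commands([195954015, 553836951], batch_size=1))
for cmd in commands:
    print(" ".join(cmd))
```

If Entrez Direct is installed, each list could be handed to `subprocess.run` and its output piped through `xtract -element Slen` as in the command above.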
                    Last edited by rhinoceros; 12-10-2014, 01:25 AM.
                    savetherhino.org


                    • #11
                      if I have a set of IDs, what would be the file to search in?


                      • #12
                        Originally posted by carolW
                        if I have a set of IDs, what would be the file to search in?
                        I don't understand your question
                        savetherhino.org


                        • #13
                          Originally posted by carolW
                          Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?
                          File Richard referred to has the genomic coordinates.

                          start position on the genomic accession:
                          position of the gene feature on the genomic accession,
                          '-' if not applicable
                          position 0-based

                          end position on the genomic accession:
                          position of the gene feature on the genomic accession,
                          '-' if not applicable
                          position 0-based
                          If you are dealing with bacterial ORF's then converting that to AA lengths should be easy.

                          Otherwise rhinoceros posted a programmatic way you can get that information directly from NCBI. You would need to iterate through your ID's.
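For the coordinate route, a small Python sketch of the conversion. It rests on assumptions that should be checked against your own data: both positions are 0-based and the end position is inclusive (under that reading every sample row in post #3 spans a multiple of three nucleotides), the feature is one uninterrupted ORF, and the span includes the stop codon. The function name and the `includes_stop` flag are hypothetical.

```python
def orf_aa_length(start, end, includes_stop=True):
    """Amino acid length of an uninterrupted ORF from gene2accession coordinates.

    start and end are 0-based, end inclusive (assumed).
    """
    nt_length = end - start + 1
    if nt_length <= 0 or nt_length % 3 != 0:
        raise ValueError(f"span of {nt_length} nt is not a whole number of codons")
    codons = nt_length // 3
    return codons - 1 if includes_stop else codons


# leuC row from post #3: start 1111, end 2502
leuC_nt = 2502 - 1111 + 1
leuC_aa = orf_aa_length(1111, 2502)
print(leuC_nt, leuC_aa)
```

Rows with '-' in the position columns, or genes with introns or frameshifts, do not fit this arithmetic. For those, the length reported by NCBI (the efetch route) is the safer source.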


                          • #14
                            Proteins whose IDs start with WP_ are not in this file, so how can I find the information for these proteins?


                            • #15
                              Originally posted by carolW
                              Proteins whose IDs start with WP_ are not in this file, so how can I find the information for these proteins?
                              ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

                              Look for files with *non_redundant* in names.

                              Perhaps Richard knows of a file where this information is in one spot.

