SEQanswers

Old 12-08-2014, 10:43 AM   #1
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103
NCBI Reference Sequence ID to refseq accession

Hi,
Starting from an NCBI Reference Sequence protein ID (starts with YP_), how can I automatically get the RefSeq genome accession ID (starts with NC_)? I need to do the matching for many sequences, so going through the NCBI website is not an option.

Regards,

Carol
Old 12-08-2014, 11:44 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

There may be another way of doing this. One solution:

Do YP accessions refer to bacterial sequences? You can get corresponding "gi" ID's from the "faa" files here: ftp://ftp.ncbi.nih.gov/refseq/release/bacteria/

The gi ID's can then be mapped to the NC* from *genomic* files in the same directory.
Old 12-08-2014, 12:56 PM   #3
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

The grande flat text file "gene2accession" from NCBI has this information.

There are many other interesting files in the same directory (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/) and they are updated frequently.
There is a README file which helps explain the data there: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/README

The URL for gene2accession is ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

Command to get it is : wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

or use a browser.

Be sure to run "gzip -d filename" to decompress the file.

_____


The "YP" is protein_accession.version in column 6 and the "NC" is genomic_nucleotide_accession.version in column 8


the gory details ...

The header is this ...

-bash-4.1$ head -1 gene2accession
#Format: tax_id GeneID status RNA_nucleotide_accession.version RNA_nucleotide_gi protein_accession.version protein_gi genomic_nucleotide_accession.version genomic_nucleotide_gi start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation assembly mature_peptide_accession.version mature_peptide_gi Symbol (tab is used as a separator, pound sign - start of a comment)

"YPs" look like this ...
-bash-4.1$ grep YP_ gene2accession | head
9 8655732 PROVISIONAL - - YP_003329478.1 270208711 NC_013549.1 270208709 1111 2502 + - - - leuC
9 8655733 PROVISIONAL - - YP_003329479.1 270208712 NC_013549.1 270208709 2560 3162 + - - - leuD
9 8655734 PROVISIONAL - - YP_003329480.1 270208713 NC_013549.1 270208709 3488 5035 + - - - leuA
9 8655735 PROVISIONAL - - YP_003329481.1 270208714 NC_013549.1 270208709 5466 6209 + - - - repA
9 8655736 PROVISIONAL - - YP_003329477.1 270208710 NC_013549.1 270208709 14 1111 + - - - leuB
9 20468915 PROVISIONAL - - YP_009062868.1 690387890 NC_025017.1 690387888 2298 2882 + - - - trpG
9 20468916 PROVISIONAL - - YP_009062867.1 690387889 NC_025017.1 690387888 0 1580 + - - - trpE
33 5961931 PROVISIONAL - - YP_001691218.1 169302958 NC_010372.1 169302939 15822 16589 - - - - pMF1.19c
33 5961932 PROVISIONAL - - YP_001691211.1 169302951 NC_010372.1 169302939 10004 11044 + - - - pMF1.12
33 5961933 PROVISIONAL - - YP_001691221.1 169302961 NC_010372.1 169302939 17650 18333 + - - - pMF1.22
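
A minimal Python sketch of the lookup described above, assuming the file is tab-separated as its header states. It reads the protein accession from column 6 and the genomic accession from column 8. The function names and the prefix filters are illustrative choices, not part of any NCBI tool.

```python
import gzip

# 0-based positions taken from the "#Format:" header line of gene2accession.
PROTEIN_COL = 5  # protein_accession.version (column 6)
GENOMIC_COL = 7  # genomic_nucleotide_accession.version (column 8)


def protein_to_genomic(lines, protein_prefix="YP_", genomic_prefix="NC_"):
    """Map each protein accession to the set of genomic accessions it appears on."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= GENOMIC_COL:
            continue  # truncated or malformed row
        protein, genomic = fields[PROTEIN_COL], fields[GENOMIC_COL]
        # Missing values are "-" and therefore never match the prefixes.
        if protein.startswith(protein_prefix) and genomic.startswith(genomic_prefix):
            mapping.setdefault(protein, set()).add(genomic)
    return mapping


def load_mapping(path="gene2accession.gz"):
    """Read the gzipped download directly (the path is an assumed local filename)."""
    with gzip.open(path, "rt") as handle:
        return protein_to_genomic(handle)
```

The file is large, so keeping only the YP_/NC_ pairs (or only the IDs you care about) helps keep memory use manageable.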

Last edited by Richard Finney; 12-08-2014 at 01:12 PM.
Old 12-08-2014, 05:02 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Thanks for sharing that Richard. Learned something new.

Is this file continually updated?
Old 12-08-2014, 05:23 PM   #5
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

Theoretically these files are regenerated daily, though sometimes the actual contents don't change.

Using a little script-fu you can do things like create a GO term counts file for a set of gene inputs, just to get some bearings. There are ENSEMBL to gene Ref/HUGO lookups too, which come in handy when dealing with "European oriented" software like DESeq2.
Not that there's anything wrong with using default DESeq annotation files.
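
As a rough illustration of the GO term counting idea, here is a Python sketch. It assumes the gene2go file from the same FTP directory is tab-separated with GeneID in column 2 and GO_ID in column 3. That layout is an assumption to check against the README, and the function name is made up for this example.

```python
from collections import Counter


def go_term_counts(gene_ids, gene2go_lines, gene_col=1, go_col=2):
    """Count how many of the given genes are annotated with each GO ID.

    gene_col and go_col are 0-based column positions (assumed layout).
    """
    wanted = {str(g) for g in gene_ids}
    seen = set()  # (gene, GO ID) pairs already counted
    counts = Counter()
    for line in gene2go_lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gene_col, go_col):
            continue
        pair = (fields[gene_col], fields[go_col])
        # Count each gene at most once per GO ID, even if listed with
        # several evidence codes.
        if pair[0] in wanted and pair not in seen:
            seen.add(pair)
            counts[pair[1]] += 1
    return counts
```

Calling `go_term_counts(my_gene_ids, open("gene2go"))` and printing `counts.most_common()` gives a quick overview of the annotation for a gene set.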
Old 12-09-2014, 01:04 AM   #6
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103

very nice and practical.

Can I grep a protein ID against this gene2accession file? Won't grep extract more than one protein ID if they share the same pattern, for example IDs that end in 1, 10, 100, etc.?

Many thx
Old 12-09-2014, 07:34 AM   #7
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

Correct. Grepping is a problem unless the desired string match is unique.

Rolling your own "match lines where that column contains an item from this string set" is a rite of passage in the business.

Whether you can most easily do this in python/perl/java/c or a bash script using standard utils is an open question.
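
For what it's worth, one way to do this in Python is to compare whole fields against a set, which sidesteps the substring problem grep has. This is only a sketch: the helper names are invented here, and the default column (the 6th, protein_accession.version) is taken from the file header.

```python
def strip_version(accession):
    """Drop the ".N" version suffix, e.g. YP_003329478.1 -> YP_003329478."""
    return accession.split(".", 1)[0]


def match_rows(wanted_ids, lines, column=5):
    """Yield tab-split rows whose accession in `column` equals a wanted ID.

    The comparison is on the whole field (minus version), so a wanted ID
    never matches a longer ID that merely starts with the same characters.
    """
    wanted = {strip_version(x.strip()) for x in wanted_ids if x.strip()}
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and strip_version(fields[column]) in wanted:
            yield fields
```

Set membership is a constant-time check per row, so the cost is one pass over the file no matter how many IDs are in the list.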

Last edited by Richard Finney; 12-09-2014 at 07:36 AM.
Old 12-09-2014, 10:41 AM   #8
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333

DESeq is database agnostic. Although I like "European oriented"

e.g. in our demo data package, airway,

http://bioconductor.org/packages/rel...oc/airway.html

...just replace this line:

Code:
txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
with

Code:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
Old 12-09-2014, 11:58 PM   #9
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103

Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?
Old 12-10-2014, 12:10 AM   #10
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372

Quote:
Originally Posted by carolW View Post
Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?
Not sure, but you can get this information with Entrez Direct, e.g. for two example proteins (given by their gi numbers), the query would be:


Code:
efetch -db protein -id 195954015,553836951 -format docsum | xtract -element Slen | tr "\t" "\n" 
225
74
With nucleotides, db would be "nuccore"..
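
For a long ID list, a small Python helper can split the IDs into batches and build one such pipeline per batch. This sketch only builds the command strings, reusing the efetch and xtract options shown above. The batch size of 200 is an arbitrary assumption, and the function name is made up.

```python
import shlex


def efetch_length_commands(ids, db="protein", batch_size=200):
    """Yield one shell pipeline per batch of IDs.

    Only the options already used in the post are emitted; nothing is run.
    """
    ids = [str(i).strip() for i in ids if str(i).strip()]
    for start in range(0, len(ids), batch_size):
        batch = ",".join(ids[start:start + batch_size])
        yield (
            f"efetch -db {shlex.quote(db)} -id {shlex.quote(batch)} "
            "-format docsum | xtract -element Slen"
        )
```

Each yielded string can be pasted into a shell or passed to `subprocess.run(cmd, shell=True)` on a machine where Entrez Direct is installed.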

Last edited by rhinoceros; 12-10-2014 at 12:25 AM.
Old 12-10-2014, 12:25 AM   #11
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103

if I have a set of IDs, what would be the file to search in?
Old 12-10-2014, 12:27 AM   #12
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372

Quote:
Originally Posted by carolW View Post
if I have a set of IDs, what would be the file to search in?
I don't understand your question
Old 12-10-2014, 05:30 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Quote:
Originally Posted by carolW View Post
Does NCBI have any file that indicates the length of aa or nt of sequences, proteomic or genomic?
File Richard referred to has the genomic coordinates.

Quote:
start position on the genomic accession:
position of the gene feature on the genomic accession,
'-' if not applicable
position 0-based

end position on the genomic accession:
position of the gene feature on the genomic accession,
'-' if not applicable
position 0-based
If you are dealing with bacterial ORFs then converting that to AA lengths should be easy.
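
A hedged Python sketch of that conversion follows. It assumes both coordinates are 0-based and inclusive, which fits the sample rows earlier in the thread (for leuC, 2502 - 1111 + 1 = 1392 nt = 464 codons). It also assumes the gene feature is a single uninterrupted ORF that includes the stop codon. Neither assumption is guaranteed for every record, so the function returns None when the span is not a whole number of codons.

```python
def orf_aa_length(start, end, includes_stop=True):
    """Estimate protein length from 0-based, inclusive gene coordinates.

    Returns None for missing coordinates ("-") or spans that are not a
    multiple of three nucleotides.
    """
    if start == "-" or end == "-":
        return None
    nt_length = int(end) - int(start) + 1
    if nt_length <= 0 or nt_length % 3 != 0:
        return None
    codons = nt_length // 3
    # Drop one codon for the stop, if the feature is assumed to include it.
    return codons - 1 if includes_stop else codons
```

Under these assumptions the leuC row gives 463 amino acids. Lengths derived this way are worth spot-checking against a few records fetched directly from NCBI.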

Otherwise rhinoceros posted a programmatic way you can get that information directly from NCBI. You would need to iterate through your ID's.
Old 01-30-2015, 12:21 AM   #14
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103

Proteins whose IDs start with WP_ are not in this file, so how can I find the info for these proteins?
Old 01-30-2015, 03:04 AM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Quote:
Originally Posted by carolW View Post
Proteins whose IDs start with WP_ are not in this file, so how can I find the info for these proteins?
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

Look for files with *non_redundant* in names.

Perhaps Richard knows of a file where this information is in one spot.
Old 02-02-2015, 02:01 PM   #16
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

WP_ records are there ...

-bash-4.1$ head -1 gene2refseq
#Format: tax_id GeneID status RNA_nucleotide_accession.version RNA_nucleotide_gi protein_accession.version protein_gi genomic_nucleotide_accession.version genomic_nucleotide_gi start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation assembly mature_peptide_accession.version mature_peptide_gi Symbol (tab is used as a separator, pound sign - start of a comment)
-bash-4.1$ grep WP_033716180 gene2refseq
526972 22940199 NA - - WP_033716180.1 727256867 NZ_CM000719.1 238801471 5658661 5659747 + - - - BCERE0007_RS28405
-bash-4.1$ ls -l gene2refseq
-rw-r--r--. 1 finneyr 1007 3066838395 Feb 2 17:52 gene2refseq

Tags
ncbi ref seq id, refseq accession
