SEQanswers

Old 04-11-2013, 01:11 AM   #1
Tsuyoshi
Member
 
Location: japan

Join Date: Sep 2012
Posts: 24
How to retrieve an organism's whole proteome from NCBI

Hi,
I have been struggling for a while to retrieve a whole-proteome dataset from NCBI. All I have is the organism's taxonomy ID (txid684364). When I use NCBI Batch Entrez (http://www.ncbi.nlm.nih.gov/protein/...anism:noexp%5D), only part of the protein dataset is downloaded to my local computer, although the same approach worked well earlier when I retrieved the proteomes of several other genomes.

Could anyone suggest a solution to this problem? I tried efetch, but I am confused by the command-line syntax. Could someone show me how to use efetch to retrieve all the proteins of this organism (http://www.ncbi.nlm.nih.gov/protein/?txid684364) as an example?
Thanks!
Old 04-11-2013, 03:06 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795

After you search with the txid on the protein page (http://www.ncbi.nlm.nih.gov/protein):

Go to the "Display Settings" drop-down and choose "FASTA" or whichever format you need.

Then go to the "Send to" drop-down on the right and choose "File" as the "Destination". Finally, click "Create File".

I can see 8706 items.
Attached images: disp.PNG, save_file.PNG
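The item count shown on the web page can also be checked programmatically. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` endpoint and the query term `txid684364[Organism:noexp]` (the term is inferred from the truncated link in post #1, so treat it as an assumption). The helper names are illustrative, not part of any NCBI library.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, db="protein"):
    """Build an esearch URL; retmax=0 asks for the hit count only."""
    params = {"db": db, "term": term, "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?" + urllib.parse.urlencode(params)

def parse_count(xml_text):
    """Extract the top-level <Count> value from an esearch XML response."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))

def protein_count(term):
    """Query NCBI (network access required) and return the hit count."""
    with urllib.request.urlopen(esearch_url(term)) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

Calling `protein_count("txid684364[Organism:noexp]")` should return the same kind of number the search page reports, which gives a target to compare any download against.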
Old 04-11-2013, 03:58 AM   #3
Tsuyoshi

Thank you, GenoMax!
I did that, but after I clicked "Create File", an essentially empty sequence.fasta file was downloaded. The only thing inside it was the sentence "Your session has expired. Please repeat your search".
Did you succeed in getting the correct FASTA file?
Old 04-11-2013, 04:36 AM   #4
GenoMax

The first time around I had not done a complete download, but after your post I did. I do get a FASTA file, but it has only ~800 sequences in it (nowhere close to the ~8700 shown on the search page).

Next I tried downloading in GenPept format. That gave me a file with 5706 matches for "LOCUS": still not 8706 items, but closer.

You may want to contact the NCBI help desk if the GenPept download is not adequate for your needs.
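The completeness check described above (counting sequences in the FASTA file and "LOCUS" lines in the GenPept file) can be sketched in a few lines of Python. The function names and the file names in the usage note are illustrative assumptions.

```python
import gzip

def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def count_records(path, prefix):
    """Count lines starting with prefix ('>' for FASTA, 'LOCUS' for GenPept)."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith(prefix))
```

For example, `count_records("sequence.fasta", ">")` gives the number of FASTA records and `count_records("sequence.gp", "LOCUS")` the number of GenPept records, either of which can be compared with the count on the search page.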
Old 04-11-2013, 04:46 AM   #5
Tsuyoshi

Yes, thanks GenoMax. I did not get the full 8706 sequences either. I am going to try other methods. Thank you again.
Old 04-11-2013, 05:11 AM   #6
d1antho
Member
 
Location: Ireland

Join Date: Mar 2012
Posts: 15

You could also access the FTP site directly: ftp://ftp.ncbi.nih.gov/genomes/

From there, go to the folder for your organism, look for the protein information folder/file, and retrieve the protein.fa file.

You can point and click your way to this page, or you can use command-line tools (on Unix/Linux or Mac) such as wget or cURL to retrieve the file.

Additionally, you could use Ensembl, either via the FTP site (ftp://ftp.ensembl.org/pub/) or via BioMart (http://www.ensembl.org/biomart/martview/).
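As an alternative to wget or cURL, the same retrieval can be scripted. This is a minimal sketch using only the Python standard library; the `download` helper is illustrative, and the path in the commented example is a placeholder, not a real location on the FTP site.

```python
import shutil
import urllib.request

def download(url, dest):
    """Stream a remote file (ftp://, http:// or https://) to a local path."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest

# Hypothetical usage; substitute the real folder for your organism:
# download("ftp://ftp.ncbi.nih.gov/genomes/<organism_folder>/protein.fa", "protein.fa")
```

Streaming with `shutil.copyfileobj` avoids loading the whole file into memory, which matters little for one proteome but helps when fetching many.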
Old 04-11-2013, 05:22 AM   #7
GenoMax

The organism Tsuyoshi is looking for (Batrachochytrium dendrobatidis JAM81) is not available on the NCBI genomes site. It appears to be a chytrid, so it may not be on the main Ensembl site either.
Old 04-11-2013, 05:27 AM   #8
GenoMax

It looks like this genome was sequenced by JGI.

You can find their protein set here: ftp://ftp.jgi-psf.org/pub/JGI_data/B...teins.fasta.gz

The parent page for this genome's data is at: http://genome.jgi-psf.org/Batde5/Bat...nload.ftp.html
Old 04-11-2013, 06:29 PM   #9
Tsuyoshi

Thank you very much, d1antho. I would like to try your method for retrieving other proteome datasets.
Old 04-11-2013, 06:47 PM   #10
Tsuyoshi

Thank you so much, GenoMax. Yes, I downloaded the Batrachochytrium dendrobatidis protein dataset from JGI. The FASTA file contains the sequences; however, each sequence header is in JGI format, which would cause problems at the BLASTP step, since I want to compare my own proteomics data against the Batrachochytrium dendrobatidis proteome.
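One way around the header problem is to rewrite the headers before running BLASTP. The sketch below assumes the JGI headers are pipe-delimited, of the form `>jgi|Batde5|12345|model_name`; that layout is an assumption, so check the first few lines of the downloaded file before relying on it. The function names are illustrative.

```python
def simplify_header(line):
    """Turn '>jgi|DB|ID|name' into '>DB_ID name'; leave other lines unchanged."""
    if not line.startswith(">jgi|"):
        return line
    fields = line[1:].rstrip("\n").split("|")
    # fields: ['jgi', database, protein ID, optional model name, ...]
    if len(fields) < 3:
        return line
    new_id = f"{fields[1]}_{fields[2]}"
    description = "|".join(fields[3:])
    newline = "\n" if line.endswith("\n") else ""
    return f">{new_id} {description}".rstrip() + newline

def rewrite_fasta(src, dest):
    """Copy a FASTA file, simplifying every JGI-style header."""
    with open(src) as fin, open(dest, "w") as fout:
        for line in fin:
            fout.write(simplify_header(line))
```

Sequence lines and headers in any other format pass through untouched, so the rewrite is safe to run on a mixed file.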

Anyway, I figured out an alternative way to retrieve the protein dataset from NCBI: take the URL (http://eutils.ncbi.nlm.nih.gov/entre...tmode=text&id=) and append the GI list to it (this method handles at most around 800 sequences per request). Pasting the resulting URL into a web browser automatically downloads the corresponding sequences in FASTA format. Although it is time-consuming, I finally got the dataset I wanted.

Thank you again for your kind reply.
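The manual procedure of pasting ID lists into an efetch URL can be automated by splitting the list into chunks. This is a minimal sketch, assuming the standard E-utilities `efetch.fcgi` endpoint with `db`, `id`, `rettype` and `retmode` parameters; the chunk size of 500 and the helper names are arbitrary choices for illustration, not values from the thread.

```python
import time
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def efetch_urls(ids, size=500, db="protein"):
    """Build one efetch URL per chunk of IDs, requesting FASTA as plain text."""
    urls = []
    for chunk in chunks(list(ids), size):
        params = {"db": db, "rettype": "fasta", "retmode": "text",
                  "id": ",".join(str(i) for i in chunk)}
        urls.append(EFETCH + "?" + urllib.parse.urlencode(params))
    return urls

def fetch_all(ids, dest, size=500):
    """Download every chunk (network access required) and write it to dest."""
    with open(dest, "w") as out:
        for url in efetch_urls(ids, size):
            with urllib.request.urlopen(url) as resp:
                out.write(resp.read().decode("utf-8"))
            time.sleep(0.4)  # pause between requests to respect NCBI rate limits
```

With the IDs in a Python list, `fetch_all(gi_list, "proteome.fasta")` would write all chunks to a single file; counting the ">" lines afterwards confirms whether every record arrived.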
Old 04-12-2013, 05:03 AM   #11
d1antho

Hi Tsuyoshi,
The Broad Institute has a genome for Batrachochytrium dendrobatidis: http://www.broadinstitute.org/annota...Downloads.html

Project and release information is here:
http://www.broadinstitute.org/annota...MultiHome.html

This is probably a day late, but I hope it helps anyway.
Old 08-10-2015, 11:11 AM   #12
padmoo
Member
 
Location: Norwich

Join Date: Jun 2015
Posts: 16

Hi everyone,
I have a similar problem. I have transcript IDs from JGI, but I need Ensembl, Entrez, or GI IDs to run an analysis with KOBAS. I would rather not search the NCBI database manually for all 13490 genes, and I was wondering if someone knows an easy way to get the matching IDs. The organism I am working with is Thalassiosira pseudonana. KEGG IDs are also available, but they too are either in JGI format or EC numbers, which KOBAS does not seem to support.

Does anyone know a neat way to solve my problem?

Thanks!
Old 08-10-2015, 11:38 AM   #13
GenoMax

If JGI has not made the mappings available, then there may be no easy way. NCBI does have a GFF file available (http://www.ncbi.nlm.nih.gov/genome/54), but you probably can't use it as is.
Old 08-10-2015, 12:05 PM   #14
padmoo

Hi GenoMax,

Thanks for the link to the NCBI GFF! I had tried to find it but was unsuccessful.

I do have a GFF file from JGI, so it shouldn't be a problem to match its entries with the NCBI file.
Old 08-10-2015, 12:14 PM   #15
GenoMax

Good. As long as you have a common "key" to anchor the two files, you should be able to map the IDs.
Old 08-10-2015, 12:18 PM   #16
padmoo

Yes, the JGI file has the chromosome coordinates too, so now I just need to figure out how to match the two files.
Thanks!
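Using coordinates as the common key could look roughly like the sketch below. It assumes both files are tab-delimited GFF with the standard nine columns, that the sequence names in column 1 agree between the JGI and NCBI files (they may need renaming first), and that matching features have identical start, end and strand. The function names are illustrative.

```python
def read_gff(path, feature="gene"):
    """Map (seqid, start, end, strand) -> attribute column for one feature type."""
    index = {}
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip comments and directives
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9 or cols[2] != feature:
                continue  # skip malformed lines and other feature types
            key = (cols[0], int(cols[3]), int(cols[4]), cols[6])
            index[key] = cols[8]
    return index

def match_by_coordinates(gff_a, gff_b, feature="gene"):
    """Pair the attribute strings of features with identical coordinates."""
    a = read_gff(gff_a, feature)
    b = read_gff(gff_b, feature)
    return {a[key]: b[key] for key in a.keys() & b.keys()}
```

The result maps each JGI attribute string to the NCBI attribute string at the same location; the IDs themselves still have to be parsed out of those strings, and genes whose boundaries differ between the two annotations will not be paired by this exact-match approach.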

Tags
batch entrez, efetch

All times are GMT -8.