  • How to retrieve an organism's whole proteome from NCBI

    Hi.
    I have been struggling for a while to retrieve the whole proteome dataset for an organism from NCBI. I only have the taxonomic ID of the organism (txid684364), and when I use NCBI's Batch Entrez (http://www.ncbi.nlm.nih.gov/protein/...anism:noexp%5D), only part of the protein dataset is downloaded to my local computer. Previously, the same approach worked well when I retrieved the proteomes of several other genomes.

    Could anyone please suggest a solution? I tried using efetch, but I am confused by the command lines. Could anyone please show me how to use efetch, taking this organism (http://www.ncbi.nlm.nih.gov/protein/?txid684364) as an example, to retrieve all of its protein sequences?
    Thanks!

  • #2
    Search with the txid on the protein page: http://www.ncbi.nlm.nih.gov/protein

    Go to the "Display Settings" drop-down and choose "FASTA" or whichever format you need.

    Then go to the "Send to" drop-down on the right and choose "File" as the "Destination". Finally, click "Create File".

    I can see 8706 items.

    • #3
      In reply to GenoMax (#2):
      Thank you, GenoMax!
      I did that, but after clicking "create file", a nearly empty sequence.fasta file was downloaded automatically. The only thing inside it was the sentence "Your session has expired. Please repeat your search".
      Did you succeed in getting the correct FASTA file?


      • #4
        In reply to Tsuyoshi (#3):
        The first time around I had not done a complete download, but after your post I did. I do get a FASTA file, but it has only ~800 sequences in it (nowhere close to the ~8700 shown on the search page).

        I next tried the GenPept format download. That got me a file with 5706 matches for "LOCUS". Still not 8706 items, but closer.

        You may want to contact the NCBI help desk if the GenPept download is not adequate for your needs.
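
The record counts mentioned above (FASTA entries, and "LOCUS" matches in a GenPept file) can be checked with a few lines of Python. This is a minimal sketch, not part of the original posts; the function name `count_records` is hypothetical, and it assumes each FASTA record starts with a `>` line and each GenPept record with a `LOCUS` line.

```python
def count_records(path):
    """Count FASTA headers ('>') and GenPept 'LOCUS' lines in a text file."""
    fasta = 0
    genpept = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                fasta += 1      # one '>' line per FASTA record
            elif line.startswith("LOCUS"):
                genpept += 1    # one LOCUS line per GenPept record
    return {"fasta": fasta, "genpept": genpept}
```

Comparing the returned count with the number of items shown on the search page is a quick way to spot a truncated download.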


        • #5
          In reply to GenoMax (#4):
          Yes, thanks GenoMax. I did not get the full 8706 sequences either. I am going to try other methods. Thank you again.


          • #6
            You could also directly access the FTP site: ftp://ftp.ncbi.nih.gov/genomes/

            From there, you can go to the folder for your organism, look for the protein information folder/file, and retrieve the protein.fa

            You can point and click through this page, or you can use command-line tools (on Unix/Linux or Mac) such as wget or cURL to retrieve the file.

            Additionally, you could use Ensembl: either the FTP site (ftp://ftp.ensembl.org/pub/) or BioMart (http://www.ensembl.org/biomart/martview/).
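
As an alternative to wget or cURL, the same kind of retrieval can be scripted. The following is a minimal sketch using only the Python standard library; the function name `download` and the organism path in the usage comment are hypothetical placeholders, not real locations on the server.

```python
import shutil
import urllib.request


def download(url, dest):
    """Stream the resource at `url` into the local file `dest`; return `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return dest


# Hypothetical usage; replace <organism_folder> with the real folder name:
# download("ftp://ftp.ncbi.nih.gov/genomes/<organism_folder>/protein.fa",
#          "protein.fa")
```

Streaming with `shutil.copyfileobj` keeps memory use low for large proteome files.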


            • #7
              In reply to d1antho (#6):
              The organism Tsuyoshi is looking for (Batrachochytrium dendrobatidis JAM81) is not available on the NCBI genomes site. It sounds like a chytrid, so it may not be on the main Ensembl site either.


              • #8
                In reply to Tsuyoshi (#5):
                It looks like this genome was sequenced by JGI.

                You can find their protein set here: ftp://ftp.jgi-psf.org/pub/JGI_data/B...teins.fasta.gz

                Parent page for the data for this genome is at: http://genome.jgi-psf.org/Batde5/Bat...nload.ftp.html


                • #9
                  In reply to d1antho (#6):
                  Thank you very much, d1antho. I would like to try your method for retrieving other proteome datasets.


                  • #10
                    In reply to GenoMax (#8):
                    Thank you so much, GenoMax. Yes, I downloaded the protein dataset of Batrachochytrium dendrobatidis from JGI. The FASTA file contains the sequences; however, each sequence header is in JGI format, which would cause problems for the BLASTP step, since I want to compare my own proteomics data against the Batrachochytrium dendrobatidis proteome.
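
If the JGI-style headers are the only obstacle, they could be rewritten before running BLASTP. The sketch below assumes headers of the form `>jgi|DATABASE|PROTEIN_ID|MODEL_NAME`; that layout is an assumption and is not confirmed by the post, and the function names are hypothetical. Check a few headers in the actual file first.

```python
def simplify_jgi_header(line):
    """Turn '>jgi|DB|ID|NAME' into '>DB_ID NAME'; return other lines unchanged."""
    if not line.startswith(">jgi|"):
        return line
    ending = "\n" if line.endswith("\n") else ""
    fields = line[1:].rstrip("\n").split("|")
    if len(fields) < 3:
        return line  # not the assumed layout, so leave it alone
    _, db, protein_id, *rest = fields
    header = f">{db}_{protein_id}"
    if rest:
        header += " " + "|".join(rest)  # keep the model name as a description
    return header + ending


def rewrite_fasta(src, dst):
    """Copy a FASTA file from `src` to `dst`, simplifying every JGI header."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(simplify_jgi_header(line))
```

The identifier stays unique as long as the protein IDs are unique within the file, and the original model name survives as the description after the first space.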

                    Anyway, I figured out an alternative method to retrieve the protein dataset from NCBI: take the URL (http://eutils.ncbi.nlm.nih.gov/entre...tmode=text&id=) and append the GI list to it (the maximum is around 800 sequences per request with this method). Paste the resulting URL into a web browser and the corresponding sequences are downloaded automatically in FASTA format. Although it is time-consuming, I finally got the dataset I wanted.

                    Thank you again for your kind reply.
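
The manual approach of pasting ID lists into a URL can be scripted by splitting the IDs into batches. This is a minimal sketch: the URL in the post is truncated, so the endpoint and parameters below are the standard E-utilities efetch ones as an assumption rather than a reconstruction of the poster's exact URL. The function name `efetch_urls` and the default batch size of 500 (chosen to stay under the roughly 800 reported above) are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_urls(ids, batch_size=500):
    """Yield one efetch URL per batch of protein IDs, requesting FASTA text."""
    ids = [str(i) for i in ids]
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        params = {
            "db": "protein",
            "rettype": "fasta",
            "retmode": "text",
            "id": ",".join(batch),
        }
        # safe="," keeps the comma-separated ID list readable in the URL
        yield f"{EFETCH}?{urlencode(params, safe=',')}"
```

Each generated URL can be opened in a browser or fetched with a script, and the resulting FASTA chunks concatenated into one file.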


                    • #11
                      Hi Tsuyoshi,
                      The Broad Institute has a genome for Batrachochytrium dendrobatidis: http://www.broadinstitute.org/annota...Downloads.html

                      Project and release information is here:


                      Probably a day late but I hope this helps anyway


                      • #12
                        Hi everyone,
                        I have a similar problem. I have transcript IDs from JGI, but I need Ensembl, Entrez, or GI IDs to run an analysis with KOBAS. I'd rather not search for all 13490 genes manually in the NCBI database and was wondering if someone knows an easy way to get the matching IDs. The organism I'm working with is Thalassiosira pseudonana. There are also KEGG IDs available, but they are likewise in JGI format or EC numbers, which KOBAS does not seem to support.

                        Does anyone know a neat way to solve my problem?

                        Thanks!


                        • #13
                          In reply to padmoo (#12):
                          If JGI has not made the mappings available then there may be no easy way. NCBI does have a GFF file available (http://www.ncbi.nlm.nih.gov/genome/54) but you probably can't use it as is.


                          • #14
                            Hi GenoMax,

                            thanks for the link to the NCBI gff! I tried to find this but was unsuccessful.

                            I do have a gff file from JGI, so it shouldn't be a problem to match those with the NCBI file.


                            • #15
                              In reply to padmoo (#14):
                              Good. As long as you have a common "key" to anchor the two files, you should be able to map the IDs.
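
One possible common key is the feature coordinates. The sketch below is hypothetical: it assumes both files are GFF3 with `key=value` attributes in column 9, that both use the same sequence names and coordinates for the same genes, and that the ID of interest is stored in an `ID` attribute. None of this is confirmed for the JGI or NCBI files discussed here, so the feature type and attribute names may need adjusting.

```python
def index_gff(lines, feature_type="gene", id_attr="ID"):
    """Map (seqid, start, end, strand) to an attribute value for one feature type."""
    index = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if id_attr in attrs:
            index[(cols[0], cols[3], cols[4], cols[6])] = attrs[id_attr]
    return index


def map_ids(gff_a, gff_b, **kwargs):
    """Pair IDs from two GFF3 inputs whose features have identical coordinates."""
    a = index_gff(gff_a, **kwargs)
    b = index_gff(gff_b, **kwargs)
    return {a[key]: b[key] for key in a.keys() & b.keys()}
```

Both arguments can be open file handles or lists of lines. Features present in only one file are dropped, so comparing the size of the result with the number of genes shows how well the chosen key works.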
