  • Problem downloading fasta sequence from Genbank

    I have to download a very large number of FASTA sequences (more than two million) from GenBank.

    As suggested in another post, I ran a query with my keyword (in my case VIRUS) to find all the viral sequences and, once I had the results, selected the following in the dropdown menu:
    - send to
    - File
    - format FASTA
    - create file

    The problem is that the download is very slow and sometimes fails partway through. I also don't know the size of the file, so I can't tell whether the download is complete.

    Is there another way to download this file?

    Thank you guys

  • #2
    Can you narrow your search to viruses of interest or are you truly looking to download *every* virus sequence known?


    • #3
      Originally posted by GenoMax
      Can you narrow your search to viruses of interest or are you truly looking to download *every* virus sequence known?
      Unfortunately, I need every known viral sequence. I have to build a kind of viral database.


      • #4
        I spent the morning trying to download it the usual way, but the average speed is around 30-40 kb/sec.

        It is not a problem with my internet connection.


        • #5
          As these things go there would be more than one way of doing this.

          Get the "nt" sequence file from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. You should be able to grep the sequences containing "virus" in the sequence name out into a separate file. I have not tested this, but it should work.

          I will post a different solution below but that would need the "nt" blast database.
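
A minimal sketch of the grep idea above, in Python rather than grep, so that whole records (header plus sequence lines) are kept and not just the matching header lines. It assumes the "nt" file has already been decompressed to plain FASTA. The function names and the default keyword list are illustrative, not part of any NCBI tool, and like the grep itself this only catches records whose header contains one of the keywords.

```python
from typing import Iterable, Iterator, Sequence


def filter_fasta_by_header(lines: Iterable[str],
                           keywords: Sequence[str] = ("virus",)) -> Iterator[str]:
    """Yield the lines of every FASTA record whose header contains a keyword.

    Matching is case-insensitive and looks only at header lines
    (those starting with '>'), never at sequence lines.
    """
    lowered = [k.lower() for k in keywords]
    keep = False
    for line in lines:
        if line.startswith(">"):
            header = line.lower()
            keep = any(k in header for k in lowered)
        if keep:
            yield line


def filter_fasta_file(src_path: str, dst_path: str,
                      keywords: Sequence[str] = ("virus",)) -> int:
    """Stream src_path to dst_path; return the number of records kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in filter_fasta_by_header(src, keywords):
            if line.startswith(">"):
                kept += 1
            dst.write(line)
    return kept
```

Something like `filter_fasta_file("nt", "virus_sequence.fasta")` would then stream the file once without loading it into memory; extra keywords can be passed as a tuple.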


          • #6
            If you have access to pre-formatted "nt" blast database then the following will work (you can get the database from this link: ftp://ftp.ncbi.nlm.nih.gov/blast/db/. There are multiple files for nt* and you will need to get all of them). You will also need the blast+ program suite from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/exe...blast+/LATEST/). It will take a while to run this command (depending on hardware you have access to).

            Code:
            $ blastdbcmd -db /path_to/nt -entry all -outfmt "%f" | grep "virus" | awk -F'|' '{print $2}' | blastdbcmd -db /path_to/nt -entry_batch - -out virus_sequence.fasta


            • #7
              Originally posted by GenoMax
              As these things go there would be more than one way of doing this.

              Get the "nt" sequence file from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. You should be able to grep out the sequences containing "virus" in sequence name into a different file. I have not tested this but should work.

              I will post a different solution below but that would need the "nt" blast database.
              Thank you for your reply.

              Actually, I ran my search using the taxonomy ID that covers viruses. I have a doubt about your grep solution: many phages have the word "phage" in the sequence name instead of "virus".

              OK, I can run two different greps, but that seems like a dirty solution. I don't actually know how many viruses have neither "virus" nor "phage" in the sequence name. Am I being too paranoid?

              Is there no other way to speed up the download from GenBank?
              Last edited by fefe89; 09-08-2014, 04:32 AM.


              • #8
                There are various caveats to the grep since as you point out you may get things that shouldn't be there and miss others you want.

                The blastdbcmd is supposed to be able to search based on the taxid but that part is not working (taxid: 10239 viruses).

                Here are the RefSeq releases for all viral/viroid sequences: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/. You would want the "fna" file.

                I don't think there is any way to speed up the NCBI download (the problem may be that you are in Europe). Have you tried getting the sequences from a European database?
                Last edited by GenoMax; 09-08-2014, 05:05 AM.


                • #9
                  Originally posted by GenoMax
                  There are various caveats to the grep since as you point out you may get things that shouldn't be there and miss others you want.

                  The blastdbcmd is supposed to be able to search based on the taxid but that part is not working (taxid: 10239 viruses).

                  Here are the RefSeq releases for all viral/viroid sequences: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/. You would want the "fna" file.
                  Thank you Max

                  Since I'm working on environmental metagenomic data, the RefSeq file might not be the best solution, because I would be assigning my sequences only to model organisms (more or less).

                  But you gave me a great idea. I checked the GenBank FTP site, in particular this directory:

                  ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/

                  If I take all the gbvrl* files (VRL = viral sequences), I should be able to create a viral protein database. What do you think? It should work...


                  • #10
                    That should work. If you want DNA sequences, then you can get all the "gbvrl*" files from here: ftp://ftp.ncbi.nih.gov/genbank/.
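
A rough sketch of fetching every "gbvrl*" file with Python's standard `ftplib`, assuming the server still allows anonymous FTP and answers the SIZE command. Comparing the server-reported size with the local size is one way to tell whether a download is complete. The function names are made up for illustration. Also note that, as far as I know, the files in that directory are compressed GenBank flat files rather than FASTA, so a conversion step may still be needed.

```python
import os
from fnmatch import fnmatchcase
from ftplib import FTP
from typing import Iterable, List


def select_division_files(names: Iterable[str], pattern: str = "gbvrl*") -> List[str]:
    """Return the names matching a shell-style pattern, sorted."""
    return sorted(n for n in names if fnmatchcase(n, pattern))


def download_division(host: str = "ftp.ncbi.nih.gov",
                      directory: str = "/genbank",
                      pattern: str = "gbvrl*") -> List[str]:
    """Download matching files into the current directory.

    Returns the names whose local size differs from the size reported
    by the server, i.e. the downloads that look incomplete.
    """
    incomplete = []
    with FTP(host) as ftp:
        ftp.login()                      # anonymous login (assumption)
        ftp.cwd(directory)
        wanted = select_division_files(ftp.nlst(), pattern)
        ftp.voidcmd("TYPE I")            # binary mode, needed for SIZE on many servers
        for name in wanted:
            expected = ftp.size(name)    # may be None if the server gives no size
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
            if expected is not None and os.path.getsize(name) != expected:
                incomplete.append(name)
    return incomplete
```

Any file returned by `download_division()` can simply be fetched again, which is cheaper than restarting one huge download.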


                    • #11
                      Originally posted by fefe89
                      Unfortunately, I need every known viral sequence. I have to build a kind of viral database.
                      You might be better off scripting this using the NCBI Entrez API (or their Entrez command line tools), see for example:
                      Back in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in Ge...
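
To give an idea of what such scripting might look like, here is a small sketch that only builds the E-utilities request URLs for a history-based search and batched fetch. It is not the script linked above. The `nuccore` database, the query term `txid10239[Organism:exp]` (10239 being the virus taxid mentioned earlier in the thread) and the batch size of 500 are assumptions for illustration; the helper names are made up.

```python
from typing import Iterator, Tuple
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "nuccore") -> str:
    """URL for a search that stores its result set on the NCBI history server."""
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def batch_offsets(total: int, batch_size: int = 500) -> Iterator[Tuple[int, int]]:
    """Yield (retstart, retmax) pairs that cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def efetch_url(web_env: str, query_key: str, retstart: int, retmax: int,
               db: str = "nuccore", rettype: str = "fasta") -> str:
    """URL for one batch of records from a stored result set."""
    params = {
        "db": db,
        "WebEnv": web_env,        # taken from the esearch response
        "query_key": query_key,   # taken from the esearch response
        "retstart": retstart,
        "retmax": retmax,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

The esearch response also reports the total hit count, which feeds `batch_offsets`. Because each batch is a separate request, a failed batch can be retried on its own instead of restarting the whole download.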


                      However, the problem of detecting a partial FASTA file remains. One advantage of downloading in GenBank format is that partial records are easy to spot (and you can convert GenBank to FASTA locally).
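
As a sketch of why GenBank format makes this easier: each flat-file record starts with a `LOCUS` line and ends with a line containing only `//`, so a truncated download shows up as a missing terminator. The function name below is made up, and the check is deliberately simple (it does not validate the records themselves).

```python
from typing import Iterable


def genbank_records_complete(lines: Iterable[str]) -> bool:
    """Return True if every LOCUS has a '//' terminator and the file ends with one."""
    n_locus = 0
    n_term = 0
    last = ""
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue                  # ignore blank lines
        if line.startswith("LOCUS"):
            n_locus += 1
        elif line == "//":
            n_term += 1
        last = line
    return n_locus > 0 and n_locus == n_term and last == "//"
```

Since it reads line by line, it can be run directly on an open file handle, e.g. `genbank_records_complete(open("viruses.gb"))`, where the file name is just a placeholder.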


                      • #12
                        EMBL provides fasta files for database sections

                        For viral nucleotide sequences in fasta format, you could also go to the EMBL ftp site, specifically:

                        EMBL release: ftp://ftp.ebi.ac.uk/pub/databases/fa...rel_std_vrl.gz

                        EMBL updates: ftp://ftp.ebi.ac.uk/pub/databases/fa...cum_std_vrl.gz

                        If you go up to the directory level, you can see that there are other files containing viral sequences as well (the files with "vrl" in their names). To read about the meaning of the filenames, check the README at:

                        ftp://ftp.ebi.ac.uk/pub/databases/embl/README
