**GenoMax** · 09-08-2014, 03:33 AM

Can you narrow your search to viruses of interest or are you truly looking to download *every* virus sequence known?

**fefe89** · 09-08-2014, 03:35 AM

Originally posted by GenoMax

Can you narrow your search to viruses of interest or are you truly looking to download *every* virus sequence known?

Unfortunately, I need every known viral sequence. I have to create a sort of viral database.

**fefe89** · 09-08-2014, 04:05 AM

I spent the morning trying to download it the classical way, but the average speed is around 30-40 kb/sec.

It is not a problem with my internet connection.

**GenoMax** · 09-08-2014, 04:13 AM

As these things go, there is more than one way of doing this.

Get the "nt" sequence file from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. You should be able to grep out the sequences with "virus" in the sequence name into a separate file. I have not tested this, but it should work.

I will post a different solution below but that would need the "nt" blast database.
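A minimal sketch of the grep idea above, assuming the "nt" file is plain (or gunzipped) FASTA. A bare `grep` would return only the matching header lines, so this version uses `awk` to keep the sequence lines that follow each matching header. The function name and file names are made up for illustration.

```shell
# Keep whole FASTA records (header plus sequence lines) whose header
# matches a case-insensitive extended regex. Reads FASTA on stdin.
filter_fasta_by_header() {
  pattern=$1
  awk -v pat="$pattern" '
    /^>/ { keep = (tolower($0) ~ tolower(pat)) }  # decide once per record
    keep                                          # print lines of kept records
  '
}

# Usage (file names are placeholders):
#   gunzip -c nt.gz | filter_fasta_by_header 'virus|phage' > virus_sequences.fasta
```

Matching on names alone can still miss viral records or pull in unrelated ones, so treat the output as a first pass.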

**GenoMax** · 09-08-2014, 04:17 AM

If you have access to pre-formatted "nt" blast database then the following will work (you can get the database from this link: ftp://ftp.ncbi.nlm.nih.gov/blast/db/. There are multiple files for nt* and you will need to get all of them). You will also need the blast+ program suite from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/exe...blast+/LATEST/). It will take a while to run this command (depending on hardware you have access to).

Code:

$ blastdbcmd -db /path_to/nt -entry all -outfmt "%f" | grep "virus" | awk -F'|' '{print $2}' | blastdbcmd -db /path_to/nt -entry_batch - -out virus_sequence.fasta
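The `awk -F'|'` step in that command relies on pipe-delimited `gi|...` headers, which newer database releases may no longer use. A hedged variant of the same pipeline is sketched below: it lists accession and title with `-outfmt "%a %t"`, filters on the title, and feeds the accessions back through `-entry_batch`. The helper names are hypothetical, and this has not been run against a current "nt" database.

```shell
# Read "accession title" lines on stdin and print the accession of every
# entry whose title matches the case-insensitive regex given as $1.
select_accessions() {
  awk -v pat="$1" '{
    acc = $1
    title = $0
    sub(/^[^ ]+ */, "", title)          # drop the accession column
    if (tolower(title) ~ tolower(pat)) print acc
  }'
}

# Hypothetical wrapper around the two blastdbcmd calls from the post above.
# $1 = database path, $2 = title regex, $3 = output FASTA file.
extract_by_title() {
  db=$1; pattern=$2; out=$3
  blastdbcmd -db "$db" -entry all -outfmt "%a %t" \
    | select_accessions "$pattern" \
    | blastdbcmd -db "$db" -entry_batch - -out "$out"
}

# Usage: extract_by_title /path_to/nt 'virus|phage' virus_sequence.fasta
```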

**fefe89** · 09-08-2014, 04:23 AM

Originally posted by GenoMax

As these things go, there is more than one way of doing this.

Get the "nt" sequence file from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. You should be able to grep out the sequences with "virus" in the sequence name into a separate file. I have not tested this, but it should work.

I will post a different solution below but that would need the "nt" blast database.

Thank you for your reply

Well, actually I ran my search using the taxonomy IDs that belong to viruses. I have a doubt about your grep solution: a lot of phages have the word "phage" in the sequence name instead of "virus".

OK, I can run two different greps, but that seems like a dirty solution. I mean, I don't actually know how many viruses have neither "virus" nor "phage" in the sequence name. Am I being too paranoid?

Is there no other way to speed up the download from GenBank?

**GenoMax** · 09-08-2014, 05:01 AM

There are various caveats to the grep since as you point out you may get things that shouldn't be there and miss others you want.

The blastdbcmd is supposed to be able to search based on the taxid but that part is not working (taxid: 10239 viruses).

Here are the RefSeq releases for all viral/viroid sequences: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/. You would want the "fna" file.

I don't think there is any way to speed up the NCBI download (the problem may be that you are in Europe). Have you tried getting the sequences from a European database?
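For the RefSeq route, a rough sketch of merging the downloaded "fna" files into one FASTA file and counting the records as a sanity check. It assumes the files are gzip-compressed; the `*.genomic.fna.gz` pattern in the usage comment is an assumption, so check the directory listing first. The function names are made up for illustration.

```shell
# Decompress one or more gzipped FASTA files into a single FASTA file.
# $1 is the output path; the remaining arguments are the .gz inputs.
merge_fna_gz() {
  out=$1
  shift
  gunzip -c "$@" > "$out"
}

# Count the sequences in a FASTA file (one ">" header line per record).
count_fasta_records() {
  grep -c '^>' "$1"
}

# Usage (file name pattern is an assumption):
#   wget 'ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/*.genomic.fna.gz'
#   merge_fna_gz refseq_viral.fna *.genomic.fna.gz
#   count_fasta_records refseq_viral.fna
```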

**fefe89** · 09-08-2014, 05:18 AM

Originally posted by GenoMax

There are various caveats to the grep since as you point out you may get things that shouldn't be there and miss others you want.

The blastdbcmd is supposed to be able to search based on the taxid but that part is not working (taxid: 10239 viruses).

Here are the RefSeq releases for all viral/viroid sequences: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/. You would want the "fna" file.

Thank you Max

Since I'm working on environmental metagenomic data, the RefSeq file may not be the best solution, because I would be assigning my sequences only to model organisms (more or less).

But you gave me a great idea. I checked the GenBank FTP site, in particular this directory:

ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/

If I take all the gbvrl* files (VRL = viral sequences), I should be able to create a viral protein database. What do you think? It should work...

**GenoMax** · 09-08-2014, 05:55 AM

That should work. If you want DNA sequences, then you can get all the "gbvrl*" files from here: ftp://ftp.ncbi.nih.gov/genbank/.
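The "gbvrl*" files in that directory are, as far as I know, GenBank flat files rather than FASTA, so they need converting before they can be used as a FASTA database. A minimal `awk` sketch follows, assuming the standard flat-file layout (LOCUS, DEFINITION, VERSION, ORIGIN, `//`). The function name is hypothetical, and a dedicated parser such as Biopython would be more robust.

```shell
# Convert GenBank flat-file records on stdin to FASTA on stdout.
# Header = VERSION (or LOCUS name if VERSION is absent) plus DEFINITION.
genbank_to_fasta() {
  awk '
    /^LOCUS/      { id = $2 }
    /^VERSION/    { id = $2 }
    /^DEFINITION/ { def = substr($0, 13); in_def = 1; next }
    in_def && /^ / { sub(/^ +/, ""); def = def " " $0; next }  # wrapped line
    { in_def = 0 }
    /^ORIGIN/     { print ">" id " " def; in_seq = 1; next }
    /^\/\//       { in_seq = 0; id = ""; def = ""; next }      # end of record
    in_seq        { gsub(/[0-9 ]/, ""); print }                # strip numbering
  '
}

# Usage (file names are placeholders):
#   gunzip -c gbvrl1.seq.gz | genbank_to_fasta > gbvrl1.fasta
```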

**maubp** · 09-09-2014, 08:26 AM

Originally posted by fefe89

Unfortunately, I need every known viral sequence. I have to create a sort of viral database.

You might be better off scripting this using the NCBI Entrez API (or their Entrez command line tools), see for example:

Trouble with chimeras - getting all complete viral genomes from the NCBI

http://blastedbio.blogspot.co.uk/2013/11/entrez-trouble-with-chimeras.html

Back in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in Ge...

However, the problem of detecting a partial FASTA file remains. One advantage of downloading in GenBank format is partial records are easy to spot (and you could convert GenBank to FASTA locally).
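To illustrate the point about spotting partial records: every GenBank record starts with a LOCUS line and ends with a `//` line, so a truncated download shows up as a mismatch between the two. The check below is a sketch under that assumption. The function name is made up, and the Entrez Direct command in the comment is only an example query that has not been run here.

```shell
# Succeeds only if every record that starts (LOCUS line) also ends ("//")
# and the last non-empty line of the file is the "//" terminator.
genbank_is_complete() {
  awk '
    /^LOCUS/ { started++ }
    /^\/\/$/ { ended++ }
    NF       { last = $0 }
    END      { exit !(started > 0 && started == ended && last == "//") }
  ' "$1"
}

# Usage with the Entrez command line tools (query is an assumption):
#   esearch -db nuccore -query 'txid10239[Organism:exp]' \
#     | efetch -format gb > viruses.gb
#   genbank_is_complete viruses.gb || echo "download looks truncated"
```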

**bt27uk** · 09-09-2014, 09:35 AM

EMBL provides fasta files for database sections

For viral nucleotide sequences in fasta format, you could also go to the EMBL ftp site, specifically:

EMBL release: ftp://ftp.ebi.ac.uk/pub/databases/fa...rel_std_vrl.gz

EMBL updates: ftp://ftp.ebi.ac.uk/pub/databases/fa...cum_std_vrl.gz

If you go to the directory level, you can see that there are other files containing viral sequences as well (the files with "vrl" in their names). To read about the meaning of the filenames, check out the README at:

ftp://ftp.ebi.ac.uk/pub/databases/embl/README

Problem downloading fasta sequence from Genbank
