SEQanswers


evensrii 01-09-2018 12:39 AM

Generate local blast database with RefSeq bacteria AND taxonomy
 
Dear all,

I would like to be able to create my own custom local BLAST database, as this may be relevant in many different situations in bioinformatics. In this case, I hope to make a database containing the latest versions of all the bacterial genomes found in RefSeq. For starters, I have downloaded bacterial genomes (assemblies) from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria, using the information in "assembly_summary.txt" to fetch only the latest version of each genome. As a result, I now have almost 104,000 files (one per bacterial genome), each containing one or more contigs. So far, so good.
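For reference, the selection step described above could look roughly like the sketch below. It assumes that "assembly_summary.txt" is tab-separated, that its last comment line holds the column names, and that it has columns named assembly_accession, taxid, version_status and ftp_path. The function name latest_assemblies is only illustrative.

```python
def latest_assemblies(summary_text):
    """Yield (assembly_accession, taxid, ftp_path) for rows marked 'latest'.

    Assumption: the last '#' line before the data rows holds the
    tab-separated column names, including 'assembly_accession',
    'taxid', 'version_status' and 'ftp_path'.
    """
    header = None
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # Each comment line overwrites the previous one, so the
            # last comment line ends up being used as the header.
            header = line.lstrip("# ").split("\t")
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("version_status") == "latest":
            yield row["assembly_accession"], row["taxid"], row["ftp_path"]
```

If those column names are right, the same pass already gives one taxid per assembly, which could be stored alongside each downloaded file.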

Each contig within a genome has a header containing the NCBI accession number followed by a description, e.g.:

Genome (file) 1:

Code:

>NZ_NMDP01000102.1 Escherichia coli strain MOD1-EC6062

>NZ_NMDP01000103.1 Escherichia coli strain MOD1-EC6062

Genome (file) 2:

Code:

>NZ_NOBY01000102.1 Escherichia coli strain MOD1-EC5816

>NZ_NOBY01000115.1 Escherichia coli strain MOD1-EC5816

etc...

I now want to associate each genome with its taxonomy (a taxid?), as I understand this is important in many applications. For example, by running BLAST against my local database, I want to be able to quickly determine which bacterium my query sequence originates from.

My questions are therefore:

1. How do I find the taxon ID for all the bacterial genomes in question?

(Note: These are genomes from ../genomes/refseq/bacteria, not ..refseq/release/bacteria.)

2. How do I incorporate that information into my genome files and/or final local database?

I suspect I first have to link up the NCBI accession number in the headers to a taxon ID in some way, but I'm not sure how to do that, or in what format it should be.
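To make the idea concrete, here is a minimal sketch of what such a link could look like, assuming the taxid of each genome file is already known (562, the species-level taxid for Escherichia coli, is used purely as an illustration) and assuming the target format is one "sequence-ID taxid" pair per line, which is what makeblastdb's -taxid_map option reads. The function name taxid_map_lines is hypothetical.

```python
def taxid_map_lines(fasta_lines, taxid):
    """Yield 'seq_id taxid' lines for every FASTA header in fasta_lines.

    Assumption: the sequence ID is the first whitespace-delimited
    token after '>', e.g. 'NZ_NMDP01000102.1'.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip malformed headers that are just '>'
                yield f"{parts[0]} {taxid}"


# Illustrative use on the two headers from genome file 1.
headers = [
    ">NZ_NMDP01000102.1 Escherichia coli strain MOD1-EC6062",
    ">NZ_NMDP01000103.1 Escherichia coli strain MOD1-EC6062",
]
for entry in taxid_map_lines(headers, 562):
    print(entry)

# The combined map for all genomes might then be passed to something like:
#   makeblastdb -in all_genomes.fna -dbtype nucl -parse_seqids \
#       -taxid_map taxid_map.txt -out refseq_bacteria
# (the file and database names here are placeholders)
```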

All answers are highly appreciated! :)

Kind regards,

Even Sannes Riiser,

PhD candidate, University of Oslo, Norway

