Hi,
I would like to create a personal BLAST database of arbitrary sequences and use all the features of BLAST+ to create database subsets based on identifiers or to filter based on taxonomy.
It looks like the formatting of the definition line in the input FASTA files is crucial to assign proper sequence identifiers.
Using the general database identifier format gnl|database|identifier or the local identifier format lcl|identifier, I wasn't able to use blastdb_aliastool to create db subsets, as it expects a GI list as input. I also didn't have any luck assigning taxonomy identifiers with the -taxid_map option of makeblastdb.
What is the recommended way to format FASTA definition lines in order to be able to use all the filtering features of the BLAST+ tools?
I was thinking of creating pseudo GenBank definitions for all my sequences: gi|<gi-number>|gb|<AccessionVersion>|<Accession>, where <gi-number> is a generated numeric value and <Accession/Version> is my identifier. This works for the GI-based filtering, but it seems like an ugly hack and I would prefer something more straightforward.
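To make the workaround above concrete, here is a minimal Python sketch of how such pseudo GenBank definition lines could be generated. It only rewrites text and does not call any BLAST+ tool. The function name pseudo_genbank_headers, the sequential numbering starting at 1, and the choice to put the same identifier in both the <AccessionVersion> and <Accession> fields are assumptions made for illustration.

```python
def pseudo_genbank_headers(fasta_lines, first_gi=1):
    """Yield FASTA lines with each definition line rewritten as
    gi|<gi-number>|gb|<identifier>|<identifier> [description].

    Assumption: the first whitespace-delimited token of the original
    definition line is the user's own identifier, and it is reused for
    both the accession.version and accession fields.
    """
    gi = first_gi
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            # Fall back to a made-up name if the definition line is empty.
            identifier = parts[0] if parts else f"seq{gi}"
            description = parts[1] if len(parts) > 1 else ""
            header = f">gi|{gi}|gb|{identifier}|{identifier}"
            if description:
                header += " " + description
            yield header
            gi += 1  # generated numeric value, one per sequence
        else:
            # Sequence lines pass through unchanged.
            yield line


# Hypothetical input, not real data.
example = [">mySeq1 first test sequence", "ACGT", ">mySeq2", "GGCC"]
rewritten = list(pseudo_genbank_headers(example))
for out_line in rewritten:
    print(out_line)
```

The generated numbers are only unique within one run, so a GI list built for filtering would have to come from the same rewritten file.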
How is the taxid_map file formatted? I've tried <gi-number>, <gb|<AccessionVersion>>, and <gb|Accession> as the sequence identifier, but none of them seems to be assigned properly, and blastdbcmd with -outfmt %T gives me zero for all entries.
Thanks for any help,
Deniz
PS: I've already posted this at http://I.SEQanswers.com, but as it's still in beta, traffic is a bit low and I'm probably more likely to get an answer here in the forum.