  • BLAST+ creating custom blast database and using blast+ filtering features

    Hi,

    I would like to create a personal blast database of arbitrary sequences and be able to use all the features of BLAST+ to create subsets of databases based on identifiers or filter based on taxonomy.

    It looks like the formatting of the definition line in the input FASTA files is crucial to assign proper sequence identifiers.

    Using the general database identifier format gnl|database|identifier or the local identifier format lcl|identifier, I wasn't able to use blastdb_aliastool to create database subsets, as it expects a GI list as input. I also didn't have any luck assigning taxonomy identifiers with the -taxid_map option of makeblastdb.

    What is the recommended way to format FASTA definition lines in order to be able to use all the filtering features of the BLAST+ tools?

    I was thinking of creating pseudo GenBank definitions for all my sequences: gi|<gi-number>|gb|<AccessionVersion>|<Accession>, where <gi-number> is a generated numeric value, and <Accession/Version> is my identifier. This works for the GI-based filtering, but it seems like an ugly hack and I would prefer something more straightforward.
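A minimal sketch of the renaming step described above, assuming Python, a plain FASTA input whose first defline token is the sequence identifier, and generated GI numbers that simply count up from 1. The function name `pseudo_genbank_deflines` and the `first_gi` parameter are hypothetical, and using the same identifier for both fields after `gb|` is an assumption rather than an NCBI rule.

```python
import itertools


def pseudo_genbank_deflines(fasta_lines, first_gi=1):
    """Rewrite FASTA deflines as gi|<n>|gb|<id>|<id>.

    Returns the rewritten lines and a dict mapping each original
    identifier to the GI number generated for it.
    """
    counter = itertools.count(first_gi)  # assumption: sequential made-up GIs
    rewritten = []
    gi_by_id = {}
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if not line.startswith(">"):
            rewritten.append(line)  # sequence data is passed through as-is
            continue
        parts = line[1:].split(None, 1)
        if not parts:
            raise ValueError("empty FASTA definition line")
        ident = parts[0]  # first token is taken as the identifier
        description = parts[1] if len(parts) > 1 else ""
        gi = next(counter)
        gi_by_id[ident] = gi
        new_line = f">gi|{gi}|gb|{ident}|{ident}"
        if description:
            new_line += " " + description
        rewritten.append(new_line)
    return rewritten, gi_by_id
```

The returned mapping keeps track of which generated number belongs to which original identifier, so a GI list for a subset can be built later from the original names.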

    How is the taxid_map file formatted? I've tried <gi-number>, <gb|<AccessionVersion>>, or <gb|Accession> as the sequence identifier, but none of them seem to be assigned properly, and blastdbcmd with -outfmt %T gives me zero for all entries.

    Thanks for any help,

    Deniz

    PS: I've already posted this at http://I.SEQanswers.com, but as it's still in beta, traffic is a bit low, and I'm probably more likely to get an answer here in the forum.

  • #2
    Hi,
    I was wondering if you found a solution for this? I have been having all kinds of problems creating subset blast databases.
    Thanks


    • #3
      GI lists are simple to create. If you have experience with bash, perl, or any programming language that you can use to format text, you should be able to pull the GIs from these files automatically and put them into a GI list. A GI list is simply a text file with one number per line, where each number is a GI. There may be some utilities that do this automatically (e.g. FASTA->GI list), though I don't know of any.
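As a rough illustration of the text-formatting approach, here is a minimal Python sketch that pulls the GI out of each FASTA definition line and formats the result as one number per line. It assumes deflines that start with `>gi|<number>|`; the names `gi_list_from_fasta` and `format_gi_list` are made up for this example and are not part of BLAST+.

```python
import re

# Assumption: deflines look like ">gi|12345|gb|...". Other lines are skipped.
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def gi_list_from_fasta(fasta_lines):
    """Return the GI numbers found in FASTA deflines, in file order."""
    gis = []
    for line in fasta_lines:
        match = GI_PATTERN.match(line)
        if match:
            gis.append(int(match.group(1)))
    return gis


def format_gi_list(gis):
    """Format GIs as GI-list text: one number per line."""
    return "".join(f"{gi}\n" for gi in gis)
```

Writing the output of `format_gi_list` to a text file should give something usable as GI-list input, for example with the `-gilist` option of blastdb_aliastool, though that last step is not tested here.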

      Alternatively, a nice way I've found to create GI lists is to query NCBI for the data you want. Run a search at NCBI, click "Send to" in the top-right corner, and export the results as a GI list. This is an easy way of getting a GI list for BLAST subsets, though it's difficult to automate.
