  • miguelangel
    Member
    • Jun 2012
    • 16

    How can I format RDP database to be used in a BLAST search?

    Hello there and thank you for your welcome!

    I have to format the RDP 16S bacterial database, which is in FASTA format (downloaded from here: http://rdp.cme.msu.edu/misc/resources.jsp), so that it can be used in a BLAST search carried out with the QIIME command 'assign_taxonomy.py'.

    I need to create an index from the fasta file, and I have read that this can be done with 'formatdb' from the standalone BLAST package, but when I try it I always get a message like this one:

    [formatdb 2.2.22] ERROR: RDP_11_2_index.txt.nhrOutput
    Blast-def-line-set.E.<title>
    Invalid value(s) [9] in VisibleString [uncultured bacterium; DolOr_72351#Lineage=Root;rootrank;Bacteria;domain;unclassified_Bacteria; ...]

    I still get a .nhr file, but its contents look like this:

    S000655540¢Ä0Ä0ĆÄňuncultured actinobacterium; GASP-KA1W3_B01#Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Ilumatobacter;genus°Ä0ĆİÄ


    These unknown characters prevent it from being used with the BLAST option or in any other BLAST search.

    Could anyone help me with this issue? I am really stuck at this step...


    Thanks a lot

    MA
  • miguelangel
    Member
    • Jun 2012
    • 16

    #2
    I have tried the same thing with 'makeblastdb', and now I get a different error:

    Error: (803.7) Blast-def-line-set.E.title
    Bad char [0x9] in string at byte 38
    uncultured bacterium; L2Sp-13 Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Ilumatobacter;genus

    I also get a .nhr file almost identical to the one generated with 'formatdb'.

    I am sure that the problem is in the format of the original fasta file, whose entries look like this:

    >S000655540 uncultured bacterium; L2Sp-13 Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Ilumatobacter;genus
    ggaatcttgcgcaatgggcgaaagcctgacgcagcaacgccgcgtgcgggatgaaggccttcgggctgtaaaccgctttc
    agcaggaacgaaaatgacggtacctgcagaagaaggagcggccaactacgtgccagcagccgcggtgacacgtaggctcc
    aagcgttgtccggatttattgggcgtaaagagctcgtaggcggttgagtaagtcgggtgtgaaaactctgggcttaaccc
    ggagacgccatccgatactgctctgactagagttcaggaggggagtggggaattcctagtgtagcggtgaaatgcgcaga
    tattaggaggaacaccggtggcgaaggcgccactctggactgaaactgacgctgaggagcgaaagcatgggtatcaaaca
    ggattagataccctggtactccatgccgtaaacggtgggcactaggtgtgggttccaactaacgggatccgcgccgtcgc
    taacgcattaagtgccccgcctggggagtacggtcgcaagactaaaactcaaatgaattgacgg


    Any idea how I can change this format so that it works with the formatdb/makeblastdb commands?

    Thanks again

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      The problem is likely a tab character in the header (possibly between the S* ID and the rest of the header), based on the 0x9 code in your error. The ID header line also appears to wrap onto a second line (unless your copy/paste did that). You will likely need to reformat the headers.
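
One way to check the tab-character guess before reformatting anything is to count the header lines that contain a tab and print one of them with the tab made visible. This is only a rough sketch using awk; the helper names `count_tab_headers` and `show_first_tab_header` are made up for illustration and are not part of BLAST or QIIME.

```shell
# Count FASTA header lines (lines starting with '>') that contain a tab (0x09).
# count_tab_headers is a hypothetical helper name.
count_tab_headers() {
    # index() returns 0 when no tab is present in the line
    awk '/^>/ && index($0, "\t") { n++ } END { print n + 0 }' "$1"
}

# Print the first header line that contains a tab, with each tab shown as "<TAB>".
# show_first_tab_header is a hypothetical helper name.
show_first_tab_header() {
    awk '/^>/ && index($0, "\t") { gsub(/\t/, "<TAB>"); print; exit }' "$1"
}
```

For example, `count_tab_headers release11_2_Bacteria_unaligned.fa` would report how many headers are affected, and `show_first_tab_header` on the same file would show where in the header the tab sits.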

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Using the "release11_2_Bacteria_unaligned.fa" file downloaded from the link you posted, I was able to create the indexes using makeblastdb (v. 2.2.29+) without the errors you saw. I ran:

        Code:
        $ makeblastdb -dbtype nucl -in release11_2_Bacteria_unaligned.fa
        I did get a number of messages like the one below, which may or may not indicate a real problem: http://www.acgt.me/blog/2014/5/15/fu...rom-ncbi-blast

        Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 45% ambiguous nucleotides (shouldn't be over 40%)

        • miguelangel
          Member
          • Jun 2012
          • 16

          #5
          Thanks a lot

          I got exactly the same errors, so I will check whether these new files are properly formatted to run BLAST.

          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            I tried a test blast with a few sequences from the RDP fasta file. Worked without any problems.
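
A small test query like the one described above could be pulled out of the RDP file with something along these lines. It is a minimal sketch, assuming a plain multi-line FASTA file; `first_n_records` is a hypothetical helper name.

```shell
# first_n_records N FILE: print the first N FASTA records of FILE.
# A record is a '>' header line plus the sequence lines that follow it.
# first_n_records is a hypothetical helper name.
first_n_records() {
    # Count headers as they appear and stop once record N+1 starts.
    awk -v n="$1" '/^>/ { seen++ } seen > n { exit } { print }' "$2"
}
```

Running `first_n_records 5 release11_2_Bacteria_unaligned.fa > test_query.fa` would give a five-sequence query (the output file name is just an example). Assuming the database was built as in the earlier post, that file could then be searched with `blastn -query test_query.fa -db release11_2_Bacteria_unaligned.fa`.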

            If you do not need all the extra information in the fasta header, you could remove most of it with the following command (leaving only the S* IDs):

            Code:
            $ sed -e 's/>* .*$//' release11_2_Bacteria_unaligned.fa > release11_2_Bacteria_unaligned_truncated_header.fa
            Then build the indexes from the new file.
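
If the lineage text in the header is still needed, a possible variant is to keep the full header and only swap the tabs for spaces. This is a sketch under the assumption that the tab (0x9) is the only character the older formatdb/makeblastdb versions reject; `detab_headers` is a hypothetical helper name.

```shell
# detab_headers FILE: replace every tab in header lines with a single space.
# Sequence lines are passed through unchanged.
# detab_headers is a hypothetical helper name.
detab_headers() {
    awk '/^>/ { gsub(/\t/, " ") } { print }' "$1"
}
```

For example, `detab_headers release11_2_Bacteria_unaligned.fa > release11_2_Bacteria_unaligned_detabbed.fa` (the output name is just an example) writes a copy with tab-free headers, which can then be indexed in the same way.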
            Last edited by GenoMax; 07-04-2014, 12:21 PM.
