
  • How can I format RDP database to be used in a BLAST search?

    Hello there and thank you for your welcome!

    I need to format the RDP 16S bacterial database, in FASTA format (downloaded from here: http://rdp.cme.msu.edu/misc/resources.jsp), for use in a BLAST search run with the QIIME command 'assign_taxonomy.py'.

    I need to create an index from the FASTA file, and I have read that this can be done with 'formatdb' from the standalone BLAST program, but when I try it I always get a message like this one:

    [formatdb 2.2.22] ERROR: RDP_11_2_index.txt.nhrOutput
    Blast-def-line-set.E.<title>
    Invalid value(s) [9] in VisibleString [uncultured bacterium; DolOr_72351#Lineage=Root;rootrank;Bacteria;domain;unclassified_Bacteria; ...]

    I still get a .nhr file, but its contents look like this:

    S000655540¢Ä0Ä0ĆÄňuncultured actinobacterium; GASP-KA1W3_B01#Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Ilumatobacter;genus°Ä0ĆİÄ


    It contains unknown characters that prevent it from being used with the BLAST option or in any other BLAST search.

    Could anyone help me with this issue? I am really stuck at this step...


    Thanks a lot

    MA

  • #2
    I have tried the same thing with 'makeblastdb', and now I get a different error:

    Error: (803.7) Blast-def-line-set.E.title
    Bad char [0x9] in string at byte 38
    uncultured bacterium; L2Sp-13 Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Ilumatobacter;genus

    I also get a .nhr file almost identical to the one generated with 'formatdb'.

    I am sure the problem is in the format of the original FASTA file, whose entries look like this:

    >S000655540 uncultured bacterium; L2Sp-13 Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Ilumatobacter;genus
    ggaatcttgcgcaatgggcgaaagcctgacgcagcaacgccgcgtgcgggatgaaggccttcgggctgtaaaccgctttc
    agcaggaacgaaaatgacggtacctgcagaagaaggagcggccaactacgtgccagcagccgcggtgacacgtaggctcc
    aagcgttgtccggatttattgggcgtaaagagctcgtaggcggttgagtaagtcgggtgtgaaaactctgggcttaaccc
    ggagacgccatccgatactgctctgactagagttcaggaggggagtggggaattcctagtgtagcggtgaaatgcgcaga
    tattaggaggaacaccggtggcgaaggcgccactctggactgaaactgacgctgaggagcgaaagcatgggtatcaaaca
    ggattagataccctggtactccatgccgtaaacggtgggcactaggtgtgggttccaactaacgggatccgcgccgtcgc
    taacgcattaagtgccccgcctggggagtacggtcgcaagactaaaactcaaatgaattgacgg


    Any idea how I can change this format so that it works with the formatdb/makeblastdb commands?

    Thanks again



    • #3
      The problem is likely a tab character in the header, possibly between the S* ID and the rest of the line: the 0x9 code in your error is the ASCII code for a tab. The header line also appears to wrap onto a second line (unless your copy/paste did that). You will likely need to reformat the headers.
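A minimal sketch of how the tab diagnosis could be checked and worked around, assuming the only offending characters are tabs on header lines. The file names `demo_rdp.fa` and `demo_rdp_notab.fa` and the shortened header are hypothetical stand-ins for the real RDP download:

```shell
# Hypothetical demo input: one record whose header contains a tab (0x9).
printf '>S000655540 uncultured bacterium; L2Sp-13\tLineage=Root;rootrank;Bacteria;domain\nggaatcttgcgcaatgggcg\n' > demo_rdp.fa

# Count the header lines that contain at least one tab.
awk '/^>/ && /\t/ { n++ } END { print n + 0 }' demo_rdp.fa

# Replace every tab on header lines with a space; sequence lines pass through unchanged.
awk '/^>/ { gsub(/\t/, " ") } { print }' demo_rdp.fa > demo_rdp_notab.fa
```

If the count is zero on the real file, the tab is not the cause and the headers would need a closer look.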



      • #4
        Using the "release11_2_Bacteria_unaligned.fa" file downloaded from the link you posted, I was able to create the indexes using makeblastdb (v. 2.2.29+) without the errors you saw. I ran:

        Code:
        $ makeblastdb -dbtype nucl -in release11_2_Bacteria_unaligned.fa
        I did get a number of messages like the one below, which may or may not indicate a real problem (see http://www.acgt.me/blog/2014/5/15/fu...rom-ncbi-blast):

        Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 45% ambiguous nucleotides (shouldn't be over 40%)
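To see which records might trigger this kind of message, something like the following sketch could be used. It flags records whose first sequence line is more than 40% ambiguous, under the assumption that every character other than A, C, G or T (either case) counts as ambiguous. That definition is a guess and may not match what the FASTA reader actually does. The file names are hypothetical:

```shell
# Hypothetical demo input: seq1 starts with a line that is 60% "n".
printf '>seq1\nacgtnnnnnn\nacgt\n>seq2\nacgtacgtan\n' > demo_ambig.fa

awk '
  /^>/ { id = substr($1, 2); first = 1; next }   # remember the ID, expect the first data line
  first {
    line = $0
    total = length(line)
    unamb = gsub(/[ACGTacgt]/, "", line)         # gsub returns the number of characters replaced
    if (total > 0 && (total - unamb) / total > 0.40) print id
    first = 0
  }
' demo_ambig.fa > demo_flagged.txt

cat demo_flagged.txt
```

Only the first data line of each record is inspected, since that is what the message refers to.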



        • #5
          Thanks a lot

          I got exactly the same errors, so I will check whether these new files are properly formatted to run BLAST.



          • #6
            I ran a test BLAST with a few sequences from the RDP FASTA file. It worked without any problems.

            If you do not need all the extra information in the FASTA header, you can remove most of it with the following command, which leaves only the S* IDs:

            Code:
            $ sed -e 's/>* .*$//' release11_2_Bacteria_unaligned.fa > release11_2_Bacteria_unaligned_truncated_header.fa
            Then build the indexes from the new file.
            Last edited by GenoMax; 07-04-2014, 12:21 PM.
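A slightly more defensive variant of the header truncation above, offered as a sketch and not as a replacement: it restricts the substitution to lines starting with `>` and cuts at the first whitespace character of any kind, so a tab directly after the ID would be handled too. The file names `demo_full.fa` and `demo_ids.fa` and the shortened header are hypothetical:

```shell
# Hypothetical demo input; the real file in the thread is release11_2_Bacteria_unaligned.fa.
printf '>S000655540 uncultured bacterium; L2Sp-13\tLineage=Root;rootrank\nggaatcttgcgc\naatgggcgaaag\n' > demo_full.fa

# On header lines only, delete everything from the first whitespace character
# (space or tab) to the end of the line, leaving ">ID".
sed -e '/^>/ s/[[:space:]].*$//' demo_full.fa > demo_ids.fa

# The number of records should be the same before and after.
grep -c '^>' demo_full.fa
grep -c '^>' demo_ids.fa
```

The truncated file can then be indexed with the same makeblastdb call used earlier in the thread.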
