  • New member with a blast database problem

    Hello!

    I am a graduate student at UC Berkeley currently working with raw reads of several transcriptomes in an attempt to find and assemble reads that match a couple of genes I'm studying. This site has already been very useful to me (Thank you!!!!), but I haven't found any answers pertaining to my current problem.

    I was hoping I could get some help with a BLAST problem I'm having. I am working with standalone blast, and am building blast databases from fasta files of transcriptome raw reads. I have successfully used the command:

    makeblastdb -in COST_1_final.fasta -input_type fasta -dbtype nucl -out COST_1_final

    for several fasta files of raw reads (with varying names, of course), but a few of the files result in multiple sets of database files, named for example <filename>.00.nhr and <filename>.01.nhr, along with what I believe is an alias file, <filename>.nal.

    The command, as it runs, gives the usual message:
    Building a new DB, current time: 10/15/2015 10:49:55
    New DB name: COST_u_final
    New DB title: COST_u_final.fasta
    Sequence type: Nucleotide
    Keep Linkouts: T
    Keep MBits: T
    Maximum file size: 1000000000B
    Adding sequences from FASTA; added 16313457 sequences in 937.429 seconds.

    but then results in multiple sets of files.

    Any ideas about how I can make just one set of database files, as has successfully happened with the rest of my fasta files? I would really appreciate any help!!

  • #2
    It is normal to get multiple files per blast database. That is how makeblastdb is supposed to work. Just make sure the files for a database stay together in the same directory and that you use the "basename" of the database when you run your searches. (A suggestion: name your database something other than your input file name.)

    • #3
      I've always gotten multiple files in the sense of .nhr, .nin, .nsq files, but I am getting 2 of each, like .00.nhr and .01.nhr. Why would that only happen some of the time?

      • #4
        That probably depends on the size of the input fasta file. For example, the nt database has 32 fragment files now.
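The "Maximum file size: 1000000000B" line in the log above suggests the split is driven by a per-volume size limit. makeblastdb has a `-max_file_sz` option for that limit, so raising it may reduce the number of volumes. The sketch below only prints such a command rather than running it. The helper name `print_makeblastdb_cmd` and the `2GB` default are illustrative assumptions, and the largest accepted value depends on the BLAST+ version, so check `makeblastdb -help` first.

```shell
# print_makeblastdb_cmd: print (do not run) a makeblastdb command with a
# larger per-volume size limit.
# Arguments: FASTA file, database basename, optional size limit.
# The 2GB default is an assumption for illustration only.
print_makeblastdb_cmd() {
    fasta=$1
    db=$2
    max_size=${3:-2GB}
    printf 'makeblastdb -in %s -input_type fasta -dbtype nucl -out %s -max_file_sz %s\n' \
        "$fasta" "$db" "$max_size"
}

print_makeblastdb_cmd COST_u_final.fasta COST_u_final
```

Even with a higher limit, a large enough input can still be split, and a multi-volume database searches the same way as a single-volume one.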

        There should be a database_name.nal file that enumerates all the file pieces if there is more than one.
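To see which pieces an alias file points at, something like the following could work. It is a minimal sketch that assumes the .nal file is plain text with a `DBLIST` line naming each volume (quoted or unquoted); `list_volumes` is a made-up helper name, not a BLAST tool.

```shell
# list_volumes: print one volume name per line from a BLAST alias file.
# Assumes the file has a line that starts with DBLIST, for example:
#   DBLIST "COST_u_final.00" "COST_u_final.01"
list_volumes() {
    grep '^DBLIST' "$1" \
        | sed 's/^DBLIST[[:space:]]*//' \
        | tr -d '"' \
        | tr -s '[:space:]' '\n'
}
```

For example, `list_volumes COST_u_final.nal` should then print each volume basename on its own line, and each of those should have its own .nhr, .nin and .nsq files in the same directory.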

        • #5
          Ahhh, interesting! So if I search using the command

          blastn -db COST_u_final.fasta -query Genefiles -outfmt 6 -out BLASTresults.txt

          will it search all of the files that were made?

          Thank you for your help! I had no idea this wasn't a problem since I've never seen this happen before.

          • #6
            Yes, that is correct.

            Just to make things less confusing (to others, if needed later on), don't use the fasta file name as the -out database basename.

            • #7
              Thank you for the advice, I will change that practice.

              • #8
                Hmm. I often use the fasta file name as the blastDB name. Keeps them together. The extensions are going to be different so there should be no confusion. @GenoMax: what do you use?

                • #9
                  I generally drop the .fasta/.fa part when naming a blast db. (Like NCBI. They don't call their db's nt.fa or nr.fa).

                  The idea of using just the "basename" when specifying a db index is new to some. I suppose keeping the .fasta/.fa may be more logical for them.
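The naming convention of dropping the .fasta/.fa part can be sketched as a small shell helper. `db_basename` is a hypothetical name used only for illustration. Note that the log earlier in the thread reports "New DB name: COST_u_final", so for that database the value passed to `-db` would presumably be COST_u_final rather than the fasta file name.

```shell
# db_basename: strip a trailing .fasta or .fa extension from a file name,
# leaving any other name unchanged.
db_basename() {
    case "$1" in
        *.fasta) printf '%s\n' "${1%.fasta}" ;;
        *.fa)    printf '%s\n' "${1%.fa}" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

# Intended use (commented out so the sketch runs without BLAST installed):
# fasta=COST_u_final.fasta
# db=$(db_basename "$fasta")
# makeblastdb -in "$fasta" -input_type fasta -dbtype nucl -out "$db"
# blastn -db "$db" -query Genefiles -outfmt 6 -out BLASTresults.txt
```

Using the same variable for `-out` and `-db` keeps the two commands in step, whichever naming style you prefer.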
