  • Missing OIDs during blast db dump?

    Hi all,

    I'm new to BLAST (and particularly to the command line), so I have a few questions and issues whose significance I'm not sure of.

    I'm trying to build a BLAST db that is a subset of nr with only human records. I downloaded a GI list from the Entrez protein database and then ran:

    cat gi.txt | blastdbcmd -db nr_humans -entry_batch - -out human_sequences.txt

    While it runs, I am receiving a large number of errors about missing OIDs (e.g. "Error: 567316212: OID not found"). I've gotten about 500 so far, and the database hasn't quite finished processing.

    Is this expected (perhaps because Entrez has more proteins than the nr database)? Or is it a problem that I should look into more closely?


    Long background: I'm planning to run delta-blast on more than 5,000 sequences, so I'm trying to set up a local BLAST system. I've downloaded and installed BLAST+ and the nr database. I ran a few blastp queries against nr and they took an excessive amount of time. I also wanted only Homo sapiens results, so I created an alias following the instructions here. This results in a much faster query; however, I wanted to see whether rebuilding the database would yield an even faster result, so I followed the instructions here (to some extent, since I already had my GIs from the first run).
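One way to judge how significant the missing-OID errors described above are is to save the error stream and count the distinct GIs that failed, then compare that with the number of GIs requested. The following is a minimal sketch, not a tested recipe. It assumes the errors are written to stderr in exactly the form quoted in the post; the file name `blast_errors.log` and the function name `missing_gis` are hypothetical.

```shell
# Hypothetical helper: list the distinct GIs reported as "OID not found".
# Assumes each error line looks like:  Error: 567316212: OID not found
missing_gis() {
    sed -n 's/^Error: \([0-9][0-9]*\): OID not found.*/\1/p' "$1" | sort -u
}

# Example use (commented out so the snippet runs without BLAST installed):
#   blastdbcmd -db nr_humans -entry_batch gi.txt -out human_sequences.txt 2> blast_errors.log
#   missing_gis blast_errors.log | wc -l    # distinct GIs that were not found
#   wc -l < gi.txt                          # GIs requested
```

If the first count is a tiny fraction of the second, the misses are more likely records that exist in Entrez but not in the local copy of nr than a sign of a broken database, though that interpretation is itself an assumption.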

  • #2
    Sorry, I didn't explain that. nr_humans is an alias to the nr database created by applying the original GI list. I followed the instructions here:

    blastdb_aliastool -gilist gi.txt -db nr -out nr_humans -title nr_humans

    Essentially, the output of the command should be everything that the alias sees (and would likely be the same as blastdbcmd -db nr_humans -out human_sequences.txt), but the OIDs are missing regardless of whether I use nr or nr_humans (which is expected).



    • #3
      I missed that line in your explanation until I read your post again.

      Is the output file being populated irrespective of the database (or alias) being used? nr is so huge at this point that it may not be surprising to find errors in it.

      What exactly are you interested in from the human subset of nr?



      • #4
        See this post on the "missing OIDs": http://blastedbio.blogspot.com/2012/...cbi-blast.html



        • #5
          Yes, the file is being populated in either case, and the number of misses seems minute compared to the number of hits. I haven't run both to look for differences, but I don't expect to find any (the alias is a restriction built from the same list that I'm using for the dump anyway).

          At this point I'm interested in doing a homology search for yeast proteins against human proteins. I'm only interested in human proteins, which is the reason for the restriction.
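The comparison mentioned above (dumping from nr and from the nr_humans alias and looking for differences) could be sketched as follows. This is only an illustration: the helper names `fasta_ids` and `diff_ids` are hypothetical, and it assumes both dumps are FASTA files whose sequence identifier is the first space-delimited token of each defline.

```shell
# Hypothetical helpers for comparing two FASTA dumps by sequence identifier.

# Print the sorted, de-duplicated first token of every defline (text after ">").
fasta_ids() {
    grep '^>' "$1" | sed 's/^>\([^ ]*\).*/\1/' | LC_ALL=C sort -u
}

# Print identifiers found in only one of the two files.
# Column 1: only in the first file; column 2 (tab-indented): only in the second.
diff_ids() {
    a=$(mktemp); b=$(mktemp)
    fasta_ids "$1" > "$a"
    fasta_ids "$2" > "$b"
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

Calling `diff_ids dump_from_nr.fasta dump_from_alias.fasta` (both file names hypothetical) prints nothing when the two dumps contain the same set of identifiers.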



          • #6
            A quicker way to do this would be to get the human protein sequence complement from a "BioMart" search (http://useast.ensembl.org/info/data/biomart.html) on the Ensembl site. I see a total of 64,138 at this time.



            • #7
              Hmm, so that link scared me initially, but it appears that blastdbcmd is treating the GIs as GIs rather than as OIDs (or perhaps they are the same thing in nr). I went through a few pages of the output and the sequences are all [Homo sapiens] (or, for sequences with multiple species, Homo sapiens is at least among them).

              Aside from that possibility, I don't see that it is directly related to the problem at hand.
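Rather than paging through the output by eye, the species check described above can be approximated by counting deflines that never mention Homo sapiens. This is a rough sketch: the function name `count_non_human` is hypothetical, and it assumes that every species name for a multi-species record appears on the same defline line in the dump.

```shell
# Hypothetical helper: count deflines that do not mention Homo sapiens.
# The match is case-insensitive; "|| true" keeps the exit status zero
# when the count is 0 (grep exits non-zero if it selects no lines).
count_non_human() {
    grep '^>' "$1" | grep -vic 'homo sapiens' || true
}

# Example use (file name taken from the thread):
#   count_non_human human_sequences.txt
```

A result of 0 would agree with the manual spot check; any other number gives the count of deflines worth inspecting.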
