  • Blast organism mask

    Hi All,

    I know this should be simple, but I just can't find an example of how to do this.. I'll keep looking, but hoped it might save time if someone already knew..

    I want to run blastx on about 1.3M reads, so I thought I might speed it up by using a smaller database or masking 'nr'. The BLAST manual shows ways to mask low-complexity regions, but is there a way to only look for hits from a certain taxon?

    I have tried to download all the seqs for the taxon I am interested in (family level) via Entrez, but the download from NCBI gets slower and slower until it drops to about 4k/s! I have always had that problem when trying to download a lot from Entrez searches; I guess it must be some kind of restriction.

    Thanks for any help!
    S.

  • #2
    There may be an easier way (which I'd love to hear), but what I normally end up doing is creating a separate species-specific database.

    If you've been having trouble downloading the sequences from NCBI, you can download a GI list (the file should be a lot smaller) and use blastdb_aliastool with the -gilist flag to create a restricted database alias, which you can then blast against.

    See http://www.biostars.org/p/6528/ in the second "update"
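
As a rough illustration of the alias approach described above, the sketch below builds the blastdb_aliastool command line from Python. The file names (bacteria.gi.txt, nr_bacteria) and the helper name build_alias_command are placeholders, not something from this thread; the flags are the usual BLAST+ ones (-gilist, -db, -dbtype, -out, -title).

```python
import subprocess


def build_alias_command(gi_list, source_db, alias_name, dbtype="prot"):
    """Return the blastdb_aliastool argument list for a GI-restricted alias."""
    return [
        "blastdb_aliastool",
        "-gilist", gi_list,     # text file with one GI per line
        "-db", source_db,       # existing database to restrict, e.g. nr
        "-dbtype", dbtype,      # 'prot' for nr, 'nucl' for nt
        "-out", alias_name,     # name of the alias to create
        "-title", alias_name,
    ]


if __name__ == "__main__":
    # Hypothetical file and alias names, for illustration only.
    cmd = build_alias_command("bacteria.gi.txt", "nr", "nr_bacteria")
    print(" ".join(cmd))
    # Uncomment to actually run it (needs BLAST+ on the PATH and a local nr):
    # subprocess.run(cmd, check=True)
```

Once the alias exists, it should be usable like any other database name, e.g. blastx -db nr_bacteria.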


    • #3
      That sounds perfect, I'll try that,

      Thanks,

      s.

      btw, I tried this and always got errors about missing parameters:
      Last edited by susanklein; 03-02-2014, 04:14 PM.


      • #4
        ok, almost there.

        The approach above seems good, but I get missing GI errors:
        'BLAST Database error: BLASTDB alias file creation failed. Some referenced files
        may be missing'

        I guess that I need to update my 'nr' or use remote..


        • #5
          hmm.. no, still getting the error. Very strange.


          • #6
            I might be wrong, but that sounds like an error with nr itself, not with the GI list containing sequences that are not in your database. What happens if you run the following:

            blastdbcmd -db nr -info

            Are you able to blast things normally against nr?
            Last edited by atcghelix; 03-02-2014, 11:27 PM.


            • #7
              Yes, blastx works normally. I'm updating it now (I have a local nr db) but it's very slow atm. I'll update if it works.

              Thanks,

              S.


              • #8
                Cool. If it doesn't work after updating then it'd be helpful to see the blastdb_aliastool command you used.


                • #9
                  Also, you could test to see if the error is due to missing GIs in the database by making another GI list with a single GI that you know should be in nr.


                  • #10
                    blastdbcmd seems ok. Still getting the error though, even after updating nr (ignore the folder names etc.; I changed them because they are sensitive):

                    C:\Users>blastdb_aliastool -gilist C:\sequence.gi.txt -db C:\blastnrDB\nr -out C:\bacDB

                    Converted 944839 GIs from C:\sequence.gi.txt to binary format in C:\bacDB\nr
                    bacDBp.gil
                    BLAST Database error: BLASTDB alias file creation failed. Some referenced files may be missing

                    blastdbcmd:

                    C:\Users\>blastdbcmd -db C:\_VirtualBoxShared\blastnrDB\nr -info
                    Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
                    35,099,569 sequences; 12,354,844,215 total residues

                    Date: Dec 13, 2013 8:13 AM Longest sequence: 41,943 residues

                    Volumes:
                    C:\_VirtualBoxShared\blastnrDB\nr.00
                    C:\_VirtualBoxShared\blastnrDB\nr.01
                    C:\_VirtualBoxShared\blastnrDB\nr.02
                    C:\_VirtualBoxShared\blastnrDB\nr.03
                    C:\_VirtualBoxShared\blastnrDB\nr.04
                    C:\_VirtualBoxShared\blastnrDB\nr.05
                    C:\_VirtualBoxShared\blastnrDB\nr.06
                    C:\_VirtualBoxShared\blastnrDB\nr.07
                    C:\_VirtualBoxShared\blastnrDB\nr.08
                    C:\_VirtualBoxShared\blastnrDB\nr.09
                    C:\_VirtualBoxShared\blastnrDB\nr.10
                    C:\_VirtualBoxShared\blastnrDB\nr.11
                    C:\_VirtualBoxShared\blastnrDB\nr.12
                    C:\_VirtualBoxShared\blastnrDB\nr.13
                    C:\_VirtualBoxShared\blastnrDB\nr.14


                    • #11
                      My other route was to take the list of GIs downloaded from the Entrez web page and use a Python script to download the seqs into a FASTA file. Turns out Entrez limits downloads to 10000 at a time, so I tried 100 at a time and concatenated the results. So I now have the FASTA files to build my db, but now makeblastdb won't work!

                      python code:
                      from Bio import Entrez
                      import sys

                      # usage: entreget.py database_type GI_list.txt output.fasta

                      entrez_db_name = sys.argv[1]  # e.g. 'protein'
                      Entrez.email = 'xxxxxxxxxxxxxxx'
                      inputfile = sys.argv[2]
                      outputfile = sys.argv[3]

                      BATCH_SIZE = 100  # Entrez fetch in batches of 100


                      def fetch_batch(ids):
                          """Fetch one batch of IDs as FASTA text; return '' if the request fails."""
                          try:
                              handle = Entrez.efetch(db=entrez_db_name, id=",".join(ids),
                                                     rettype='fasta', retmode='text')
                              data = handle.read()
                              handle.close()
                              return data
                          except Exception as err:
                              # Report the failed batch instead of silently reusing old data.
                              sys.stderr.write("batch failed: %s\n" % err)
                              return ""


                      ids = []
                      t = 0  # number of IDs requested so far, printed as a progress counter

                      with open(inputfile) as f, open(outputfile, 'w') as g:
                          for line in f:
                              line = line.strip()
                              if line:
                                  ids.append(line)
                              if len(ids) == BATCH_SIZE:
                                  print(t)
                                  t += BATCH_SIZE
                                  g.write(fetch_batch(ids))
                                  ids = []
                          if ids:  # fetch whatever is left over after the last full batch
                              g.write(fetch_batch(ids))
                      //////////////////////

                      makeblastdb error:
                      C:\Users>makeblastdb -in C:\bactDB\bact.fna -input_type 'fasta' -dbtype 'prot'
                      USAGE
                      makeblastdb.exe [-h] [-help] [-in input_file] [-input_type type]
                      -dbtype molecule_type [-title database_title] [-parse_seqids]
                      [-hash_index] [-mask_data mask_data_files] [-gi_mask]
                      [-gi_mask_name gi_based_mask_names] [-out database_name]
                      [-max_file_sz number_of_bytes] [-taxid TaxID] [-taxid_map TaxIDMapFile]
                      [-logfile File_Name] [-version]

                      DESCRIPTION
                      Application to create BLAST databases, version 2.2.28+

                      Use '-help' to print detailed descriptions of command line arguments
                      ========================================================================

                      Error: Too many positional arguments (1), the offending value: ûin


                      • #12
                        A 'test' file with a few hundred GIs in the list for blastdb_aliastool works fine. So the problem is that the Entrez download of GIs contains sequences NOT in the nr database, which causes the problem. I can't see a solution to this, other than perhaps another script to remove the GIs not in nr first.. but this is getting ridiculous!

                        S.
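
The "another script" idea could look something like the sketch below. It assumes the GIs present in the local database have already been dumped to a text file, one per line (for example with blastdbcmd -db nr -entry all -outfmt "%g"); the file names and function names are made up for illustration.

```python
import sys


def read_gis(path):
    """Yield one GI per line, skipping blank lines."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def filter_gis(wanted, present):
    """Keep the wanted GIs that also occur in `present`, in their original order.

    `present` may be any iterable, so a large database dump can be streamed
    instead of being loaded into memory.
    """
    wanted = list(wanted)
    wanted_set = set(wanted)
    found = {gi for gi in present if gi in wanted_set}
    return [gi for gi in wanted if gi in found]


# usage (hypothetical): filter_gis.py entrez_gis.txt nr_gis.txt filtered_gis.txt
if __name__ == "__main__" and len(sys.argv) == 4:
    kept = filter_gis(read_gis(sys.argv[1]), read_gis(sys.argv[2]))
    with open(sys.argv[3], "w") as out:
        out.write("\n".join(kept) + "\n")
    print("kept %d GIs" % len(kept))
```

Only the Entrez list is held in memory; the much larger database dump is read line by line. The filtered list can then be passed to blastdb_aliastool -gilist.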


                        • #13
                          ok.. I got rid of the makeblastdb error:
                          Error: Too many positional arguments (1), the offending value: ûin

                          It was because of 'phoney' dashes '-' in the text, which I stupidly copied from my notes in Word. Think I'll start just using Notepad++ for my notes! :P

                          S.
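
For anyone hitting the same "offending value" error, here is a small sketch of one way to catch word-processor dashes and quotes before pasting a command into a terminal. The function names are invented for this example, and the replacement table only covers the most common substitutions.

```python
# Characters that word processors commonly substitute for plain ASCII ones.
REPLACEMENTS = {
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}


def clean_command(text):
    """Replace typographic dashes and quotes with their ASCII equivalents."""
    return text.translate(str.maketrans(REPLACEMENTS))


def find_suspects(text):
    """Return (position, character) pairs for every non-ASCII character."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


if __name__ == "__main__":
    pasted = "makeblastdb \u2013in bact.fna \u2013dbtype prot"
    print(find_suspects(pasted))
    print(clean_command(pasted))
```

find_suspects is useful on its own: any hit in a command line that should be plain ASCII is worth a second look.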


                          • #14
                            Nice--good sleuthing!

                            I love notepad++. It's what I miss most from Windows.


                            • #15
                              If you want to limit to a specific taxon, you can use the entrez_query parameter, e.g.:

                              Code:
                              blastn -db nt -entrez_query txid5693 -remote -query <(echo CTTTTTTTTCTTTTTTTAAAAGTTTTTGAAAAAAGGAAAAAGAAAAATTTTCTTTAGGTTGGGATGTGATTTTATT)
                              -Keith
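
The same kind of call could be assembled from a script, as in the sketch below. Note that -entrez_query is only accepted together with -remote, so this does not help with a local database. The helper name, the reads.fasta file name and the [ORGN] field tag on the taxon ID are assumptions added for illustration; the post above uses the bare txid5693 form.

```python
import subprocess


def build_remote_blast_command(program, db, query_file, taxid):
    """Return the argument list for a remote BLAST search limited to one taxon."""
    return [
        program,
        "-db", db,
        "-query", query_file,
        "-entrez_query", "txid%d[ORGN]" % taxid,  # restrict hits to this taxon
        "-remote",                                # required by -entrez_query
    ]


if __name__ == "__main__":
    # Hypothetical query file; taxon ID 5693 is the one from the post above.
    cmd = build_remote_blast_command("blastn", "nt", "reads.fasta", 5693)
    print(" ".join(cmd))
    # Uncomment to run it (needs BLAST+ on the PATH and network access):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the square brackets in the Entrez query.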
