  • #16
    Hi,
    I am using blast+. I have formatted a nucleotide database using makeblastdb.

    I am trying to extract sequences from a file containing a list of IDs.

    Using the following:
    Code:
    /Users/wolniaklab/blast/programs/blastdbcmd -db /Users/wolniaklab/Desktop/search/seqs2 -dbtype nucl -entry_batch /Users/wolniaklab/Desktop/search/ids1.txt -out /Users/wolniaklab/Desktop/search/output.txt
    When I do, I get the same error for every ID I am searching for (here is an example):

    Code:
    Error: >lcl|comp9999_c1_seq11: OID not found
    My list of ids is in this format:
    Code:
    >lcl|comp10021_c0_seq1
    >lcl|comp1002_c0_seq1
    >lcl|comp10045_c0_seq13
    >lcl|comp10045_c0_seq14
    >lcl|comp10045_c0_seq19
    >lcl|comp10045_c0_seq4
    >lcl|comp10045_c0_seq4
    >lcl|comp10049_c0_seq4
    >lcl|comp10075_c0_seq13
    >lcl|comp10075_c0_seq9
    >lcl|comp100777_c0_seq1
    >lcl|comp10082_c0_seq1
    The FASTA file I made the database from looks like this:

    Code:
    >lcl|comp11191_c0_seq1 len=589 path=[0:0-128 613:129-135 136:136-588]
    GTTCTATTGTATTGTTATCCATCTGAGGTTTTCTCTCTGCGTTTGTCTGTGCAGAATCTA
    GTGATCTCCCACAACATGATGTGGCCACCAGGGATGGAACAAAGCTGGTGAGAAGGGCCG
    ATATGGCTCGAAAAATTCCTCAATTCAAGATACTTTGATCCCTGCACCGAGCACCACTTC
    AACAAAAATGAGAAAAACCATTTCTGCATTTGTTGTAATGAAGGTCCTCACTCCCATCAC
    CAAACTCTCCAAGTCCGCCGGGCGTCCCATGCCAACTGTGTCCGGGTCGAAAACATCTCC
    TAGATTCTAGACATTTCTGGAATTCAAACCTACATCATCAACAACCATAAAATTGTCTTC
    CTCCAAAGGCAGGCCAATGTGAAGCAGATCATGTCAAGGTTGTTGATCAGTTCAACAGGA
    GGTCTCCATGTCTCTGCTAATGCCAAGCATTGCCATACCTGTGGAAGAGCTTTGTCCACT
    GATTTAATGAAGTTTTGCTCCATTAAATGCAAGCTTATGCCTACTTCTTTTAATTTTGTT
    TCTAGAATTTGAAACTCATTTTACTAAACTGGTTATATTTTGTTTTTAG
    >lcl|comp10877_c0_seq1 len=1212 path=[3176:0-121 3368:122-148 3395:149-192 3439:193-281 4481:282-332 3578:333-1211]
    AAAGCATGCCTAAGTCGATTTATTATTAATTTATTTAGTCGCTTTATTCTAACTATCCCG
    ACTCAAGCTTAACTAACGGTTCTACTATTCGATTTCCATCTCTAGGTTCGGTTTCTAACT
    CGTCTAACTCCCTCGCCTACGGAATTCATGACTTCGGTCATCGCTAACCTCGGCAACCCT
    CTACGTGAGTTTAGTCACCAACAGTGTCAAGTTCCGTCCAACAGCGTCAACATCCGTCCG
    ACCATCGATATCTATTCATCTCCGTTTAATCTATATCCTACTGTTATTAAACACATTTCC
    TATACTATCATGATGTGTCTTTGGGCTCTAGGGATCATATCTACCCACCTATCTAATCTG
    ATTGGGTCATCACTTATTAATATACTACAGTGAATCAAGGCTCATCTAGCCTATCTGTCC
    TCGGCTTACTATTCCGTCACCCAGAGTACCACCGAACGATGTCGGCCTATCCTCTAATCA
    TCCTATCAATCTACTATCACAAGGTGCATCAATTCTACGTCGTTCTATCCAATCGAATCC
    GGTCCATACCAATCTCAGTAGCTCCGACATTATTGACACTGTTAGGATCCCGTCGGTCAC
    GTCCGTTCGGCTTCACCTTCCCAGCCTTAGTTGCCAGGCCTTAATCTAATCCTAGCTCCT
    TATAATCTATATGGATTCTAGTCATATAACGCTAGGAAGATTAACGACTCCCGCTATTTA
    CTACCCGATCGGTACGTCATCACACTACTGCCAGTGTATTTCTATTGGAAACCCTAACTC
    CATTCTACTATGGTTAAATAAGAGTGGGTTCCTATGGATTAAAGCTCTAGTGTGCTCTTC
    CTATGGTACTCATATCTCCTTCCTAAATTACTTACTCAAACACCTCCTTAAGCCAAATTC
    TAGAGATATAATAAGTCAAATTCTATAGGGGTTTCTAACCAATTTAGTAGATCTATAACT
    TACTTATCCCATAGGTTTCTAACTTACAACTTAGTCCTATAGGGCTTGATTTATTATATA
    CAAGATAACTCACTCTATAAGCTTTGCTCACACATCATCTCACACCAATATATACCAAAA
    TAGCTCTCAAAAGGATTTGACTCAACACCCCTATGGGATATCATCTAAGTCATCTAATTT
    AACTAATATTTCTATTACATGGGCTAGAGTAGGTCTCTTTCAATCAATCATGCACCCATT
    CCAAAAGTCTAG
    >lcl|comp10877_c0_seq2 len=1160 path=[6037:0-34 11677:35-40 11683:41-46 1200:47-73 1227:74-108 1262:109-1159]
    CTCATAGAGAGATTCGTCATCTAGGGAACAATGCAAATGCACACTAAATGAGTTAATTAA
    ACATCCAATTATCACCATTAAGCAAGTCAAAATCAATCTAGAGCATTCCATGTGTATGCA
    TAAGTTGGAAGTTAGAAAACCTTACCTGGAAGCCCTTCTGAGTACCTTAAAAAACTATAA
    AAACTATCTAATCAAGGCAATTAATATAATCTCTAGAATTAATTGTAATTAGAAATCAAG
    CTTAAGTCCTAAATATAAAACTAGGGCAAATATAATTATAAGTTAATCCAAGTCCTTATC
    AAGTCCTAGTGAATCAAATTTTCAGTCAAGCTAAATCCTCAAAATTAAATATGGAATTAT
    GTCAAGGTCAAGGCTTAGTCAGCTTATAATGGTCCTAGGTCTAGTCTAAGTCCTAGGGAA
    AAAAAAGAAAGAAGAAAAAAACTAAAAAAACAAGTCAAAACTCATTATAGTGGAAAAATA
    I have checked a few of my IDs manually and they are indeed in my database. Can anyone tell me what I am doing wrong? Or suggest another approach?



    • #17
      Remove the leading ">" in your identifiers - it is not part of the ID, but a part of the FASTA format...
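
As a minimal sketch of that fix (not taken from the original thread), the ID list can be cleaned before it is passed to -entry_batch. The function name strip_fasta_marker and the file name ids1_clean.txt are hypothetical, and the paths in the example are placeholders.

```shell
# Strip the FASTA ">" marker from each line of an ID list read on stdin.
# Carriage returns are removed as well, in case the list was saved with
# Windows line endings (an assumption; harmless if there are none).
strip_fasta_marker() {
    tr -d '\r' | sed 's/^>//'
}

# Example (placeholder paths):
#   strip_fasta_marker < ids1.txt > ids1_clean.txt
#   blastdbcmd -db seqs2 -dbtype nucl -entry_batch ids1_clean.txt -out output.txt
```

If the IDs are still not found afterwards, it may be worth checking whether the database was built with makeblastdb's -parse_seqids option.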



      • #18
        fastacmd gives errors

        Your response to Anna is almost helpful to me...I have been using perl to extract seqs, but a one-liner, if it works, will be so much more efficient! However, when I tried to use fastacmd, and also blastdbcmd, I got an error for each entry in my query list, like this:

        $ fastacmd -d contigs -i fastacmdtest.txt -o cp_contigs.fa
        [fastacmd] ERROR: Entry "NODE_21_length_493_cov_13.705882" not found
        [fastacmd] ERROR: Entry "NODE_75_length_1153_cov_20.143105" not found
        [fastacmd] ERROR: Entry "NODE_2130_length_836_cov_4756.417480" not found
        [fastacmd] ERROR: Entry "NODE_2409_length_1402_cov_21.002140" not found
        [fastacmd] ERROR: Entry "NODE_2859_length_955_cov_1013.558105" not found

        I know these entries are in my db because I copied them directly from the file from which I created the db in order to test the command. The test file looks like this:

        NODE_21_length_493_cov_13.705882
        NODE_75_length_1153_cov_20.143105
        NODE_2130_length_836_cov_4756.417480
        NODE_2409_length_1402_cov_21.002140
        NODE_2859_length_955_cov_1013.558105

        I just re-read the post above from kmcarr about indexing using makeblastdb. I used formatdb, so does the same issue apply there? Any idea what I'm doing wrong?
        Last edited by Volklor; 02-08-2012, 06:47 PM.
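
One common cause of this error, offered here as an assumption about this case rather than a confirmed diagnosis, is a database built without ID parsing. In the BLAST versions discussed in this thread, fastacmd and blastdbcmd look entries up through an ID index, which formatdb creates only with -o T and makeblastdb only with -parse_seqids. A minimal sketch of rebuilding the database with that option follows. The function name rebuild_db_with_parsed_ids is hypothetical, and the nucleotide database type is assumed because the entries are contigs.

```shell
# Rebuild a nucleotide BLAST database so that sequence IDs are parsed and indexed.
# $1: FASTA file   $2: database name
# Uses makeblastdb (BLAST+) if present, otherwise formatdb (C Toolkit BLAST).
rebuild_db_with_parsed_ids() {
    if command -v makeblastdb >/dev/null 2>&1; then
        makeblastdb -in "$1" -dbtype nucl -parse_seqids -out "$2"
    elif command -v formatdb >/dev/null 2>&1; then
        # -p F: nucleotide input; -o T: parse SeqIds and create indexes
        formatdb -i "$1" -p F -o T -n "$2"
    else
        echo "neither makeblastdb nor formatdb found on PATH" >&2
        return 127
    fi
}

# Example (placeholder names):
#   rebuild_db_with_parsed_ids contigs.fa contigs
#   fastacmd -d contigs -i fastacmdtest.txt -o cp_contigs.fa
```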



        • #19
          Extract contigs

          Originally posted by kmcarr
          Anna,

          You can do this yourself and it would be a good learning exercise, but since you have already made a BLAST database of the contigs, NCBI has kindly provided tools for doing exactly what you want.

          Create a text file of the contig IDs you want to extract, one ID per line, no other information in the file. Be careful to use the same IDs as BLAST for your contigs. We'll call this file "myContigList.txt".

          The command to use depends on whether you are using the old-school (C Toolkit) BLAST or the new BLAST+. These ancillary commands should have been installed when you installed BLAST.

          For old-school BLAST, use the command "fastacmd":

          Code:
          $ fastacmd -d myBlastDBName -p T -i myContigList.txt -o myHitContigs.fasta
          You can omit the '-p T' (T means protein, F means nucleotide) and let the command guess the DB type.

          For the new BLAST+ distribution, use "blastdbcmd":

          Code:
          $ blastdbcmd -db myBlastDBName -dbtype prot -entry_batch myContigList.txt -outfmt %f -out myHitContigs.fasta
          Again you could omit the '-dbtype prot' and let the program guess. The -outfmt %f tells the program to output sequences in FASTA format; you could also omit this since this is the default output format.


          Hi kmcarr,

          I would like to ask a question. I used local BLAST+ to search my own genome sequence with a protein query and also with a nucleotide query. I supplied my genome sequence as the subject instead of building a database with makeblastdb, and the command looks like this:
          $ tblastx -query /home/hazel/Dekstop/heterobasidion.fasta -subject /home/hazel/Dekstop/Gano.fasta -out tblastx_Result.txt -outfmt 1
          My genome sequence contains 4000 contigs, and after the BLAST search 500 of them hit the query. What should I do to extract those 500 contigs from the 4000?
          Thank you so much. =)
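
One possible route, sketched under assumptions rather than taken from the thread: rerun the search with tabular output (-outfmt 6), collect the distinct subject IDs, and pull those records straight from the contig FASTA file, since no BLAST database exists when -subject is used. The function names hit_ids and extract_fasta_by_ids and the file names are hypothetical. The sketch also assumes that column 2 of the tabular output carries the contig ID exactly as it appears as the first word of the FASTA header, which is worth checking on a few lines first.

```shell
# Print the distinct subject IDs (column 2) from tabular BLAST output on stdin.
hit_ids() {
    cut -f2 | sort -u
}

# Print the FASTA records whose ID (first word after ">") is listed in the ID file.
# $1: file with one ID per line   $2: FASTA file
extract_fasta_by_ids() {
    awk '
        FILENAME == ARGV[1] { if (NF) wanted[$1] = 1; next }  # ID list: remember each ID
        /^>/ { keep = (substr($1, 2) in wanted) }             # header: decide for this record
        keep { print }                                        # print header and sequence lines
    ' "$1" "$2"
}

# Example (placeholder file names):
#   tblastx -query heterobasidion.fasta -subject Gano.fasta -outfmt 6 -out hits.tsv
#   hit_ids < hits.tsv > hit_contigs.txt
#   extract_fasta_by_ids hit_contigs.txt Gano.fasta > hit_contigs.fasta
```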



          • #20
            Looking for some help

            Hello all-

            Brand new to this site, but I think this is the right forum to seek help:
            I'm trying to 'extract' nucleotide sequences from my results after running a local blastn against my local database. The output format right now is a standard "blast-looking" result page, but it's not easy to work with the results when I want to further compare sequences. (I have a gene, and will have more genes, of interest. I want to see how they compare to the local database of my sequences, but then I want the results in FASTA format for further analyses and comparisons.) The help manual for local BLAST doesn't seem to have the answers I am looking for.

            Will gladly provide additional information if more is needed. I really thank anyone who is able to help!



            • #21
              [QUOTE=mgallo2;189382]The output format right now is a standard "blast-looking" result page[/QUOTE]

              First, do not output BLAST results in 'standard' format. That format is meant for humans, not computers. I suggest XML format, although another machine-readable format would also be suitable.

              Second, if you do wish to use standard output and are using Perl, then BioPerl would be useful. I am not sure how accurate its parsing of standard BLAST output is, though. Likewise Biopython.
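
As one illustration of a machine-readable alternative (a sketch, not the poster's method), BLAST+ can emit custom tabular columns, and the aligned subject sequences can then be turned into FASTA with awk. The function name tabular_to_fasta and the file names are hypothetical. The sketch assumes the search was run with -outfmt "6 sseqid sstart send sseq", so each line holds the subject ID, the alignment coordinates, and the aligned part of the subject sequence.

```shell
# Convert tab-separated lines of "sseqid sstart send sseq" (stdin) into FASTA.
# Gap characters ("-") that BLAST inserts into aligned sequences are removed.
tabular_to_fasta() {
    awk -F '\t' '
        NF >= 4 {
            seq = $4
            gsub(/-/, "", seq)
            printf ">%s:%s-%s\n%s\n", $1, $2, $3, seq
        }
    '
}

# Example (placeholder names):
#   blastn -query gene.fasta -db myLocalDB -outfmt "6 sseqid sstart send sseq" -out hits.tsv
#   tabular_to_fasta < hits.tsv > hits.fasta
```

This yields only the aligned region of each hit. To retrieve whole database sequences instead, the subject IDs could be written to a list and passed to blastdbcmd with -entry_batch, as described earlier in the thread.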
