
**tboothby** · 01-18-2012, 11:20 AM

Hi,
I am using blast+. I have formatted a nucleotide database using makeblastdb.

I am trying to extract sequences from a file containing a list of IDs.

Using the following:

Code:

/Users/wolniaklab/blast/programs/blastdbcmd -db /Users/wolniaklab/Desktop/search/seqs2 -dbtype nucl -entry_batch /Users/wolniaklab/Desktop/search/ids1.txt -out /Users/wolniaklab/Desktop/search/output.txt

When I do, I get the same error for every ID I am searching for (here is an example):

Code:

Error: >lcl|comp9999_c1_seq11: OID not found

My list of ids is in this format:

Code:

>lcl|comp10021_c0_seq1
>lcl|comp1002_c0_seq1
>lcl|comp10045_c0_seq13
>lcl|comp10045_c0_seq14
>lcl|comp10045_c0_seq19
>lcl|comp10045_c0_seq4
>lcl|comp10045_c0_seq4
>lcl|comp10049_c0_seq4
>lcl|comp10075_c0_seq13
>lcl|comp10075_c0_seq9
>lcl|comp100777_c0_seq1
>lcl|comp10082_c0_seq1

The FASTA file I made the database from looks like this:

Code:

>lcl|comp11191_c0_seq1 len=589 path=[0:0-128 613:129-135 136:136-588]
GTTCTATTGTATTGTTATCCATCTGAGGTTTTCTCTCTGCGTTTGTCTGTGCAGAATCTA
GTGATCTCCCACAACATGATGTGGCCACCAGGGATGGAACAAAGCTGGTGAGAAGGGCCG
ATATGGCTCGAAAAATTCCTCAATTCAAGATACTTTGATCCCTGCACCGAGCACCACTTC
AACAAAAATGAGAAAAACCATTTCTGCATTTGTTGTAATGAAGGTCCTCACTCCCATCAC
CAAACTCTCCAAGTCCGCCGGGCGTCCCATGCCAACTGTGTCCGGGTCGAAAACATCTCC
TAGATTCTAGACATTTCTGGAATTCAAACCTACATCATCAACAACCATAAAATTGTCTTC
CTCCAAAGGCAGGCCAATGTGAAGCAGATCATGTCAAGGTTGTTGATCAGTTCAACAGGA
GGTCTCCATGTCTCTGCTAATGCCAAGCATTGCCATACCTGTGGAAGAGCTTTGTCCACT
GATTTAATGAAGTTTTGCTCCATTAAATGCAAGCTTATGCCTACTTCTTTTAATTTTGTT
TCTAGAATTTGAAACTCATTTTACTAAACTGGTTATATTTTGTTTTTAG
>lcl|comp10877_c0_seq1 len=1212 path=[3176:0-121 3368:122-148 3395:149-192 3439:193-281 4481:282-332 3578:333-1211]
AAAGCATGCCTAAGTCGATTTATTATTAATTTATTTAGTCGCTTTATTCTAACTATCCCG
ACTCAAGCTTAACTAACGGTTCTACTATTCGATTTCCATCTCTAGGTTCGGTTTCTAACT
CGTCTAACTCCCTCGCCTACGGAATTCATGACTTCGGTCATCGCTAACCTCGGCAACCCT
CTACGTGAGTTTAGTCACCAACAGTGTCAAGTTCCGTCCAACAGCGTCAACATCCGTCCG
ACCATCGATATCTATTCATCTCCGTTTAATCTATATCCTACTGTTATTAAACACATTTCC
TATACTATCATGATGTGTCTTTGGGCTCTAGGGATCATATCTACCCACCTATCTAATCTG
ATTGGGTCATCACTTATTAATATACTACAGTGAATCAAGGCTCATCTAGCCTATCTGTCC
TCGGCTTACTATTCCGTCACCCAGAGTACCACCGAACGATGTCGGCCTATCCTCTAATCA
TCCTATCAATCTACTATCACAAGGTGCATCAATTCTACGTCGTTCTATCCAATCGAATCC
GGTCCATACCAATCTCAGTAGCTCCGACATTATTGACACTGTTAGGATCCCGTCGGTCAC
GTCCGTTCGGCTTCACCTTCCCAGCCTTAGTTGCCAGGCCTTAATCTAATCCTAGCTCCT
TATAATCTATATGGATTCTAGTCATATAACGCTAGGAAGATTAACGACTCCCGCTATTTA
CTACCCGATCGGTACGTCATCACACTACTGCCAGTGTATTTCTATTGGAAACCCTAACTC
CATTCTACTATGGTTAAATAAGAGTGGGTTCCTATGGATTAAAGCTCTAGTGTGCTCTTC
CTATGGTACTCATATCTCCTTCCTAAATTACTTACTCAAACACCTCCTTAAGCCAAATTC
TAGAGATATAATAAGTCAAATTCTATAGGGGTTTCTAACCAATTTAGTAGATCTATAACT
TACTTATCCCATAGGTTTCTAACTTACAACTTAGTCCTATAGGGCTTGATTTATTATATA
CAAGATAACTCACTCTATAAGCTTTGCTCACACATCATCTCACACCAATATATACCAAAA
TAGCTCTCAAAAGGATTTGACTCAACACCCCTATGGGATATCATCTAAGTCATCTAATTT
AACTAATATTTCTATTACATGGGCTAGAGTAGGTCTCTTTCAATCAATCATGCACCCATT
CCAAAAGTCTAG
>lcl|comp10877_c0_seq2 len=1160 path=[6037:0-34 11677:35-40 11683:41-46 1200:47-73 1227:74-108 1262:109-1159]
CTCATAGAGAGATTCGTCATCTAGGGAACAATGCAAATGCACACTAAATGAGTTAATTAA
ACATCCAATTATCACCATTAAGCAAGTCAAAATCAATCTAGAGCATTCCATGTGTATGCA
TAAGTTGGAAGTTAGAAAACCTTACCTGGAAGCCCTTCTGAGTACCTTAAAAAACTATAA
AAACTATCTAATCAAGGCAATTAATATAATCTCTAGAATTAATTGTAATTAGAAATCAAG
CTTAAGTCCTAAATATAAAACTAGGGCAAATATAATTATAAGTTAATCCAAGTCCTTATC
AAGTCCTAGTGAATCAAATTTTCAGTCAAGCTAAATCCTCAAAATTAAATATGGAATTAT
GTCAAGGTCAAGGCTTAGTCAGCTTATAATGGTCCTAGGTCTAGTCTAAGTCCTAGGGAA
AAAAAAGAAAGAAGAAAAAAACTAAAAAAACAAGTCAAAACTCATTATAGTGGAAAAATA

I have checked a few of my IDs manually and they are indeed in my database. Can anyone tell me what I am doing wrong? Or suggest another approach?

**arvid** · 01-18-2012, 11:17 PM

Remove the leading ">" in your identifiers - it is not part of the ID, but a part of the FASTA format...
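
A minimal sketch of that fix, assuming the list is the ids1.txt from the first post. The sample IDs are copied from that post; the file name ids1.clean.txt and the scratch directory are made up for illustration. It strips the leading ">" with sed, and sort -u also drops the duplicate entry that appears in the list:

```shell
# Work in a scratch directory so nothing real is overwritten (illustrative only).
cd "$(mktemp -d)"

# A few IDs in the format shown in the first post, including the repeated one.
printf '%s\n' \
  '>lcl|comp10021_c0_seq1' \
  '>lcl|comp1002_c0_seq1' \
  '>lcl|comp10045_c0_seq4' \
  '>lcl|comp10045_c0_seq4' > ids1.txt

# Remove the FASTA ">" marker from the start of each line, then de-duplicate.
sed 's/^>//' ids1.txt | sort -u > ids1.clean.txt
cat ids1.clean.txt

# The cleaned list would then be passed to blastdbcmd, e.g. (not run here):
# blastdbcmd -db seqs2 -dbtype nucl -entry_batch ids1.clean.txt -out output.txt
```

If the lookup still fails after this, one thing worth checking (an assumption about this setup, not something confirmed in the thread) is whether the database was built with makeblastdb's -parse_seqids option, which is what indexes the original identifiers.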

**Volklor** · 02-08-2012, 06:28 PM

fastacmd gives errors

Your response to Anna is almost helpful to me...I have been using perl to extract seqs, but a one-liner, if it works, will be so much more efficient! However, when I tried to use fastacmd, and also blastdbcmd, I got an error for each entry in my query list, like this:

$ fastacmd -d contigs -i fastacmdtest.txt -o cp_contigs.fa
[fastacmd] ERROR: Entry "NODE_21_length_493_cov_13.705882" not found
[fastacmd] ERROR: Entry "NODE_75_length_1153_cov_20.143105" not found
[fastacmd] ERROR: Entry "NODE_2130_length_836_cov_4756.417480" not found
[fastacmd] ERROR: Entry "NODE_2409_length_1402_cov_21.002140" not found
[fastacmd] ERROR: Entry "NODE_2859_length_955_cov_1013.558105" not found

I know these entries are in my db because I copied them directly from the file from which I created the db in order to test the command. The test file looks like this:

NODE_21_length_493_cov_13.705882
NODE_75_length_1153_cov_20.143105
NODE_2130_length_836_cov_4756.417480
NODE_2409_length_1402_cov_21.002140
NODE_2859_length_955_cov_1013.558105

I just re-read the post above from kmcarr about indexing using makeblastdb. I used formatdb, so does the same issue apply there? Any idea what I'm doing wrong?
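
If the database lookup keeps failing, one database-independent alternative is to pull the records straight from the FASTA file the database was built from. This is a rough awk sketch, not a replacement for fastacmd. It assumes one ID per line with no leading ">", and that the ID is the first whitespace-delimited word of each header. The file names, the sequences, and the NODE_22 record are made up for illustration; only the NODE_21 and NODE_75 IDs come from the post above:

```shell
# Scratch directory with a tiny made-up FASTA file and ID list.
cd "$(mktemp -d)"

printf '%s\n' \
  '>NODE_21_length_493_cov_13.705882' 'ACGTACGT' 'TTGA' \
  '>NODE_22_length_100_cov_2.000000' 'GGGGCCCC' \
  '>NODE_75_length_1153_cov_20.143105' 'ATATAT' > contigs.fa

printf '%s\n' \
  'NODE_21_length_493_cov_13.705882' \
  'NODE_75_length_1153_cov_20.143105' > fastacmdtest.txt

# First file: remember every wanted ID (blank lines are skipped).
# Second file: on each header, decide whether to keep the record;
# header and sequence lines are printed while "keep" is true.
# Caveat: an empty ID list would make awk treat the FASTA as the list.
awk 'NR == FNR { if (NF) want[$1]; next }
     /^>/      { id = substr($1, 2); keep = (id in want) }
     keep' fastacmdtest.txt contigs.fa > cp_contigs.fa

cat cp_contigs.fa
```

On the formatdb question: formatdb only parses sequence IDs and builds the lookup index when it is run with `-o T`. Whether that is what went wrong here is not confirmed anywhere in the thread.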

**Hazel_Tan** · 11-06-2014, 06:16 AM

Extract contigs

Originally posted by kmcarr:

Anna,

You can do this yourself and it would be a good learning exercise, but since you have already made a BLAST database of the contigs, NCBI has kindly provided tools for doing exactly what you want.

Create a text file of the contig IDs you want to extract, one ID per line, no other information in the file. Be careful to use the same IDs as BLAST for your contigs. We'll call this file "myContigList.txt".

The command to use depends on whether you are using the old school (C-Toolkit) BLAST or the new BLAST+. These ancillary commands should have been installed when you installed BLAST.

For old school BLAST, use the command "fastacmd"

Code:

$ fastacmd -d myBlastDBName -p protein -i myContigList.txt -o myHitContigs.fasta

You can omit the '-p protein' and let the command guess the DB type.

For the new BLAST+ distribution use "blastdbcmd"

Code:

$ blastdbcmd -db myBlastDBName -dbtype prot -entry_batch myContigList.txt -outfmt %f -out myHitContigs.fasta

Again you could omit the '-dbtype prot' and let the program guess. The -outfmt %f tells the program to output sequences in FASTA format; you could also omit this since this is the default output format.

Hi kmcarr,

I would like to ask a question. I used local BLAST+ to search my own genome sequence with a protein query and also a nucleotide query. I passed my genome sequence as the subject instead of building a database with makeblastdb, and the command is like this:

$ tblastx -query /home/hazel/Dekstop/heterobasidion.fasta -subject /home/hazel/Dekstop/Gano.fasta -out tblastx_Result.txt -outfmt 1

My own genome sequence contains 4000 contigs. After the BLAST search, 500 contigs hit the query. What should I do to extract those 500 contigs out of the 4000 contigs?
Thank you so much. =)
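
One possible way to get the list of hit contigs, sketched under assumptions: the search is rerun with tabular output (-outfmt 6) instead of -outfmt 1, so that the subject ID sits in column 2 of every hit line. The file name tblastx_Result.tsv and all rows below are made up for illustration. The result is the one-ID-per-line file that kmcarr's quoted instructions call myContigList.txt:

```shell
# Scratch directory with a made-up tabular result file (12 default columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore).
cd "$(mktemp -d)"

printf 'q1\tcontig_7\t88.5\t120\t10\t0\t1\t360\t400\t41\t2e-30\t110\n'  > tblastx_Result.tsv
printf 'q1\tcontig_31\t75.0\t80\t20\t0\t10\t249\t5\t244\t1e-12\t60\n'  >> tblastx_Result.tsv
printf 'q2\tcontig_7\t91.2\t95\t8\t0\t3\t287\t900\t616\t4e-25\t98\n'    >> tblastx_Result.tsv

# Column 2 is the subject (contig) ID; sort -u keeps each contig once
# even when it is hit several times.
cut -f2 tblastx_Result.tsv | sort -u > myContigList.txt

cat myContigList.txt
```

The resulting list can then be used to pull the matching records out of the contig file, either with blastdbcmd -entry_batch (which needs a database built with makeblastdb) or with a small script over the FASTA file.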

**mgallo2** · 02-16-2016, 11:55 AM

Looking for some help

Hello all-

Brand new to this site, but think this is the right forum to seek help:
I'm trying to 'extract' nucleotide sequences from my results after running a local blastn against my local database. The output format right now is a standard "blast-looking" result page, but it's not easy to work with the results when I want to further compare sequences. (I have a gene- and will have genes- of interest. I want to see how they compare to the local database of my sequences, but then I want the results in FASTA format for further analyses and comparisons). The help manual for local blast doesn't seem to have the answers I am looking for.

Will gladly provide additional information if more is needed. I really thank anyone who is able to help!

**westerman** · 02-17-2016, 08:29 AM

[QUOTE=mgallo2;189382]The output format right now is a standard "blast-looking" result page[/QUOTE]

First, do not output blast results in 'standard' format. Said format is for humans and not computers. I suggest XML format although another one would be suitable.

Second, if you do wish to use standard output and are using Perl then bioperl would be useful. I am not sure how accurate the parsing is for standard blast output though. Likewise Biopython.
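
As one illustration of a machine-readable route, here is a sketch that uses the tabular format with custom columns rather than the XML suggested above. It assumes the search was run with something like -outfmt "6 qseqid sseqid sstart send sseq", where sseq is the aligned part of the subject sequence. The file names and every row below are made up; the awk step turns each hit into a FASTA record and strips alignment gap characters:

```shell
# Scratch directory with a made-up five-column result file:
# qseqid, sseqid, sstart, send, sseq (tab-separated).
cd "$(mktemp -d)"

printf 'geneA\tcontig_12\t101\t112\tACGTAC-GTACGT\n'  > hits.tsv
printf 'geneA\tcontig_40\t5\t12\tTTGACCGA\n'         >> hits.tsv

# One FASTA record per hit: the header carries the subject ID and coordinates,
# the sequence is the aligned subject segment with "-" gap characters removed.
awk -F '\t' '{
  seq = $5
  gsub(/-/, "", seq)
  printf ">%s:%s-%s\n%s\n", $2, $3, $4, seq
}' hits.tsv > hits.fa

cat hits.fa
```

This only recovers the aligned segment of each hit. To get whole database sequences instead, the subject IDs in column 2 could be collected into a list and passed to blastdbcmd -entry_batch.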
