SEQanswers

Old 07-12-2012, 06:27 AM   #1
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default Extract only sequence ids from fasta file with makeblastdb

Hi all,
I'm new to BLAST and I'm exploring its functions from the command line.
I already know that to run blastx I first have to index my FASTA database with makeblastdb.
I have already run BLAST to learn how it works, and I would like the output to contain only the sequence code rather than all the information about each sequence (code, description, etc.).
How can I do this? I read somewhere that I have to pass some parameter to the makeblastdb command. Does anyone here know which one?

Thanks to all.
Old 07-12-2012, 10:40 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

When you do a BLAST search (e.g. blastp or blastn), there are several different output formats. The plain text and XML formats include the original FASTA record descriptions, but these are not (currently) available in the tabular output.
http://blastedbio.blogspot.co.uk/201...criptions.html

Is that what you meant?
Old 07-13-2012, 01:35 AM   #3
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default

Yes, that may be useful. I think I might also be able to do it with makeblastdb, because my problem is that I don't want BLAST to use the complete file with all the information for each sequence, only the sequence ID.
So, for example, the command could be this:

makeblastdb -in db.fasta -title db -parse_seqids -gi_mask

What do you think?

Then later I could run blastx with -outfmt "6 qgi sgi"
so that I see only a table of results showing just the GI for the query and the subject sequence.

I'm trying these commands out, since I don't know whether there is a way to inspect the database that makeblastdb has built.
Old 07-13-2012, 02:23 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

I only use -parse_seqids if my FASTA files are labeled in the NCBI style with pipe characters (the vertical bars, |, are called pipes). Otherwise I find it doesn't work very well.
Old 07-13-2012, 02:45 AM   #5
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default

My FASTA file is in the NCBI format, and its headers look like this:

tr|H3ISY8|H3ISY8_STRPU description OrganismType Other params

I want BLAST to use only the first sequence code: H3ISY8

and to show me only that code in the results.

The command I wrote gives me a file containing only "0 0 0". I don't know why.

If I remove -outfmt "6 qgi sgi" and pass only -outfmt "6", it returns a correct table.
I'm continuing to try different parameters.
Old 07-16-2012, 02:54 AM   #6
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default

In the end, I've looked at a lot of parameters and cannot do it. Can I conclude that it is not possible to create the binary database that BLAST uses with only the sequence ID?

And is there also no way to have blastx report only this code in the results, instead of the three parts separated by pipes (|)?
Old 07-16-2012, 10:15 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by angeloulivieri View Post
My FASTA file is in the NCBI format, and its headers look like this:

tr|H3ISY8|H3ISY8_STRPU description OrganismType Other params

I want BLAST to use only the first sequence code: H3ISY8

The simplest way to do that is to make a new FASTA file using that as the ID, and make a BLAST database from that.

Personally I'd use the database as is and process the BLAST output in a script instead.
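
For what it's worth, the "process the BLAST output in a script" route can be quite short. Below is a minimal Python sketch, assuming the default tabular columns (-outfmt 6), where the subject ID is the second column, and assuming IDs shaped like tr|H3ISY8|H3ISY8_STRPU. The function names and the sample row are made up for illustration; they are not part of BLAST.

```python
def short_id(seq_id):
    """Return the middle field of a 'db|ACCESSION|NAME' style ID.

    IDs that do not have at least three pipe-separated parts are
    returned unchanged.
    """
    parts = seq_id.split("|")
    if len(parts) >= 3 and parts[1]:
        return parts[1]
    return seq_id


def shorten_subject_ids(lines, subject_column=1):
    """Yield tabular BLAST lines with the subject ID column shortened.

    subject_column is zero-based; 1 is the subject ID in the default
    -outfmt 6 layout (an assumption, adjust it for custom column lists).
    Blank lines and '#' comment lines are passed through untouched.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) > subject_column:
            fields[subject_column] = short_id(fields[subject_column])
        yield "\t".join(fields)


if __name__ == "__main__":
    # Made-up example row, not real BLAST output.
    sample = ["query1\ttr|H3ISY8|H3ISY8_STRPU\t98.5\t200\n"]
    for row in shorten_subject_ids(sample):
        print(row)
```

To use it on a real result file you would pass an open file handle instead of the sample list, since iterating over a handle yields lines in the same way.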
Old 07-24-2012, 12:40 AM   #8
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default

OK, thanks. Someone told me there is a parameter to pass to makeblastdb, but maybe he's wrong.
Old 07-24-2012, 01:54 AM   #9
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by angeloulivieri View Post
OK, thanks. Someone told me there is a parameter to pass to makeblastdb, but maybe he's wrong.

As mentioned earlier, you might be able to do it via the makeblastdb -parse_seqids option, but that requires your sequence identifiers follow the NCBI naming conventions with the pipe ("|") symbol.

If your FASTA file identifiers are not already in the expected format, you'd have to modify the FASTA file - and in my view in that case you might as well avoid using this option, and simply format the identifiers exactly as you want them.
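
As a rough illustration of "format the identifiers exactly as you want them", here is a minimal Python sketch that rewrites each FASTA header to the bare accession before the file is given to makeblastdb. It assumes each header is a single line starting with ">" and that the ID is the first word, shaped like tr|I1GCL2|I1GCL2_AMPQE. The function name and the short sequence in the demo are hypothetical.

```python
def simplify_fasta_headers(lines):
    """Yield FASTA lines with each header reduced to '>ACCESSION'.

    For a header whose first word looks like 'db|ACCESSION|NAME', only
    the ACCESSION part is kept and the description is dropped. Headers
    in any other shape keep their first word as the ID. Sequence lines
    are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            first_word = words[0] if words else ""
            parts = first_word.split("|")
            if len(parts) >= 3 and parts[1]:
                new_id = parts[1]
            else:
                new_id = first_word
            yield ">" + new_id + "\n"
        else:
            yield line


if __name__ == "__main__":
    # Header taken from this thread; the sequence line is made up.
    demo = [
        ">tr|I1GCL2|I1GCL2_AMPQE Uncharacterized protein OS=Amphimedon queenslandica\n",
        "MKVLA\n",
    ]
    for out_line in simplify_fasta_headers(demo):
        print(out_line, end="")
```

For a real file you would iterate over an open input handle and write the yielded lines to a new FASTA file, then run makeblastdb on that new file. Note that this drops the descriptions entirely, so it only makes sense if you keep them elsewhere.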
Old 07-26-2012, 02:43 AM   #10
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default

Quote:
Originally Posted by maubp View Post
As mentioned earlier, you might be able to do it via the makeblastdb -parse_seqids option, but that requires your sequence identifiers follow the NCBI naming conventions with the pipe ("|") symbol.

If your FASTA file identifiers are not already in the expected format, you'd have to modify the FASTA file - and in my view in that case you might as well avoid using this option, and simply format the identifiers exactly as you want them.

Each sequence in my FASTA file has a header like this:


tr|I1GCL2|I1GCL2_AMPQE Uncharacterized protein OS=Amphimedon queenslandica GN=LOC100637533 PE=4 SV=1


I would like makeblastdb to use only the ID I1GCL2 as the identifier. This matters to me because I want the database to be as lightweight as possible to manage. I have already collected the other information in a separate database.

I used this command:
makeblastdb -in uniprot_kb_2012_06.fasta -title uniprot_kb_2012_06 -parse_seqids

but it doesn't work as I expected: it stores all the information from the header :-(

Last edited by angeloulivieri; 07-26-2012 at 02:53 AM.
Old 07-29-2012, 11:43 PM   #11
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default

Does no one know how to do it?
Old 07-30-2012, 01:58 AM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

You haven't said which output format you are using. The specially formatted identifiers (with the pipe characters) are how BLAST identifies an accession number - which you can ask for explicitly when using the tabular output.
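
To make "ask for explicitly" concrete, here is a small Python sketch that builds a blastx command line requesting the subject accession (the sacc column) in tabular output. The file names and the helper function are hypothetical, and it assumes the database was built with -parse_seqids so that BLAST can recognize the accession. Whether sacc comes out as the bare accession for a given header style is something to check on your own data.

```python
import subprocess


def blastx_tabular_command(query, db, out,
                           columns=("qseqid", "sacc", "pident", "evalue")):
    """Build an argument list for blastx with a custom tabular layout.

    The column names are standard -outfmt 6 specifiers; 'sacc' asks
    for the subject accession instead of the full subject ID.
    """
    outfmt = "6 " + " ".join(columns)
    return ["blastx", "-query", query, "-db", db, "-out", out,
            "-outfmt", outfmt]


def run_blastx(query, db, out):
    """Run blastx; requires the BLAST+ tools to be installed and on PATH."""
    subprocess.run(blastx_tabular_command(query, db, out), check=True)


if __name__ == "__main__":
    # Hypothetical file names; this only prints the command, it does not run it.
    print(" ".join(blastx_tabular_command("query.fasta", "db.fasta", "hits.tsv")))
```

Passing the arguments as a list means the quoted "6 qseqid sacc ..." string reaches blastx as a single argument without any shell quoting issues.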

Last edited by maubp; 07-30-2012 at 02:39 AM. Reason: corrected typo
Old 07-30-2012, 02:35 AM   #13
angeloulivieri
Member
 
Location: Italy

Join Date: Jul 2012
Posts: 30
Default

I know that when I run blastx I can obtain tabular output with only the accession numbers, but that is a different problem. I would like makeblastdb to keep only the accession when it creates its binary database. The reason is that I already have an accession-to-description mapping in a database, so this would reduce the amount of information to manage when I later run blastx. I hope that is clear.

(Maybe something could be done with the formatdb command, but I see that it's an old command.)
Old 07-30-2012, 02:41 AM   #14
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by angeloulivieri View Post
I know that when I run blastx I can obtain tabular output with only the accession numbers, but that is a different problem. I would like makeblastdb to keep only the accession when it creates its binary database. The reason is that I already have an accession-to-description mapping in a database, so this would reduce the amount of information to manage when I later run blastx. I hope that is clear.

(Maybe something could be done with the formatdb command, but I see that it's an old command.)

The old 'legacy' BLAST suite had the commands 'formatdb' and 'blastall', but in the new BLAST+ suite these are replaced by 'makeblastdb' and, for running BLAST, by separate tools such as 'blastp', 'blastn', etc.

Anything you could do with 'formatdb' would (I hope) be supported in 'makeblastdb'.
All times are GMT -8. The time now is 11:26 PM.

