SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BLAST+ creating custom blast database and using blast+ filtering features deniz Bioinformatics 3 07-07-2019 08:04 AM
Standalone BLAST output MattN Bioinformatics 8 02-01-2015 11:17 PM
Standalone Blast output jomaco Bioinformatics 1 01-31-2012 07:18 AM
convert blast output rururara Bioinformatics 1 04-08-2011 12:48 AM
BLAST output parsing in R? rdu Bioinformatics 3 01-25-2011 06:25 AM

Reply
 
Thread Tools
Old 09-12-2011, 04:52 PM   #1
Hilary
Junior Member
 
Location: New Zealand

Join Date: May 2009
Posts: 3
Default Getting sequence descriptions into BLAST output

I am blasting a transcriptome dataset against some vertebrate unigene sets using standalone BLAST+. I want my output in a tab-delimited file so have been using outfmt 6 or 7. However, this only returns the ID (or accession or GI) of the hit - ie something like this: gnl|UG|Gga#S19183375, but what I would really like to have is the description of what this gene actually is - for the entry above it is "Gallus gallus breast cancer 2, early onset (BRCA2), mRNA" (ie the Sequence Definition line in Genbank format).

The xml file has this information under hit_def, but how can I get this into the tab-delimited output?
Hilary is offline   Reply With Quote
Old 09-13-2011, 05:29 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by Hilary View Post
The xml file has this information under hit_def, but how can I get this into the tab-delimited output?
Short answer: you can't.

The hit-def is simply not one of the available fields to display for any of the tabular formats. If you want to include the definition in a tabular style report you will have to output a full report from blast (e.g. as text or XML) and then parse that report yourself to produce the table you want.
kmcarr is offline   Reply With Quote
Old 09-13-2011, 03:05 PM   #3
grassgirl
Member
 
Location: Oregon, USA

Join Date: Mar 2011
Posts: 51
Default

Quote:
If you want to include the definition in a tabular style report you will have to output a full report from blast (e.g. as text or XML) and then parse that report yourself to produce the table you want.
I am interested in doing such a task. I would like to split a large xml file into smaller ones to be viewable in Excel and to include certain fields (such as hit-def). I am trying to learn Perl from the ground up, let alone BioPerl and meanwhile my data sits. I've seen your code (kmcarr) for parsing xml files into single-hit files (I believe), but have not as of yet figured out how to do what I want. Any hints on commands for BioPerl would be great (and I do know that I need to learn the workings of Perl and BioPerl still).
grassgirl is offline   Reply With Quote
Old 09-13-2011, 03:44 PM   #4
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by kmcarr View Post
Short answer: you can't.
But....I think you can.

Isn't that what '-m 9' is for in legacy blast? I don't know what the equivalent is in blast+ but I have done this sort of thing before (with blast).

@Hilary,

One solution might be to use blastall if you want those descriptions.
SES is offline   Reply With Quote
Old 09-13-2011, 04:21 PM   #5
grassgirl
Member
 
Location: Oregon, USA

Join Date: Mar 2011
Posts: 51
Default

Quote:
But....I think you can.

Isn't that what '-m 9' is for in legacy blast? I don't know what the equivalent is in blast+ but I have done this sort of thing before (with blast).
No that's not what she is looking for. -m 9 is tabular output with comments (blast+ equivalent is 'outfmt 7') , but the comments aren't gene descriptions for each hit, simply a description before each query stating number of hits and some other info. Here's an example:

# BLASTX 2.2.25+
# Query: isotig06251gene=isogroup05465length=1481numContigs=1
# Database: nr
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 54 hits found

The xml format indeed has hit descriptions (gene name, hypothetical protein, etc.).
grassgirl is offline   Reply With Quote
Old 09-13-2011, 04:36 PM   #6
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by grassgirl View Post
No that's not what she is looking for. -m 9 is tabular output with comments (blast+ equivalent is 'outfmt 7') , but the comments aren't gene descriptions for each hit, simply a description before each query stating number of hits and some other info. Here's an example:

# BLASTX 2.2.25+
# Query: isotig06251gene=isogroup05465length=1481numContigs=1
# Database: nr
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 54 hits found

The xml format indeed has hit descriptions (gene name, hypothetical protein, etc.).
Okay. So, why not just produce the blastxml and parse it.

Code:
my $search_in = Bio::SearchIO->new(-format=>'blastxml', -file=> $infile);
while ( my $result = $search_in->next_result ) {
     while( my $hit = $result->next_hit ) {
          print "$hit->description\n";
      }
}
Or, the equivalent in your language of choice.
SES is offline   Reply With Quote
Old 09-14-2011, 09:24 AM   #7
grassgirl
Member
 
Location: Oregon, USA

Join Date: Mar 2011
Posts: 51
Default

Thanks for the code SES. By the way, does anyone know a good basic book that will help with BioPerl? I have found the wiki pages to be a bit difficult to decipher.
grassgirl is offline   Reply With Quote
Old 09-14-2011, 09:40 AM   #8
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Have a look at the HOWTOs (google 'BioPerl HOWTO' and that should get you there) as they are very extensive. If you are just starting out and do not know very much Perl then it is probably going to be extremely difficult to try and understand how BioPerl works. I would say the best thing to do is get an Intro Perl book like 'Learning Perl' from O'Reilly or 'Beginning Perl for Bioinformatics' though the latter is quite dated now (published in 1996) and does not follow modern perl pragmas. Still, both are very useful. You may also find this short course from Ian Korf helpful: http://korflab.ucdavis.edu/Unix_and_Perl/
SES is offline   Reply With Quote
Old 09-14-2011, 10:07 AM   #9
grassgirl
Member
 
Location: Oregon, USA

Join Date: Mar 2011
Posts: 51
Default

Thanks, SES. I'm working my way through Learning Perl now...being impatient to understand the BioPerl modules. I did just find the beginning BioPerl HowTo, which is good because simply looking up a particular module's HowTo doesn't really go into some of the basics needed to understand what's going on with a piece of code.
grassgirl is offline   Reply With Quote
Old 09-14-2011, 10:11 AM   #10
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by grassgirl View Post
I am interested in doing such a task. I would like to split a large xml file into smaller ones to be viewable in Excel and to include certain fields (such as hit-def). I am trying to learn Perl from the ground up, let alone BioPerl and meanwhile my data sits. I've seen your code (kmcarr) for parsing xml files into single-hit files (I believe), but have not as of yet figured out how to do what I want. Any hints on commands for BioPerl would be great (and I do know that I need to learn the workings of Perl and BioPerl still).
Quote:
Originally Posted by grassgirl View Post
Thanks for the code SES. By the way, does anyone know a good basic book that will help with BioPerl? I have found the wiki pages to be a bit difficult to decipher.
Hi GG,

I concur with SES about the solution an the documentation (here's the URL for the HOWTO if you haven't seen it http://www.bioperl.org/wiki/HOWTO:SearchIO). SES's and the first example on the HOWTO give the skeleton for what you want to do. The basic steps are:

1. Create a new SearchIO object to read you input file. One suggestion I would have is that in addition to the -format and -file parameters you can add "-best_hit_only=>'true'" to your new statement. This will only pull in the first (best) hit for each query. If you only want the description for the first hit then this is all you will need.

2. Read the next_result from the file. A result corresponds to a query sequence. It contains the data about the query (e.g. name, length etc.)

3. Read the next_hit for the current result. Every result may have 0 or more hits. Zero if the query had no matches; one to many depending on how many matches were found (and how many you chose to report when you ran the BLAST job). As stated above if you used the "-best_hit_only" flag then only the first hit will be available.

4. Print the desired information for this hit. The information for a hit and the methods to access it are listed in Table 2.2 of the HOWTO.

[From the description of your desired output I don't think you need to delve down further into the HSP (High Scoring Pairs) components of each hit. If this is the case you might want to look at the section of the HOWTO "Speed improvements with lightweight objects". Combining this with -best_hit_only can really speed up you code since the amount of data that has to be parsed for each query is significantly reduced.]

5. Repeat step #4 if you want information about multiple hits for each query (and remember not to use "-best_hit_only" when you create the reader object).

6. Repeat from step #3 for the next query sequence.

I understand that BioPerl can be hard to learn but the SearchIO module is a good place to start. The objects and methods are pretty much what the names imply and the data structures are a pretty straightforward hierarchy. Dive in and give it a try.
kmcarr is offline   Reply With Quote
Old 09-14-2011, 10:39 AM   #11
grassgirl
Member
 
Location: Oregon, USA

Join Date: Mar 2011
Posts: 51
Default

Thanks for the walk-through kmcarr, as well as the input on the HowTo. Diving in.

Hopefully this has been a help to Hilary as well.
grassgirl is offline   Reply With Quote
Old 01-25-2012, 12:13 AM   #12
Aeolus Huios
Junior Member
 
Location: hyderabad, India

Join Date: Nov 2011
Posts: 9
Default

Hi My question is not relavent Here anybody kindly Help. I have written a script using Bioperl module io::SearchIO for parsing blast result.
my $searchio = new Bio::SearchIO( -format => 'blast', -file => $blast_report, -best_hit_only => 1 );
This is my part of script but stil i am not ablet to get the best hit of every query.

Can anybody help me out.

With regards,
Pawan
Aeolus Huios is offline   Reply With Quote
Old 01-25-2012, 12:14 AM   #13
Aeolus Huios
Junior Member
 
Location: hyderabad, India

Join Date: Nov 2011
Posts: 9
Default

Hi,
My question is not relavent Here anybody kindly Help. I have written a script using Bioperl module io::SearchIO for parsing blast result.
my $searchio = new Bio::SearchIO( -format => 'blast', -file => $blast_report, -best_hit_only => 1 );
This is my part of script but stil i am not ablet to get the best hit only of every query.

Can anybody help me out.

With regards,
Pawan
Aeolus Huios is offline   Reply With Quote
Old 01-25-2012, 04:38 PM   #14
grassgirl
Member
 
Location: Oregon, USA

Join Date: Mar 2011
Posts: 51
Default

Hi,

I'm new to BioPerl myself, but you may have to put your 1 in quotes for the best_hit_only argument, or try this as kmcarr wrote in the message above instead of the way you wrote it:

-best_hit_only => 'true'
grassgirl is offline   Reply With Quote
Old 01-25-2012, 08:04 PM   #15
Aeolus Huios
Junior Member
 
Location: hyderabad, India

Join Date: Nov 2011
Posts: 9
Unhappy Getting sequence descriptions into BLAST output



Hi Grassgirl,
Thanks for reply but it does not work when i try. i wrote the code in this way:
my $searchio = new Bio::SearchIO( -format => 'blast', -file => $blast_report, -best_hit_only => 'true');

Eventhough,It gives error.

With regards,
Aeolus
Aeolus Huios is offline   Reply With Quote
Old 01-26-2012, 08:12 AM   #16
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by Aeolus Huios View Post


Hi Grassgirl,
Thanks for reply but it does not work when i try. i wrote the code in this way:
my $searchio = new Bio::SearchIO( -format => 'blast', -file => $blast_report, -best_hit_only => 'true');

Eventhough,It gives error.

With regards,
Aeolus
Hi Aeolus or Pawan (or whoever you are),

Please have a look at this document. You have posted this question on a couple of threads here and you have asked the same thing in several posts to the BioPerl listserv. People have tried to help you but you have still not shown us any of your code or what error you are talking about. There is a lot of BioPerl documentation on this topic (try google) and old listserv conversations which you can search (here for example). Please realize that the solution to your problem is not to post the same question over and over again on different listservs or websites, but to heed the advice which you have been given (show us your code, describe the errors, etc.). Also, it would be better to follow up on the BioPerl listserv since it will be easier for others to google for the conversation later.
SES is offline   Reply With Quote
Old 01-26-2012, 09:28 AM   #17
grassgirl
Member
 
Location: Oregon, USA

Join Date: Mar 2011
Posts: 51
Default

SES took the words right out of my mouth. Without all of your code and information on the errors you are getting, it is difficult to help you beyond telling you what I did. For instance, did you read the next_result and next_hit, did you print the results? You may even be missing a BioPerl module, but without all the information no one will know. The BioPerl listserve folks are very helpful, but they will be adamant about including all your code and errors exactly as they come out.

Here's a link to the BioPerl-l listserve Info Page. You can email your question to them, the address is on this page. Also, take a look at the archives to get an idea of how folks are posting their questions:

http://bioperl.org/mailman/listinfo/bioperl-l
grassgirl is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:25 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO