Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Getting sequence descriptions into BLAST output

    I am blasting a transcriptome dataset against some vertebrate unigene sets using standalone BLAST+. I want my output in a tab-delimited file so have been using outfmt 6 or 7. However, this only returns the ID (or accession or GI) of the hit - ie something like this: gnl|UG|Gga#S19183375, but what I would really like to have is the description of what this gene actually is - for the entry above it is "Gallus gallus breast cancer 2, early onset (BRCA2), mRNA" (ie the Sequence Definition line in Genbank format).

    The xml file has this information under hit_def, but how can I get this into the tab-delimited output?

  • #2
    Originally posted by Hilary View Post
    The xml file has this information under hit_def, but how can I get this into the tab-delimited output?
    Short answer: you can't.

    The hit-def is simply not one of the available fields to display for any of the tabular formats. If you want to include the definition in a tabular style report you will have to output a full report from blast (e.g. as text or XML) and then parse that report yourself to produce the table you want.

    Comment


    • #3
      If you want to include the definition in a tabular style report you will have to output a full report from blast (e.g. as text or XML) and then parse that report yourself to produce the table you want.
      I am interested in doing such a task. I would like to split a large xml file into smaller ones to be viewable in Excel and to include certain fields (such as hit-def). I am trying to learn Perl from the ground up, let alone BioPerl and meanwhile my data sits. I've seen your code (kmcarr) for parsing xml files into single-hit files (I believe), but have not as of yet figured out how to do what I want. Any hints on commands for BioPerl would be great (and I do know that I need to learn the workings of Perl and BioPerl still).

      Comment


      • #4
        Originally posted by kmcarr View Post
        Short answer: you can't.
        But....I think you can.

        Isn't that what '-m 9' is for in legacy blast? I don't know what the equivalent is in blast+ but I have done this sort of thing before (with blast).

        @Hilary,

        One solution might be to use blastall if you want those descriptions.

        Comment


        • #5
          But....I think you can.

          Isn't that what '-m 9' is for in legacy blast? I don't know what the equivalent is in blast+ but I have done this sort of thing before (with blast).
          No that's not what she is looking for. -m 9 is tabular output with comments (blast+ equivalent is 'outfmt 7') , but the comments aren't gene descriptions for each hit, simply a description before each query stating number of hits and some other info. Here's an example:

          # BLASTX 2.2.25+
          # Query: isotig06251gene=isogroup05465length=1481numContigs=1
          # Database: nr
          # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
          # 54 hits found

          The xml format indeed has hit descriptions (gene name, hypothetical protein, etc.).

          Comment


          • #6
            Originally posted by grassgirl View Post
            No that's not what she is looking for. -m 9 is tabular output with comments (blast+ equivalent is 'outfmt 7') , but the comments aren't gene descriptions for each hit, simply a description before each query stating number of hits and some other info. Here's an example:

            # BLASTX 2.2.25+
            # Query: isotig06251gene=isogroup05465length=1481numContigs=1
            # Database: nr
            # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
            # 54 hits found

            The xml format indeed has hit descriptions (gene name, hypothetical protein, etc.).
            Okay. So, why not just produce the blastxml and parse it.

            Code:
            my $search_in = Bio::SearchIO->new(-format=>'blastxml', -file=> $infile);
            while ( my $result = $search_in->next_result ) {
                 while( my $hit = $result->next_hit ) {
                      print "$hit->description\n";
                  }
            }
            Or, the equivalent in your language of choice.

            Comment


            • #7
              Thanks for the code SES. By the way, does anyone know a good basic book that will help with BioPerl? I have found the wiki pages to be a bit difficult to decipher.

              Comment


              • #8
                Have a look at the HOWTOs (google 'BioPerl HOWTO' and that should get you there) as they are very extensive. If you are just starting out and do not know very much Perl then it is probably going to be extremely difficult to try and understand how BioPerl works. I would say the best thing to do is get an Intro Perl book like 'Learning Perl' from O'Reilly or 'Beginning Perl for Bioinformatics' though the latter is quite dated now (published in 1996) and does not follow modern perl pragmas. Still, both are very useful. You may also find this short course from Ian Korf helpful: http://korflab.ucdavis.edu/Unix_and_Perl/

                Comment


                • #9
                  Thanks, SES. I'm working my way through Learning Perl now...being impatient to understand the BioPerl modules. I did just find the beginning BioPerl HowTo, which is good because simply looking up a particular module's HowTo doesn't really go into some of the basics needed to understand what's going on with a piece of code.

                  Comment


                  • #10
                    Originally posted by grassgirl View Post
                    I am interested in doing such a task. I would like to split a large xml file into smaller ones to be viewable in Excel and to include certain fields (such as hit-def). I am trying to learn Perl from the ground up, let alone BioPerl and meanwhile my data sits. I've seen your code (kmcarr) for parsing xml files into single-hit files (I believe), but have not as of yet figured out how to do what I want. Any hints on commands for BioPerl would be great (and I do know that I need to learn the workings of Perl and BioPerl still).
                    Originally posted by grassgirl View Post
                    Thanks for the code SES. By the way, does anyone know a good basic book that will help with BioPerl? I have found the wiki pages to be a bit difficult to decipher.
                    Hi GG,

                    I concur with SES about the solution an the documentation (here's the URL for the HOWTO if you haven't seen it http://www.bioperl.org/wiki/HOWTO:SearchIO). SES's and the first example on the HOWTO give the skeleton for what you want to do. The basic steps are:

                    1. Create a new SearchIO object to read you input file. One suggestion I would have is that in addition to the -format and -file parameters you can add "-best_hit_only=>'true'" to your new statement. This will only pull in the first (best) hit for each query. If you only want the description for the first hit then this is all you will need.

                    2. Read the next_result from the file. A result corresponds to a query sequence. It contains the data about the query (e.g. name, length etc.)

                    3. Read the next_hit for the current result. Every result may have 0 or more hits. Zero if the query had no matches; one to many depending on how many matches were found (and how many you chose to report when you ran the BLAST job). As stated above if you used the "-best_hit_only" flag then only the first hit will be available.

                    4. Print the desired information for this hit. The information for a hit and the methods to access it are listed in Table 2.2 of the HOWTO.

                    [From the description of your desired output I don't think you need to delve down further into the HSP (High Scoring Pairs) components of each hit. If this is the case you might want to look at the section of the HOWTO "Speed improvements with lightweight objects". Combining this with -best_hit_only can really speed up you code since the amount of data that has to be parsed for each query is significantly reduced.]

                    5. Repeat step #4 if you want information about multiple hits for each query (and remember not to use "-best_hit_only" when you create the reader object).

                    6. Repeat from step #3 for the next query sequence.

                    I understand that BioPerl can be hard to learn but the SearchIO module is a good place to start. The objects and methods are pretty much what the names imply and the data structures are a pretty straightforward hierarchy. Dive in and give it a try.

                    Comment


                    • #11
                      Thanks for the walk-through kmcarr, as well as the input on the HowTo. Diving in.

                      Hopefully this has been a help to Hilary as well.

                      Comment


                      • #12
                        Hi My question is not relavent Here anybody kindly Help. I have written a script using Bioperl module io::SearchIO for parsing blast result.
                        my $searchio = new Bio::SearchIO( -format => 'blast', -file => $blast_report, -best_hit_only => 1 );
                        This is my part of script but stil i am not ablet to get the best hit of every query.

                        Can anybody help me out.

                        With regards,
                        Pawan

                        Comment


                        • #13
                          Hi,
                          My question is not relavent Here anybody kindly Help. I have written a script using Bioperl module io::SearchIO for parsing blast result.
                          my $searchio = new Bio::SearchIO( -format => 'blast', -file => $blast_report, -best_hit_only => 1 );
                          This is my part of script but stil i am not ablet to get the best hit only of every query.

                          Can anybody help me out.

                          With regards,
                          Pawan

                          Comment


                          • #14
                            Hi,

                            I'm new to BioPerl myself, but you may have to put your 1 in quotes for the best_hit_only argument, or try this as kmcarr wrote in the message above instead of the way you wrote it:

                            -best_hit_only => 'true'

                            Comment


                            • #15
                              Getting sequence descriptions into BLAST output



                              Hi Grassgirl,
                              Thanks for reply but it does not work when i try. i wrote the code in this way:
                              my $searchio = new Bio::SearchIO( -format => 'blast', -file => $blast_report, -best_hit_only => 'true');

                              Eventhough,It gives error.

                              With regards,
                              Aeolus

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X