Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Parsing BLAST results using BioPerl

    Hi Everyone,

    I'm very much new to the world of sequence data analysis (and this forum!), and have reached a roadblock.

    I have BLASTed some contigs against a series of databases that I created. From this I would like to parse through the data and separate it before extracting the information of interest at a later point. I would like to separate the data by query ID. I found the following Bioperl script;

    Code:
    #!/usr/bin/perl
    
    use Bio::Search::Result::BlastResult;
    use Bio::SearchIO;
     
    my $report = Bio::SearchIO->new( -file=>'All_BCM_results.bls', -format => blast);
    my $result = $report->next_result;
    my %hits_by_query;
    while (my $hit = $result->next_hit) {
      push @{$hits_by_query{$hit->name}}, $hit;
    }
     
    foreach my $qid ( keys %hits_by_query ) {
      my $result = Bio::Search::Result::BlastResult->new();
      $result->add_hit($_) for ( @{$hits_by_query{$qid}} );
      my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format=>'blast' );
      $blio->write_result($result);
    }
    running this script resulted in the following error;

    BlastResult::new(): Not adding iterations.

    ------------- EXCEPTION: Bio::Root::NoSuchThing -------------
    MSG: No such iteration number: 0. Valid range=1-0
    VALUE: The number zero (0)
    STACK: Error::throw
    STACK: Bio::Root::Root::throw /sw/lib/perl5/5.8.8/Bio/Root/Root.pm:368
    STACK: Bio::Search::Result::BlastResult::iteration /sw/lib/perl5/5.8.8/Bio/Search/Result/BlastResult.pm:328
    STACK: Bio::Search::Result::BlastResult::add_hit /sw/lib/perl5/5.8.8/Bio/Search/Result/BlastResult.pm:258
    STACK: /Users/bsaville/Desktop/Parsing_BLAST_by_query.pl:15
    -------------------------------------------------------------

    So I added
    Code:
    my $result = Bio::Search::Result::BlastResult->new(1);
    The 1 to the line shown above, as it told me this was within the valid range. This produced the following error;

    ------------- EXCEPTION: Bio::Root::Exception -------------
    MSG: Must define arrayref of Iterations when initializing a Bio::Search::Result::BlastResult

    STACK: Error::throw
    STACK: Bio::Root::Root::throw /sw/lib/perl5/5.8.8/Bio/Root/Root.pm:368
    STACK: Bio::Search::Result::BlastResult::new /sw/lib/perl5/5.8.8/Bio/Search/Result/BlastResult.pm:128
    STACK: /Users/bsaville/Desktop/Parsing_BLAST_by_query.pl:14
    -----------------------------------------------------------

    I know that it is my inexperience that is causing this problem, but I really can't figure this out.

    If anyone has any better suggestions for parsing BLAST results, I'm open to ideas.

    Thanks
    Ben

  • #2
    You'd probably get a reply quickly on the BioPerl mailing list.

    Update:
    Last edited by maubp; 08-20-2010, 05:48 AM. Reason: Adding link to email on BioPerl mailing list

    Comment


    • #3
      The key to debugging is usually to start with something that works and make subtle changes until it doesn't.

      So either:
      1. find some blast output that works with this script and see how it differs from yours, or
      2. simplify the script to its most basic components and use print statements to verify it is parsing your blast output as expected
      --
      Jeremy Leipzig
      Bioinformatics Programmer
      --
      My blog
      Twitter

      Comment


      • #4
        If anyone has any better suggestions for parsing BLAST results, I'm open to ideas.
        There are somewhere around 12 different Blast output formats. Feeding the wrong type to your script could cause it to blow up.

        My suggestion to newbies who are not doing much blast work is to use the blast format '-m 9' (tabular with comment lines) when doing their blast searches. This will allow you to import the work into a spreadsheet. Or, if you are unix-savvy, use the tools 'cut', 'paste', 'grep', 'uniq', etc. to obtain what you want.

        The default blast output (-m 0 ... pairwise) is almost impossible to parse correctly.

        XML output (-m 7) can be useful if you like XML.

        Comment


        • #5
          I'm at the same institute as Ben - rather than use BioPerl (the Blast writer section has always been a bit of a pain to get working correctly) we hashed together a script that would do what he wanted (splitting a multiblast file into smaller blast output files named by the blast queryname):
          Code:
          #!/usr/bin/perl
          
          use strict;
          use warnings;
          
          open(BLAST,"All_All_results.bls");
          my @data = <BLAST>;
          close BLAST;
          
          my @querylines = ();
          for (my $i=0; $i<@data; $i++) {
              my $line = $data[$i];
              if ($line =~ m/Query=/g) {
                  push(@querylines,$i);
              }
          }
          
          my @queries = ();
          for (my $j=1; $j<@querylines; $j++) {
              my $endline = $querylines[$j];
              $endline--;
              my $startline = $querylines[$j-1];
              my @result = ();
              for (my $k=$startline; $k<=$endline; $k++) {
                  push(@result,$data[$k]);
              }
              push(@queries,[@result]);
          }
          
          my $numqueries = scalar(@queries);
          for (my $m=0; $m<$numqueries; $m++) {
              my $firstline = $queries[$m][0];
              my @linearray = split(' ',$firstline);
              my $name = $linearray[1];
              my @array = @{$queries[$m]};
              open(OUT,">$name.bls");
              foreach my $line (@array) {
                  print OUT $line;
              }   
              close OUT;    
          }
          It's based on the standard Blast+ blastn output.

          I'm sure there are more efficient perl ways of doing it, but I hope that helps anyone else who comes up with the same problem
          Last edited by rglover; 08-20-2010, 07:46 AM.

          Comment


          • #6
            Ben,

            Could you please clarify something for me.

            Is the input BLAST file the result of a single BLAST run? In other words are you just trying to break up a single BLAST report into separate files, one corresponding to each query?

            OR

            Did you first concatenate several BLAST reports together, and were those BLAST reports the result of searching with the same set of query sequences against various target databases? If this is the case is your desire to aggregate all of the results for each query into a single report file?

            Comment


            • #7
              Rachel was a little over generous saying that 'we' wrote the perl file, I just watched really.

              I set about this in completely the wrong way, I had several Multi-FASTA files to be BLASTed against each other. Originally I had 5 db's and 5 FASTA files and I BLASTed them all against each other.

              Then I realised that I could produce 5 larger BLAST reports by combining all of the FASTA sequences into one file. Then just before going to speak with Rachel I realised that I could create a database using my combined FASTA file too. So in the end I produced just the one BLAST report. This was then successfully parsed with the script shown above, and broken down by query ID into files using the query ID as the filename.

              The input blast report was produced using default -blastn settings on NCBI's blast++ command line program.

              Comment


              • #8
                Ben,

                Combining the queries and making a single database is definitely the best way to go. Having a single output file, with each query represented only once, makes the task relatively simple.

                I know you and Rachel solved the problem but since you originally asked about using BioPerl I though I would throw together a demo script.

                Code:
                #!/usr/bin/perl
                
                ###########################
                # bp_simpleSplitBlast.pl
                # 
                # Simple script demonstrating spliting a multi query BLAST report
                # into individual query reports using the Bioperl SearchIO module.
                ###########################
                
                # Standard pragmas.
                
                use strict;
                use warnings;
                
                
                # The SearchIO module is used for both reading and writing the BLAST report files.
                # The Writer::TextResultWriter is used by the output SearchIO object to format the output
                # before writing.
                
                use Bio::SearchIO;
                use Bio::SearchIO::Writer::TextResultWriter;
                
                
                # Input BLAST report filename is passed as the only argument on the command line.
                
                my $fileName = $ARGV[0];
                
                
                # Instantiate a SearchIO object to read the input file.
                # The arguments are the file format and filename.
                
                my $blastReport = Bio::SearchIO->new(-format => 'blast',
                                                     -file => $fileName);
                
                
                # Instantiate a Writer object which will be used to format the output stream.
                # This object will be reused by each output writer.
                
                my $reportWriter = Bio::SearchIO::Writer::TextResultWriter->new();
                
                
                # Parse the input BLAST report one result object at a time.
                # The Bio::Search::Result::ResultI object exactly corresponds to one query,
                # which is the desired output.
                # For each result the query name is read to create unique file names.
                # A new SearchIO object is instantiated for each output,
                # passing the Writer object and file name.
                
                while (my $blastResult = $blastReport->next_result) {
                	my $qid = $blastResult->query_name;
                	my $reportFile = "${qid}.bls";
                	my $reportOutput = Bio::SearchIO->new(-writer => $reportWriter,
                                                             -file => ">$reportFile");
                	$reportOutput->write_result($blastResult);
                }

                Comment


                • #9
                  Thanks Km,

                  I like to think that I'm not the only person who has / will fall at this hurdle and I hope that this thread will help others in the future!

                  I will be sure to try out your Bioperl script next time I'm parsing BLAST output.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 06:37 PM
                  0 responses
                  11 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 06:07 PM
                  0 responses
                  10 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  51 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  67 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X