Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to identify blast hits with only 1 HSP (not limit the number of HSP)

    Hi,

    Is there a way to identify blast hits where there was only one HSP to the database?

    i.e. I don't want to limit the number of HSP's displayed, but I want to identify all queries which only had one HSP to the database to begin with in the blast result.

    Maybe there is an accessory script somewhere that parses a blast report and identify the queries with this property?

    thanks

  • #2
    If I understand what your after correctly then I do the same using this..

    cat BlastOutputFile | grep -B 3 -A 1 "# 1 hits found"

    When BlastOutputFile is in outformat 7. Obviously if your using Xml output or any other format it'll be different.

    Comment


    • #3
      Originally posted by whataBamBam View Post
      If I understand what your after correctly then I do the same using this..

      cat BlastOutputFile | grep -B 3 -A 1 "# 1 hits found"

      When BlastOutputFile is in outformat 7. Obviously if your using Xml output or any other format it'll be different.
      Thanks bambam, but that's not what I'm after.

      Yes that will pull out those with 1 match, but what I'm after is those that have only have one alignment 'section' (i.e. high-scoring segment pair).

      e.g. an isoform could match to only one target in the database, but half of it could be aligning to to the 5' end, then a gap, then the other half the 3' end of the target.

      I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).

      Comment


      • #4
        Originally posted by Kennels View Post

        I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).
        So you mean you want to parse for full-length hsp?

        Comment


        • #5
          Yes and no.
          Also looking for partial alignment, but only those with single hsp.

          Comment


          • #6
            I usually use BioPerl for something like this. The following will get you close (the formatting of the output blast file generated by Bio::SearchIO::Writer::TextResultWriter is slightly different, but basically the same).

            Usage would be:
            perl script.pl --blast inputBlastFile.txt --out reducedBlastFile.txt

            Code:
            #!/usr/bin/perl
            
            use strict;
            use warnings;
            use Getopt::Long;
            use Bio::SearchIO;
            use Bio::SearchIO::Writer::TextResultWriter;
            
            my $blastFile;
            my $outFile;
            
            GetOptions("blast=s" => \$blastFile,
                       "out=s"   => \$outFile) || die "Couldn't get options with GetOpt::Long.\n";
            
            my $blastIO = Bio::SearchIO->new(-file => $blastFile,
                                             -format => 'blast'); #you might need to change this
            
            my $searchIOWriter = Bio::SearchIO::Writer::TextResultWriter->new();
            my $out = Bio::SearchIO->new(-writer => $searchIOWriter,
                                         -file => '>outtest.txt');
            
            while (my $result = $blastIO->next_result) {
                if ($result->num_hits() == 1) {
                    my $hit = $result->next_hit();
                    if ($hit->num_hsps() == 1) {
                        print "A single HSP for a single hit found for " . $result->query_name() . "\n";
                        $out->write_result($result);
                    }
                }
            }

            Comment


            • #7
              Thanks I'll give it a try.

              Comment


              • #8
                if you used the XML output of blast, you could use the following XSLT stylesheet:

                Code:
                <?xml version='1.0'  encoding="ISO-8859-1"?>
                <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' >
                <xsl:output method="text"/>
                <xsl:template match="/">
                <xsl:apply-templates select="BlastOutput/BlastOutput_iterations/Iteration/Iteration_hits/Hit"/>
                </xsl:template>
                <xsl:template match="Hit">
                <xsl:if test="count(Hit_hsps/Hsp) = 1">
                <xsl:value-of select="Hit_id"/><xsl:text>	</xsl:text>
                <xsl:value-of select="Hit_def"/><xsl:text>	</xsl:text>
                <xsl:value-of select="Hit_accession"/><xsl:text>
                </xsl:text>
                </xsl:if>
                </xsl:template>
                </xsl:stylesheet>
                usage:

                Code:
                $ xsltproc --novalid  stylesheet.xsl  blast.xml

                Comment


                • #9
                  Originally posted by Kennels View Post
                  Thanks bambam, but that's not what I'm after.

                  Yes that will pull out those with 1 match, but what I'm after is those that have only have one alignment 'section' (i.e. high-scoring segment pair).

                  e.g. an isoform could match to only one target in the database, but half of it could be aligning to to the 5' end, then a gap, then the other half the 3' end of the target.

                  I'm after hits which aligns without any interruptions, i.e. only 1 HSP. Blastn options don't seem to do what I want (e.g. -ungapped, -culling_limit, -max_hsps_per_subject just mask the information).
                  Ah sorry I get you. Yes that would be really useful. I'll have a look at the script posted and see what that's doing.

                  I think Gmap will do something like that.. it isn't really designed to match pairs of transcript length sequences (it's designed to map a transcript length sequence to a genome) but I actually do use it to map transcripts to ESTs and it seems to work really well. I then filter the output based on length of the alignment and can find the matches where the EST aligned along it's whole length without any significant gaps. Which is probably quite similar to what your trying to do.

                  Possibly I really shouldn't be using Gmap to do this so if someone can tell me that then great because I actually do it all the time.

                  Comment


                  • #10
                    Originally posted by atcghelix View Post
                    I usually use BioPerl for something like this. The following will get you close (the formatting of the output blast file generated by Bio::SearchIO::Writer::TextResultWriter is slightly different, but basically the same).
                    Just letting you know the script did what I wanted.
                    I just modified the script a little:

                    Code:
                    #!/usr/bin/perl
                    
                    use strict;
                    use warnings;
                    use Getopt::Long;
                    use Bio::SearchIO;
                    use Bio::SearchIO::Writer::TextResultWriter;
                    
                    my $blastFile = $ARGV[0];     ## added one argument
                    my $outFile;
                    
                    GetOptions("blast=s" => \$blastFile,
                               "out=s"   => \$outFile) || die "Couldn't get options with GetOpt::Long.\n";
                    
                    my $blastIO = Bio::SearchIO->new(-file => $blastFile,
                                                     -format => 'blasttable'); ### changed from 'blast'
                    
                    my $searchIOWriter = Bio::SearchIO::Writer::TextResultWriter->new();
                    my $out = Bio::SearchIO->new(-writer => $searchIOWriter,
                                                 -file => '>outtest.txt');
                    
                    while (my $result = $blastIO->next_result) {
                        if ($result->num_hits() == 1) {
                            my $hit = $result->next_hit();
                            if ($hit->num_hsps() == 1) {
                                
                    	print $result->query_name() . "\n";   ## just print the result ID
                                $out->write_result($result);
                            }
                        }
                    }
                    Since my output was in -outfmt 6 from blastn, I changed the format to 'blasttable'
                    Also I just wanted the ID of the query for single HSP, so I just modified the stdout and redirect to a file to get my list of IDs.
                    I ran as:

                    Code:
                    parse_blastreport_single_hsp.pl Result.blastn6 > singleHSPs.txt

                    For others, also need to make sure the output has '-max_target_seqs' in blastn is set to more than 1.

                    Many thanks!

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    67 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X