Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • script to calculate percentage identity for BLAT psl

    Hi,
    I am trying to calculate percentage identity for command line BALT output and I am posting the script below. But it is not working right now and the % is always 100%.
    Please let me know the problem. or any other way to caluculate %. This is being followed from BLAT FAQ documentation at http://genome.ucsc.edu/FAQ/FAQblat#blat4

    Code:
    while(<>) { 
            chomp $_; 
            my @v = split(/\t/,$_); 
            get_pid($_); 
    
    } 
    sub get_pid { 
             my @line = @_; 
             my $pid = (100.0 - (&pslCalcMilliBad(@line) * 0.1)); 
             print "The percentage: $pid\n"; 
             #return $pid; 
    } 
    
    sub pslCalcMilliBad { 
             my @cols = @_; 
    
             # sizeNul depens of dna/Prot 
             my $sizeMul = 1; 
             #if ($option{p}) { 
             #        $sizeMul = 3; 
             #} else { 
             #        $sizeMul = 1; 
             #} 
    
             # cols[0]  matches 
             # cols[1]  misMatches 
             # cols[2]  repMaches 
             # cols[4]  qNumInsert 
             # cols[6]  tNumInsert 
             # cols[11] qStart 
             # cols[12] qEnd 
             # cols[15] tStart 
             # cols[16] tEnd 
    
             my $qAliSize = $sizeMul * ($cols[12] - $cols[11]); 
             my $tAliSize = $cols[16] - $cols[15]; 
    
             # I want the minimum of qAliSize and tAliSize 
             my $aliSize; 
             $qAliSize < $tAliSize ? $aliSize = $qAliSize : $aliSize =   
    $tAliSize; 
    
             # return 0 is AliSize == 0 
             return 0 if ($aliSize <= 0); 
    
             # size diff 
             my $sizeDiff = $qAliSize - $tAliSize; 
             if ($sizeDiff < 0) { 
              $sizeDiff = 0; 
                     #if ($option{m}) { 
                             #$sizeDiff = 0; 
                     #} else { 
                             #$sizeDiff = -($sizeDiff); 
                     #} 
             } 
    
             # insert Factor 
             my $insertFactor = $cols[4]; 
             $insertFactor += $cols[6] unless ($option{m}); 
             my $milliBad = (1000 * ($cols[1]*$sizeMul + $insertFactor +   
    &round(3*log( 1 + $sizeDiff)))) / ($sizeMul * ($cols[0] + $cols[2] 
    + $cols[1])); 
                    return $milliBad; 
    
    } 
    
    sub round { 
             my $number = shift; 
             return int($number + .5); 
    }
    The changes I have donw ith respect to the FAQ code are

    Code:
    1
     my $sizeMul = 1; and commented: 
    #if ($option{p}) { 
             #        $sizeMul = 3; 
             #} else { 
             #        $sizeMul = 1; 
             #} 
    2
    if ($sizeDiff < 0) { 
              $sizeDiff = 0; 
                     #if ($option{m}) { 
                             #$sizeDiff = 0; 
                     #} else { 
                             #$sizeDiff = -($sizeDiff); 
                     #} 
             }
    It would be great if there are any other way of calculating % identity. Thanks in advance.

  • #2
    Any other way to calculate? Thanks.

    Comment


    • #3
      Why did you add the "&" in the function call?

      &pslCalcMilliBad(@line)

      Comment


      • #4
        can you try -out=blast9 in blat command-line and use the %identity column to parse? Makes it much simpler that way..
        --
        bioinfosm

        Comment


        • #5
          Originally posted by Chipper View Post
          Why did you add the "&" in the function call?

          &pslCalcMilliBad(@line)
          Why not? It is perfectly legitimate PERL code and, for some people, putting the ampersand makes it obvious that a function is being called.

          Now as far as why the original poster's code is not working. Hum, it is hard to debug without at least some of the data to the original code. Why don't you head or tail the first 5 lines of your input file (the one you get from BLAT.) That way we can make sure that this is simply not a case of GIGO.

          Comment


          • #6
            Yes, Thanks for the info. I finally output the result as blast9 in which the third column is %identity. But then when I compare the results from psl format and blast9 format, the number of output vary. Why is it so? I think in the blast9 format, only the output format changes.

            And hence I wanted to calculate the percentage identity for psl output which is as below:

            Code:
            match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes qStarts   tStarts
                    match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
            ---------------------------------------------------------------------------------------------------------------------------------------------------------------
            94      0       0       0       0       0       0       0       -       query1      180     0       94      chr18   90736837        37867797    37867891        1       94,     86,     37867797,
            85      0       0       0       1       1       0       0       -       query1      180     94      180     chr18   90736837        37867961    37868046        2       24,61,  0,25,   37867961,37867985,
            82      2       0       0       2       2       0       0       -       query1      180     94      180     chr18   90736837        37877482    37877566        3       24,43,17,       0,25,69,        37877482,37877506,37877549,
            61      0       0       0       1       1       0       0       -       query2      100     38      100     chr18   90736837        37868010    37868071        2       36,25,  0,37,   37868010,37868046,
            60      0       0       0       2       2       0       0       -       query2      100     38      100     chr18   90736837        37877531    37877591        3       18,17,25,       0,19,37,        37877531,37877549,37877566,
            Thanks.

            Comment


            • #7
              OK. I believe the problem is in the line that says:

              get_pid($_);

              Change this to:

              get_pid(@v);

              And the line will be passed properly to the other routines. When I do this I get, for the first three data lines:

              The percentage: 100
              The percentage: 96.4705882352941
              The percentage: 91.6666666666667

              Don't know if the numbers above are correct but at least they are not all 100.

              BTW: Writing your perl code with 'use warnings;' and 'use strict;' will tell you if undefined variables are being used. Which is what was happening in this case.

              Comment


              • #8
                Please confirm when you have the results if blast9 and psi actually give you different results from blat!

                I would never expect that; we can check with Jim Kent if thats the case.

                Thanks

                Originally posted by seq_GA View Post
                Yes, Thanks for the info. I finally output the result as blast9 in which the third column is %identity. But then when I compare the results from psl format and blast9 format, the number of output vary. Why is it so? I think in the blast9 format, only the output format changes.

                And hence I wanted to calculate the percentage identity for psl output which is as below:

                Code:
                match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes qStarts   tStarts
                        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
                ---------------------------------------------------------------------------------------------------------------------------------------------------------------
                94      0       0       0       0       0       0       0       -       query1      180     0       94      chr18   90736837        37867797    37867891        1       94,     86,     37867797,
                85      0       0       0       1       1       0       0       -       query1      180     94      180     chr18   90736837        37867961    37868046        2       24,61,  0,25,   37867961,37867985,
                82      2       0       0       2       2       0       0       -       query1      180     94      180     chr18   90736837        37877482    37877566        3       24,43,17,       0,25,69,        37877482,37877506,37877549,
                61      0       0       0       1       1       0       0       -       query2      100     38      100     chr18   90736837        37868010    37868071        2       36,25,  0,37,   37868010,37868046,
                60      0       0       0       2       2       0       0       -       query2      100     38      100     chr18   90736837        37877531    37877591        3       18,17,25,       0,19,37,        37877531,37877549,37877566,
                Thanks.
                --
                bioinfosm

                Comment


                • #9
                  Hi Westerman,
                  Thanks for identifying and now I am able to calculate the pecentage idetity.

                  Hi bioinfosm,

                  I won't say the results are different but the number of output lines are different. Sometimes for one query output in psl format has 2 corresponding blast9 outputs.

                  Yes the number of output lines (wc -l) for psl and blast9 output varies. I actually extracted only the results excluding the commented lines from blast9 output. In the blast9 output I get more results. I am not able to figure out the reason though I believe they output each block which overlap in psl output into separate lines or results. I will check that part and get back to you. When I look at the first line of psl output, it gices as
                  "psLayout Version3". Please let me know if you need furrther details and sorry if I have created any confusion.
                  Regards
                  Last edited by seq_GA; 03-19-2009, 06:59 AM.

                  Comment


                  • #10
                    Hi seq_GA,
                    Thanks for sharing your experience. Did your final code work? Could you paste your code here? This will certainly be of great help for newbies.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    68 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X