Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • script to calculate percentage identity for BLAT psl

    Hi,
    I am trying to calculate percentage identity for command line BALT output and I am posting the script below. But it is not working right now and the % is always 100%.
    Please let me know the problem. or any other way to caluculate %. This is being followed from BLAT FAQ documentation at http://genome.ucsc.edu/FAQ/FAQblat#blat4

    Code:
    while(<>) { 
            chomp $_; 
            my @v = split(/\t/,$_); 
            get_pid($_); 
    
    } 
    sub get_pid { 
             my @line = @_; 
             my $pid = (100.0 - (&pslCalcMilliBad(@line) * 0.1)); 
             print "The percentage: $pid\n"; 
             #return $pid; 
    } 
    
    sub pslCalcMilliBad { 
             my @cols = @_; 
    
             # sizeNul depens of dna/Prot 
             my $sizeMul = 1; 
             #if ($option{p}) { 
             #        $sizeMul = 3; 
             #} else { 
             #        $sizeMul = 1; 
             #} 
    
             # cols[0]  matches 
             # cols[1]  misMatches 
             # cols[2]  repMaches 
             # cols[4]  qNumInsert 
             # cols[6]  tNumInsert 
             # cols[11] qStart 
             # cols[12] qEnd 
             # cols[15] tStart 
             # cols[16] tEnd 
    
             my $qAliSize = $sizeMul * ($cols[12] - $cols[11]); 
             my $tAliSize = $cols[16] - $cols[15]; 
    
             # I want the minimum of qAliSize and tAliSize 
             my $aliSize; 
             $qAliSize < $tAliSize ? $aliSize = $qAliSize : $aliSize =   
    $tAliSize; 
    
             # return 0 is AliSize == 0 
             return 0 if ($aliSize <= 0); 
    
             # size diff 
             my $sizeDiff = $qAliSize - $tAliSize; 
             if ($sizeDiff < 0) { 
              $sizeDiff = 0; 
                     #if ($option{m}) { 
                             #$sizeDiff = 0; 
                     #} else { 
                             #$sizeDiff = -($sizeDiff); 
                     #} 
             } 
    
             # insert Factor 
             my $insertFactor = $cols[4]; 
             $insertFactor += $cols[6] unless ($option{m}); 
             my $milliBad = (1000 * ($cols[1]*$sizeMul + $insertFactor +   
    &round(3*log( 1 + $sizeDiff)))) / ($sizeMul * ($cols[0] + $cols[2] 
    + $cols[1])); 
                    return $milliBad; 
    
    } 
    
    sub round { 
             my $number = shift; 
             return int($number + .5); 
    }
    The changes I have donw ith respect to the FAQ code are

    Code:
    1
     my $sizeMul = 1; and commented: 
    #if ($option{p}) { 
             #        $sizeMul = 3; 
             #} else { 
             #        $sizeMul = 1; 
             #} 
    2
    if ($sizeDiff < 0) { 
              $sizeDiff = 0; 
                     #if ($option{m}) { 
                             #$sizeDiff = 0; 
                     #} else { 
                             #$sizeDiff = -($sizeDiff); 
                     #} 
             }
    It would be great if there are any other way of calculating % identity. Thanks in advance.

  • #2
    Any other way to calculate? Thanks.

    Comment


    • #3
      Why did you add the "&" in the function call?

      &pslCalcMilliBad(@line)

      Comment


      • #4
        can you try -out=blast9 in blat command-line and use the %identity column to parse? Makes it much simpler that way..
        --
        bioinfosm

        Comment


        • #5
          Originally posted by Chipper View Post
          Why did you add the "&" in the function call?

          &pslCalcMilliBad(@line)
          Why not? It is perfectly legitimate PERL code and, for some people, putting the ampersand makes it obvious that a function is being called.

          Now as far as why the original poster's code is not working. Hum, it is hard to debug without at least some of the data to the original code. Why don't you head or tail the first 5 lines of your input file (the one you get from BLAT.) That way we can make sure that this is simply not a case of GIGO.

          Comment


          • #6
            Yes, Thanks for the info. I finally output the result as blast9 in which the third column is %identity. But then when I compare the results from psl format and blast9 format, the number of output vary. Why is it so? I think in the blast9 format, only the output format changes.

            And hence I wanted to calculate the percentage identity for psl output which is as below:

            Code:
            match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes qStarts   tStarts
                    match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
            ---------------------------------------------------------------------------------------------------------------------------------------------------------------
            94      0       0       0       0       0       0       0       -       query1      180     0       94      chr18   90736837        37867797    37867891        1       94,     86,     37867797,
            85      0       0       0       1       1       0       0       -       query1      180     94      180     chr18   90736837        37867961    37868046        2       24,61,  0,25,   37867961,37867985,
            82      2       0       0       2       2       0       0       -       query1      180     94      180     chr18   90736837        37877482    37877566        3       24,43,17,       0,25,69,        37877482,37877506,37877549,
            61      0       0       0       1       1       0       0       -       query2      100     38      100     chr18   90736837        37868010    37868071        2       36,25,  0,37,   37868010,37868046,
            60      0       0       0       2       2       0       0       -       query2      100     38      100     chr18   90736837        37877531    37877591        3       18,17,25,       0,19,37,        37877531,37877549,37877566,
            Thanks.

            Comment


            • #7
              OK. I believe the problem is in the line that says:

              get_pid($_);

              Change this to:

              get_pid(@v);

              And the line will be passed properly to the other routines. When I do this I get, for the first three data lines:

              The percentage: 100
              The percentage: 96.4705882352941
              The percentage: 91.6666666666667

              Don't know if the numbers above are correct but at least they are not all 100.

              BTW: Writing your perl code with 'use warnings;' and 'use strict;' will tell you if undefined variables are being used. Which is what was happening in this case.

              Comment


              • #8
                Please confirm when you have the results if blast9 and psi actually give you different results from blat!

                I would never expect that; we can check with Jim Kent if thats the case.

                Thanks

                Originally posted by seq_GA View Post
                Yes, Thanks for the info. I finally output the result as blast9 in which the third column is %identity. But then when I compare the results from psl format and blast9 format, the number of output vary. Why is it so? I think in the blast9 format, only the output format changes.

                And hence I wanted to calculate the percentage identity for psl output which is as below:

                Code:
                match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes qStarts   tStarts
                        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
                ---------------------------------------------------------------------------------------------------------------------------------------------------------------
                94      0       0       0       0       0       0       0       -       query1      180     0       94      chr18   90736837        37867797    37867891        1       94,     86,     37867797,
                85      0       0       0       1       1       0       0       -       query1      180     94      180     chr18   90736837        37867961    37868046        2       24,61,  0,25,   37867961,37867985,
                82      2       0       0       2       2       0       0       -       query1      180     94      180     chr18   90736837        37877482    37877566        3       24,43,17,       0,25,69,        37877482,37877506,37877549,
                61      0       0       0       1       1       0       0       -       query2      100     38      100     chr18   90736837        37868010    37868071        2       36,25,  0,37,   37868010,37868046,
                60      0       0       0       2       2       0       0       -       query2      100     38      100     chr18   90736837        37877531    37877591        3       18,17,25,       0,19,37,        37877531,37877549,37877566,
                Thanks.
                --
                bioinfosm

                Comment


                • #9
                  Hi Westerman,
                  Thanks for identifying and now I am able to calculate the pecentage idetity.

                  Hi bioinfosm,

                  I won't say the results are different but the number of output lines are different. Sometimes for one query output in psl format has 2 corresponding blast9 outputs.

                  Yes the number of output lines (wc -l) for psl and blast9 output varies. I actually extracted only the results excluding the commented lines from blast9 output. In the blast9 output I get more results. I am not able to figure out the reason though I believe they output each block which overlap in psl output into separate lines or results. I will check that part and get back to you. When I look at the first line of psl output, it gices as
                  "psLayout Version3". Please let me know if you need furrther details and sorry if I have created any confusion.
                  Regards
                  Last edited by seq_GA; 03-19-2009, 06:59 AM.

                  Comment


                  • #10
                    Hi seq_GA,
                    Thanks for sharing your experience. Did your final code work? Could you paste your code here? This will certainly be of great help for newbies.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Advancing Precision Medicine for Rare Diseases in Children
                      by seqadmin




                      Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                      12-16-2024, 07:57 AM
                    • seqadmin
                      Recent Advances in Sequencing Technologies
                      by seqadmin



                      Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                      Long-Read Sequencing
                      Long-read sequencing has seen remarkable advancements,...
                      12-02-2024, 01:49 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 12-17-2024, 10:28 AM
                    0 responses
                    23 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-13-2024, 08:24 AM
                    0 responses
                    42 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-12-2024, 07:41 AM
                    0 responses
                    28 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-11-2024, 07:45 AM
                    0 responses
                    42 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X