Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • script to calculate A,T,G,C frequency for each position in an alingment

    Dear all,
    I am new in programming. I wrote a script to calculate the A,T,G,C frequency for each position in an alignment. Take the following alignment for example, in the position 1 the A frequency is 33.3% and G freqeuncy is 66.6%.

    Code:
    ATGCATGC
    GGTCACGT
    GTAGCGTA
    This script is based on others' suggestions, using hash of hash. However, I am in trouble to debug it. Would you please help to indicate where are my errors? THANK YOU VERY MUCH.

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    open SEQ, $ARGV[0];
    my %hash = 0;
    while (<SEQ>){
    chomp;
    my @seq = split (/ /, $_);
    	for (my $i=0; $i<=$#seq; $i++){
    	$hash {$i}{$seq[$i]} +=1;
    	}
    	}
    close SEQ;
    	foreach my $posi (keys %hash){
    		foreach my $nt ('A', 'T', 'G', 'C'){
    	print "$posi", "$nt", %hash{$posi}{$nt}/4, "\n";
    		}
    	}

  • #2
    Okay, I'll just standardise your code a little to make it a bit easier for me to understand:
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my %hash = (); # initialise hash
    while (<>){ # read a line from standard input (or files specified on command-line)
      chomp; # delete end of line character (if any)
      my @seq = split (/ /, $_); # split line into array, delimiting by a single space
      for (my $i=0; $i<=$#seq; $i++){ # for each index of the array
        $hash {$i}{$seq[$i]} +=1; # increment the value in the hash array at position (i, <base>)
      }
    }
    
    foreach my $posi (keys %hash){ # for the keys in the first hash element (indexes)
      foreach my $nt ('A', 'T', 'G', 'C'){ # for the bases A/T/G/C
        print "$posi", "$nt", %hash{$posi}{$nt}/4, "\n"; # print the location, then the base, then the value at (<loc>,<base>) divided by 4
      }
    }
    Given that you've specified use warnings and use strict, Perl should tell you what you've done wrong syntax-wise. What errors are you getting when you run the code? Alternatively, why do you think it's not working?

    I see three potential issues:
    • You're dividing by 4 rather than the number of lines read
    • The hash value does not necessarily exist when printing
    • You're using %hash{a}{b} in the print, rather than $hash{a}{b} [Perl should warn you about this and suggest the correct use]


    Here's my suggested modification based on theory (i.e. untested):

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my %hash = (); # initialise hash
    my $lineCount = 0; # initialise line count
    while (<>){ # read a line from standard input (or files specified on command-line)
      $lineCount++; # increment lines read
      chomp; # delete end of line character (if any)
      my @seq = split (/ /, $_); # split line into array, delimiting by a single space
      for (my $i=0; $i<=$#seq; $i++){ # for each index of the array
        $hash{$i}{$seq[$i]} +=1; # increment the value in the hash array at position (i, <base>)
      }
    }
    
    foreach my $posi (sort {$a <=> $b} (keys %hash)){ # sort the keys numerically
      foreach my $nt ('A', 'T', 'G', 'C'){ # for the bases A/T/G/C
        print("$posi", "$nt", ($hash{$posi}{$nt})/($lineCount), "\n") if(defined($hash{$posi}{$nt}));
      }
    }
    Yes, there are other optimisations that can be done, but your code was good enough in terms of readability and I don't think you need to worry over that for this script.
    Last edited by gringer; 12-02-2013, 12:40 PM. Reason: corrected last if statement

    Comment


    • #3
      Dear gringer,

      thank you very much for your detailed corrections. I am new in perl programming, especially the perl grammer. My script is based on others' advices, but i had troubles in debugging.

      Your script make me better understand perl. However, when adding a "(" before "if", i have not got any output. Would you check it please? THANKS.

      my input file is as follows:
      ATGCACTGACTGTATGACTG
      ATGGTGACTGTGACTGACTG
      ATGGACCATGACTGCATGTG
      ATCCACTGTGACGTGCAACA

      Comment


      • #4
        The last line of your script lacks a "(". After adding it, i found no error message, but i did not get any output. Could you check it please? THANKS.

        Comment


        • #5
          I changed it to 'defined'. If there's still no output, it suggests that the hash isn't being referenced appropriately, in which case you can explicitly iterate through the second-level keys:

          Code:
          foreach my $posi (sort {$a <=> $b} (keys %hash)){ # sort the keys numerically
            foreach my $nt (keys %{$hash{$posi}}){ # for the bases A/T/G/C
              print("$posi", "$nt", ($hash{$posi}{$nt})/($lineCount), "\n");
            }
          }
          If still no output, the hash probably isn't being created at all, which is a little odd....

          Comment


          • #6
            Hi, gringer, there still exists problems. My input file is:

            ATGCACTGACTGTATGACTG
            ATGGTGACTGTGACTGACTG
            ATGGACCATGACTGCATGTG
            ATCCACTGTGACGTGCAACA

            my output file is:
            0.25CACTGACTGTATGACTG
            0.25GACCATGACTGCATGTG
            0.25CACTGTGACGTGCAACA
            0.25GTGACTGTGACTGACTG

            When you have time please check again. Anyway i thank you for your help. I really learned somthing from your script and believe it is very very close. I need to stick in perl and continue learning something about hash in perl. THANKS!

            Comment


            • #7
              Originally posted by pony2001mx View Post
              Hi, gringer, there still exists problems. My input file is:

              ATGCACTGACTGTATGACTG
              ATGGTGACTGTGACTGACTG
              ATGGACCATGACTGCATGTG
              ATCCACTGTGACGTGCAACA

              my output file is:
              0.25CACTGACTGTATGACTG
              0.25GACCATGACTGCATGTG
              0.25CACTGTGACGTGCAACA
              0.25GTGACTGTGACTGACTG

              When you have time please check again. Anyway i thank you for your help. I really learned somthing from your script and believe it is very very close. I need to stick in perl and continue learning something about hash in perl. THANKS!
              From your output, I'm pretty sure the problem is in this statement:
              PHP Code:
              my @seq split (//, $_); #DO NOT put any space in between "//" 

              Comment


              • #8
                Yes, sorry, I forgot about the space separator. That demonstrates why example input/output and a "this is wrong because..." explanation makes the whole problem solving process much easier.

                Comment


                • #9
                  Thank you both. This script works well. Personally I will continue learning perl, which is powerful for genomic analysis. THANKS.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 06:37 PM
                  0 responses
                  7 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 06:07 PM
                  0 responses
                  7 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  49 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  66 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X