
  • Array of arrays in perl

    Hi everyone,

    I'm trying to write a perl programme that will split up a .fasta header:

    gi|4140243|dbj|AB022087.1|_Xenopus_laevis_mRNA_for_cytochrome_P450,_complete_cds,_clone_MC1

    Into its constituent parts:
    gi
    4140243
    dbj
    AB022087.1
    _Xenopus_laevis_mRNA_for_cytochrome_P450,_complete_cds,_clone_MC1

    This is easily achieved using:

    my @hits = split('\|', $hits);

    my ($gi, $number, $gb, $id, $name) = '';
    foreach (@hits) {
    $gi.= "$hits[0]\n";
    $number .= "$hits[1]\n";
    $gb .= "$hits[2]\n";
    $id .= "$hits[3]\n";
    $name .= "$hits[4]\n";
    }

    my @gi = split('\n', $gi);
    my @number = split('\n', $number);
    my @gb = split('\n', $gb);
    my @id = split('\n', $id);
    my @name = split('\n', $name);


    Now each part of each header (contained in $hits) is an element in an individual array. What I want to do next is print back element[0] for each array so that I can reproduce the original $hits (which I then want to modify).
    I'm unsure as to whether this will require a hash of hashes or array of arrays.

    I'm fairly new to perl so any suggestions would be very helpful.

    I'm also aware that the above might not be the slickest way of achieving what I want - again any comments would be great!

    Many thanks,

    N

  • #2
    Hey, I amn't a Perl superstar but have been using it a while and I think the hash of arrays is good for this: basically you want one key to represent each array of elements, those arrays being the component parts of a fasta header. The hash is something you should learn, it's a nice system =)

    So a hash of arrays is:
    @{$hash{$key}}

    So for your example:

    Code:
    #line to split: gi|4140243|dbj|AB022087.1|_Xenopus_laevis_mRNA_for_cytochrome_P450,_complete_cds,_clone_MC1
    
    #initialise the hash:
    my %hash;
    
    #the names you want to call your hash keys as an array:
    
    my @names = ("gi", "number", "gb", "id", "name");
    
    #split header
    
    my @sp = split(/\|/, $line);
    
    #use a for loop to index each element. These are useful!
    
    my $length = @sp;
    for (my $i=0; $i < $length; $i++){
       chomp $sp[$i]; #chomp
       chomp $names[$i]; #chomp
       push(@{$hash{$names[$i]}},$sp[$i]); #use push because its an array, but push $sp[$i] onto the %hash, with key=$names[$i].
    }
    
    #then to return the element you desire you iterate over the hash and use a match for your names:
    
    foreach my $keys (sort keys %hash){ #iterate over key names
       chomp $keys;
       for (my $i=0; $i < @{$hash{$keys}}; $i++){
          if ($keys eq $names[1]){ #amend this to match any of your elements in @names to get what you want out
             print "$hash{$keys}[$i]\n";
          }
       }
    }



    • #3
      I *am* a perl expert and both of your codes are giving me a headache. bruce01 -- I think you are making this way more complex than nr23 wanted. It seems to me that nr23 just wanted each individual part of the fasta header for a *single* header. You, bruce01, seem to be wanting to save *all* parts of *all* headers.

      Going to what I think nr23 wants, this code should be simplified to:

      chomp $hits;
      my ($gi, $number, $gb, $id, $name) = split('\|', $hits);

      You can then work with the individual variables as you want and eventually put them back together via:

      my $s = join '|', ($gi, $number, $gb, $id, $name);

      print $s . "\n"; # Or whatever you want to do with it.

      Of course, this being perl, there is more than one way to do things ... but let's try the simple method first, eh?



      • #4
        Originally posted by bruce01
        Code:
        
        
        #then to return the element you desire you iterate over the hash and use a match for your names:
        
        foreach my $keys (sort keys %hash){ #iterate over key names
           chomp $keys;
           for (my $i=0; $i < @{$hash{$keys}}; $i++){
              if ($keys eq $names[1]){ #amend this to match any of your elements in @names to get what you want out
                 print "$hash{$keys}[$i]\n";
              }
           }
        }
        The last part makes more sense if you have the if statement before the for loop:

        Code:
        foreach my $keys (sort keys %hash){ #iterate over key names
           chomp $keys;
           if ($keys eq $names[1]){ #amend this to match any of your elements in @names to get what you want out
              for (my $i=0; $i < @{$hash{$keys}}; $i++){
                 print "$hash{$keys}[$i]\n";
              }
           }
        }
        maria



        • #5
          As for bruce01's code: aside from not doing what I think nr23 wants, it does redundant work and is vulnerable to out-of-bounds conditions. Looking at this section:

          Code:
          my $length = @sp;
          for (my $i=0; $i < $length; $i++){
             chomp $sp[$i]; #chomp
             chomp $names[$i]; #chomp
             push(@{$hash{$names[$i]}},$sp[$i]); #use push because its an array, but push $sp[$i] onto the %hash, with key=$names[$i].
          }

          First, there is no reason to chomp @names over and over. In fact there is no reason to do it at all, since there are no newlines in @names.

          Second, what happens if there are more entries in @sp than in @names? Answer: you'll end up using undef as a hash key. Not good.

          I would re-write this as follows, assuming that $hits (and thus @sp) was already chomped:

          for (my $i=0; $i <= $#names; $i++) {
          push @{$hash{$names[$i]}}, $sp[$i];
          }

          Because the loop runs over @names, a key is never undef. If @sp is shorter than @names, the missing entries are simply pushed as undef, which is ok.

          Once again, more than one way to do things.
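
For anyone following along, here is a minimal sketch of the hash-of-arrays approach with the loop bounded by @names. The sub names build_hash_of_arrays and rebuild_header are made up for illustration, and the sketch assumes the five field names used in this thread and Perl 5.10 or later (for the // operator). Any extra "|" in a header is left inside the last field.

```perl
use strict;
use warnings;

# Field names, in header order (taken from the thread).
my @names = qw(gi number gb id name);

# Split each header into at most scalar(@names) fields and push every
# field onto the array stored under its name.
sub build_hash_of_arrays {
    my @headers = @_;
    my %hash;
    for my $header (@headers) {
        chomp(my $line = $header);
        my @sp = split /\|/, $line, scalar @names;
        for my $i (0 .. $#names) {
            push @{ $hash{ $names[$i] } }, $sp[$i];
        }
    }
    return \%hash;
}

# Rebuild header number $n by taking element $n from every array.
# Missing (undef) fields are written back as empty strings.
sub rebuild_header {
    my ($hash, $n) = @_;
    return join '|', map { $hash->{$_}[$n] // '' } @names;
}

my $hash = build_hash_of_arrays(
    'gi|258677212|gb|FJ791250.1|_Xenopus_laevis_centromere_protein_C_(CENPC1)_mRNA,_complete_cds',
    'gi|255926670|gb|GQ370808.1|_Xenopus_laevis_nephrosis_2_(Nphs2)_mRNA,_partial_cds',
);
print rebuild_header($hash, 0), "\n";
```

The same index $n picks the matching element out of every array, which is what makes it possible to put a given header back together.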



          • #6
            Thanks guys for the suggestions.

            Westerman - sorry I should have been clearer: I actually DO want to isolate all parts of all headers so that I can for example modify (or delete) individual elements.

            I'm sure that my code is amateurish at best - if you could provide suggestions to my input I would be grateful!

            N



            • #7
              The question then becomes, "what form do you want your data structure to be?". Since you want all parts for all headers (and are then presumably working on what would be called 'slices' of the data -- e.g., all 'gb' entries at a time) with the ability to put back together a given header, then you'll need an array-of-arrays, a hash-of-hashes or, as bruce01 suggested, a hash-of-arrays. Given that you have only 5 parts you could even get by with just 5 separate arrays. I suspect that the latter method would be the most simple, although perhaps not as elegant, way to do the processing. In which case you were almost there except for where to place your array initialization and the unneeded chomps.

              --------------

              my @gi = ();
              my @number = ();
              my @gb = ();
              my @id = ();
              my @name = ();

              # Loop around somehow getting $hit for each header line

              chomp $hit;
              my @splitted = split /\|/, $hit;
              push @gi, $splitted[0];
              push @number, $splitted[1];
              push @gb, $splitted[2];
              push @id, $splitted[3];
              push @name, $splitted[4];

              # End of loop

              At the end you have 5 arrays with your parts in order. bruce01's method produces a single hash holding the 5 arrays but that seems too complex to me.



              • #8
                I would vote for bruce01's suggestion. Although a hash-of-arrays might seem more complex to initially create, it allows you to have all of the data in a single structure.

                This makes it easier to expand; if you needed a new field it becomes a new hash key, with no need to declare a new array. Also, if you ever need to send this data to a subroutine, you only need to pass a single reference to the entire hash (rather than 5 separate array references).
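
A rough sketch of both points, assuming a hash of arrays keyed by field name. The subs count_headers and add_field, the 'length' field and its values are all made up for illustration.

```perl
use strict;
use warnings;

# Return the number of headers stored, given one reference to the whole hash.
# Assumes every field array has the same length, so any key will do.
sub count_headers {
    my ($hash_ref) = @_;
    my ($first_key) = keys %$hash_ref;
    return defined $first_key ? scalar @{ $hash_ref->{$first_key} } : 0;
}

# Adding a field later is just a new key; no new array has to be declared.
sub add_field {
    my ($hash_ref, $field, @values) = @_;
    $hash_ref->{$field} = [@values];
    return;
}

my %hash = (
    gi => ['gi', 'gi'],
    id => ['FJ791250.1', 'GQ370808.1'],
);
add_field(\%hash, 'length', 100, 200);    # hypothetical extra field
print count_headers(\%hash), "\n";
```

With five separate arrays, each sub would need five array references instead of the single \%hash.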



                • #9
                  Great - thanks.

                  What I really want to do is change the field I've called 'id' for each header. I would do this by looping through each element in the id array, incrementing a counter (++$some_counter), appending it to the id, and reassembling each header, so that this:

                  gi|258677212|gb|FJ791250.1|_Xenopus_laevis_centromere_protein_C_(CENPC1)_mRNA,_complete_cds
                  gi|255926670|gb|GQ370808.1|_Xenopus_laevis_nephrosis_2_(Nphs2)_mRNA,_partial_cds

                  Would be transformed to this:

                  gi|258677212|gb|FJ791250.1:1|_Xenopus_laevis_centromere_protein_C_(CENPC1)_mRNA,_complete_cds
                  gi|255926670|gb|GQ370808.1:2|_Xenopus_laevis_nephrosis_2_(Nphs2)_mRNA,_partial_cds

                  Assuming those are the 1st and 2nd headers. Having 5 separate arrays (as suggested by Westerman) was my initial thought, but I like the idea of getting my head around hashes so will follow bruce01's suggestion.

                  I understand the 1st part:

                  for (my $i=0; $i <= $#names; $i++) {
                  push @{$hash{$names[$i]}}, $split[$i];
                  }

                  but am unsure about the 2nd:

                  foreach my $keys (sort keys %hash){
                  chomp $keys;
                  if ($keys eq $names[1]){
                  for (my $i=0; $i < @{$hash{$keys}}; $i++){
                  print "$hash{$keys}[$i]\n";
                  }
                  }
                  }

                  How do the keys get sorted in the proper order ("gi", "number", "gb", "id", "name") as opposed to randomly?

                  Why a 'for' loop rather than foreach @{$hash{$keys}} ?

                  Thanks!

                  N



                  • #10
                    Originally posted by nr23

                    but am unsure about the 2nd:

                    foreach my $keys (sort keys %hash){
                    chomp $keys;
                    if ($keys eq $names[1]){
                    for (my $i=0; $i < @{$hash{$keys}}; $i++){
                    print "$hash{$keys}[$i]\n";
                    }
                    }
                    }

                    How do the keys get sorted in the proper order ("gi", "number", "gb", "id", "name") as opposed to randomly?

                    Why a 'for' loop rather than foreach @{$hash{$keys}} ?
                    The hash keys get sorted by the sort command,
                    Code:
                    (sort keys %hash)
                    but they will be sorted in alphabetical order, rather than the order you want.
                    Without sort, the keys come back in no guaranteed order.
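
A small sketch of the difference, assuming the five field names from this thread as keys (the empty arrays are placeholders):

```perl
use strict;
use warnings;

my @names = qw(gi number gb id name);
my %hash  = map { $_ => [] } @names;

# Alphabetical order, which is not the header order.
my @sorted = sort keys %hash;
print "@sorted\n";     # gb gi id name number

# Header order: walk @names and use each name as the key.
my @ordered = grep { exists $hash{$_} } @names;
print "@ordered\n";    # gi number gb id name
```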

                    To get the keys that you want, instead of looping through the keys, you could do

                    Code:
                    # loop through all the different headers
                    for (my $i = 0; $i < @{$hash{$names[0]}}; $i++){
                        # loop through the names array using foreach or for
                        # this will go through the hash keys in the order you want
                        foreach my $field (@names){
                             if (exists($hash{$field})){
                                 # note: this leaves a trailing "|" on each line
                                 print $hash{$field}[$i], "|";
                             }
                        }
                        print "\n";
                    }
                    There's usually more than one way to do things in perl, and I think it might be simpler to do what you are trying to do by reading the file with your headers one line at a time, matching the relevant parts of the header with regular expressions, modifying the parts of the header you want to change, and then printing the modified line. This would avoid having all the complicated hashes.
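
A minimal sketch of the regular-expression idea for a single line. The sub name tag_id is made up for illustration, and the pattern assumes the five-field layout of the headers in this thread (the id is the fourth "|"-separated field). A line with fewer than four fields is returned unchanged.

```perl
use strict;
use warnings;

# Append ":<counter>" to the fourth pipe-separated field of one header line.
sub tag_id {
    my ($line, $counter) = @_;
    # Capture the first three fields plus their separators, then the fourth field.
    $line =~ s/^((?:[^|]*\|){3}[^|]*)/${1}:$counter/;
    return $line;
}

print tag_id('gi|258677212|gb|FJ791250.1|_Xenopus_laevis_centromere_protein_C_(CENPC1)_mRNA,_complete_cds', 1), "\n";
```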

                    maria



                    • #11
                      OK - think I cracked it:


                      my (@gi, @number, @gb, @id, @name, @test);
                      my @heads = split('\n', $hits);
                      my $increment = 0;
                      foreach (@heads) {
                      my @split = split('\|');
                      ++$increment;
                      push @gi, $split[0];
                      push @number, $split[1];
                      push @gb, $split[2];
                      push @id, "$split[3]:$increment"; # append :N as in the example headers
                      push @test, $split[3];
                      push @name, $split[4];
                      }

                      for (my $i=0; $i <= $#heads; $i++) {
                      print HIT "$gi[$i]|$number[$i]|$gb[$i]|$id[$i]|$name[$i]\n";
                      }

                      my $pre_split1 = $heads[0];
                      my $pre_split2 = $heads[-1];
                      my $post_split1 = ("$gi[0]|$number[0]|$gb[0]|$test[0]|$name[0]");
                      my $post_split2 = ("$gi[-1]|$number[-1]|$gb[-1]|$test[-1]|$name[-1]");

                      if ($pre_split1 eq $post_split1 && $pre_split2 eq $post_split2) {
                      print "Headers are correctly ordered\n";
                      print "Finished splitting BLAST file\n";
                      }
                      else {
                      print "WARNING: Headers appear to be incorrectly ordered\n";
                      }



                      Any comments?

                      Thanks for everyone's help!



                      • #12
                        I think that your code is correct. At least it looks good upon first glance. However I agree (mostly) with 'mastal'. Since you do not need all of your hits (header lines) at the same time -- that is, you are not doing any inter-hit comparisons -- then it is a better solution to process and print the hits one at a time instead of processing all of them and then printing them one at a time. You can easily run out of computer memory by holding everything in memory.
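
A sketch of the one-at-a-time approach using plain split and join. The in-memory filehandles are stand-ins for real input and output files so that the example runs on its own; the two headers are the ones quoted earlier in the thread.

```perl
use strict;
use warnings;

# Stand-in for a real file: an in-memory filehandle holding two headers.
my $input = join "\n",
    'gi|258677212|gb|FJ791250.1|_Xenopus_laevis_centromere_protein_C_(CENPC1)_mRNA,_complete_cds',
    'gi|255926670|gb|GQ370808.1|_Xenopus_laevis_nephrosis_2_(Nphs2)_mRNA,_partial_cds';
open my $in, '<', \$input or die "Cannot open in-memory input: $!";

my $output = '';
open my $out, '>', \$output or die "Cannot open in-memory output: $!";

my $counter = 0;
while (my $hit = <$in>) {
    chomp $hit;
    # Limit of 5 keeps any extra "|" inside the name field.
    my ($gi, $number, $gb, $id, $name) = split /\|/, $hit, 5;
    $id .= ':' . ++$counter;
    print {$out} join('|', $gi, $number, $gb, $id, $name), "\n";
}
close $in;
close $out;

print $output;
```

Only one header is held in memory at any time, so the size of the input file does not matter.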

                        'mastal' suggests a regex method. This is where I disagree with him, at least for this simple task. Regular expressions tend to cause two problems where only one existed before. :-)

                        On the other hand, what better way to learn about arrays-of-arrays or hashes-of-arrays or my favorite hashes-of-hashes as well as regexps than to apply these techniques to a simple problem?

                        BTW, if you don't know, Data::Dumper is your friend.
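
For example, to look inside a hash of arrays (the hash contents here are just sample values from the thread):

```perl
use strict;
use warnings;
use Data::Dumper;

my %hash = (
    gi => ['gi', 'gi'],
    id => ['FJ791250.1', 'GQ370808.1'],
);

# Sort keys so the dump is stable between runs.
$Data::Dumper::Sortkeys = 1;

# Pass a reference so the whole structure is dumped as one nested value.
print Dumper(\%hash);
```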



                        • #13
                          Originally posted by mastal
                          There's usually more than one way to do things in perl
                          maria
                          This is why I don't usually help people on the internet with Perl. I suggest Stackoverflow for all your file format manipulation needs=)
