Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Stupid perl scripts for converting colour-space <-> base-space

    I whipped up a couple of 'stupid' conversion scripts for base/colour space conversion that should work with a few common sequence formats that get thrown their way, as long as the sequences are on single lines.

    To base space:
    Code:
    # cs2base.pl -- stupidly convert from colour-space to base-space
    
    # Author: David Eccles (gringer) 2011 <[email protected]>
    
    use warnings;
    use strict;
    
    while(<>){
        if(!(/^[ACGT0123NX\.]+$/)){
            print;
            next;
        }
        while(/[ACGT][0123]/){
            s/A0/AA/;s/C0/CC/;s/G0/GG/;s/T0/TT/;
            s/A1/AC/;s/C1/CA/;s/G1/GT/;s/T1/TG/;
            s/A2/AG/;s/C2/CT/;s/G2/GA/;s/T2/TC/;
            s/A3/AT/;s/C3/CG/;s/G3/GC/;s/T3/TA/;
        }
        print;
    }
    To colour-space:
    Code:
    # base2cs.pl -- stupidly convert from base-space to colour-space
    
    # Author: David Eccles (gringer) 2011 <[email protected]>
    
    use warnings;
    use strict;
    
    while(<>){
        if(!(/^[ACGT0123NX\.]+$/)){
            print;
            next;
        }
        while(/[ACGT0123][ACGT]/){
            s/AA([^ACGT])/A0$1/;
            s/CC([^ACGT])/C0$1/;
            s/GG([^ACGT])/G0$1/;
            s/TT([^ACGT])/T0$1/;
            ;s/AC([^ACGT])/A1$1/;s/AG([^ACGT])/A2$1/;s/AT([^ACGT])/A3$1/;
            s/CA([^ACGT])/C1$1/;;s/CG([^ACGT])/C3$1/;s/CT([^ACGT])/C2$1/;
            s/GA([^ACGT])/G2$1/;s/GC([^ACGT])/G3$1/;;s/GT([^ACGT])/G1$1/;
            s/TA([^ACGT])/T3$1/;s/TC([^ACGT])/T2$1/;s/TG([^ACGT])/T1$1/;;
        }
        print;
    }
    So, I said these should work with most things that get thrown at them:

    Code:
    $ echo "AGCGAGCTCAGCATCAGGCATCGACTAGCATCAACACTAC" | ~/scripts/base2cs.pl
    A233223221231321203132321232313210111231
    Code:
    $ head * | ~/scripts/base2cs.pl
    ==> 454_test.fasta <==
    >F6AJIXP02GO67R length=38 xy=2630_1077 region=2 run=R_2009_11_25_10_02_47_
    T1010213211122112101320323211013121220
    >F6AJIXP02GT62X length=48 xy=2687_0711 region=2 run=R_2009_11_25_10_02_47_
    T30131301202131101133221101010100332322301212NG0
    >F6AJIXP02FVOVL length=122 xy=2294_0543 region=2 run=R_2009_11_25_10_02_47_
    T30220331212110313033112201023322020111132131122321201311301
    T0101220220223232211130211010011010NC2322NA33NANA03301213332
    T0
    >F6AJIXP02JSYRL length=43 xy=3903_1391 region=2 run=R_2009_11_25_10_02_47_
    T322330321330113123020110202310220011220032
    
    ==> illumina_test.fasta <==
    >GAPC01_0005:7:1:1120:16293#0/1
    C03011NA133100113311333300NNNNNNNNNN
    >GAPC01_0005:7:1:1120:10747#0/1
    A02130003032213002030012021223310120
    >GAPC01_0005:7:1:1120:20021#0/1
    C00103010200003011103112111131201121
    >GAPC01_0005:7:1:1120:11773#0/1
    T33030102300300220300022022312021232
    
    
    ==> phix_genome.fasta <==
    >gi|9626372|ref|NC_001422.1| Enterobacteria phage phiX174, complete genome
    G221000332332020131213312202103011120023023330022123122123200003033220
    G233002312020303123121320110031320303003232021102121321033020003122200
    A302321023320201333123223222023222031200133210200233013210123013230221
    T210000121213331010231220222021103203033313201031131023121020212101003
    G233312212111300011021310132222302222011012113000300022233110230312332
    T122120323132110210101123033201302200321312212102103121201103203131310
    T201221033200103022233030232213021203202213031000102300301032022312300
    C323000221213221301100210010230132312121033222231132231233213310122032
    T133100331013133210212001110023310022332002021322021101221003301321303
    
    ==> test.fastq <==
    @H134:1:1201:1131:1970#0/1
    C21220312301200NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +
    babeeeeegggggghBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @H134:1:1201:1155:1987#0/1
    C10132022222200322132222211013023110220311101320312
    +
    b_beeeeegggggiiiighiiiiiiihihiiiihhfhghiiiiieegdghd
    @H134:1:1201:1086:1989#0/1
    C03203203022110330222NG003330333220331201211330NA01
    The output is a bit quirky, but might be sufficient as a substitute for doing things 'by hand'.

    Note that these are slow due to heavy use of regular expression substitution, and a loop over the entire line for each substitution (in the worst case). The 'proper' way would be to iterate over the string from each line, and call a function that depends on the previous character and the current character.

  • #2
    This definitely could be optimized. You don't need all of those regexes, since you can use the XOR operation to convert two colors to a base (see below). I hope this helps.

    Code:
    #!/bin/perl
    
    use strict;
    use warnings;
    
    while(defined(my $seq = <STDIN>)) {
        chomp($seq);
        $seq =~ tr/ACGTacgt/01230123/;
        for(my $i=1;$i<length($seq);$i++) {
            substr($seq, $i, 1) = int(substr($seq, $i-1, 1)) ^ int(substr($seq, $i, 1));     
        }
        $seq = substr($seq, 1);
        $seq =~ tr/01234/ACGTN/;
        print "$seq\n";
    }
    Last edited by nilshomer; 07-13-2011, 04:46 AM.

    Comment


    • #3
      Sure, it can be optimised quite easily. I was going for the literal rules translation, rather than writing it using better code. But be careful with how you do that... my script ignores things that don't look like sequence:
      Code:
      $ ~/scripts/base2cs.pl 454_test.fasta | perl ./nilshomer.pl | diff - 454_test.fasta 
      Argument ">" isn't numeric in int at ./nilshomer.pl line 10, <STDIN> line 1.
      Argument "F" isn't numeric in int at ./nilshomer.pl line 10, <STDIN> line 1.
      Argument "J" isn't numeric in int at ./nilshomer.pl line 10, <STDIN> line 1.
      ... [many similar lines skipped]
      substr outside of string at ./nilshomer.pl line 12, <STDIN> line 11.
      Use of uninitialized value in transliteration (tr///) at ./nilshomer.pl line 13, <STDIN> line 11.
      Use of uninitialized value $seq in concatenation (.) or string at ./nilshomer.pl line 14, <STDIN> line 11.
      1,10c1,10
      < A6666666N66A77777756665CGGGGGA6555NNTNNNN66666NNNNNNNN666CNN5NN6TTGGGGAANTT
      < GGTTCATCACAGACAGTTGCTTAGCTGTTGCAGTCTT
      < A6666666N65TCCCCCCTAAANCTTTTTC7CNTTTN5NNNN66666NNNNNNNN666CNN5NN6TTGGGGAANTT
      < AACGTAACTTCATGTTGTATCTGTTGGTTGGGCGATCTAACTGAAGG
      < A6666666NNNNNNNNNN6555N6NNNNN6NCG666T7NNNN66666NNNNNNNN666CNN5NN6TTGGGGAANTT
      < AAGAATACTGACAATGCCGCACTCCAAGCGAGGAACACATCATGTCTAGTCCATGTAAC
      < TGGTCTTCTTCTAGCTCACATTCACCAAACAACCCAGCTCCCGCCCCCCGCCAGTATAG
      < T
      < A6666666NNNNNNNNNN6555CGGGGGC88CAACGCACCCCTTTTTCCCCCCCCTTTCCCACCT66777755C66
      < AGATAATCATAACATGATTCCACCTTCGTTCTTTGTCTTTAG
      ---
      > >F6AJIXP02GO67R length=38 xy=2630_1077 region=2 run=R_2009_11_25_10_02_47_
      > TGGTTCATCACAGACAGTTGCTTAGCTGTTGCAGTCTT
      > >F6AJIXP02GT62X length=48 xy=2687_0711 region=2 run=R_2009_11_25_10_02_47_
      > TAACGTAACTTCATGTTGTATCTGTTGGTTGGGCGATCTAACTGANGG
      > >F6AJIXP02FVOVL length=122 xy=2294_0543 region=2 run=R_2009_11_25_10_02_47_
      > TAAGAATACTGACAATGCCGCACTCCAAGCGAGGAACACATCATGTCTAGTCCATGTAAC
      > TTGGTCTTCTTCTAGCTCACATTCACCAAACAACCNCTAGANATANANAATAACTGCGCT
      > TT
      > >F6AJIXP02JSYRL length=43 xy=3903_1391 region=2 run=R_2009_11_25_10_02_47_
      > TAGATAATCATAACATGATTCCACCTTCGTTCTTTGTCTTTAG
      So they're quite different. Here's a slight modification of your script:

      Code:
      #!/usr/bin/perl
      
      # for converting colour-space to base-space
      
      use strict;
      use warnings;
      
      while(<>) {
          if(!(/^[ACGT0123NX\.]+$/)){
              print;
              next;
          }
          my $seq = $_;
          $seq =~ tr/ACGTacgt/01230123/;
          for(my $i=1;$i<length($seq);$i++) {
              my $pair = substr($seq, $i-1, 2);
              if($pair =~ /[0123][0123]/){
                substr($seq, $i, 1) = 
                    int(substr($seq, $i-1, 1)) ^ int(substr($seq, $i, 1));
              }
          }
          $seq =~ tr/0123/ACGT/;
          print $seq;
      }
      This seems to be able to reverse my basetocs.pl script successfully:
      Code:
      $ ~/scripts/base2cs.pl 454_test.fasta | ./nilshomer_modified.pl | diff - 454_test.fasta
      [no output -- i.e. inputs are the same]

      Comment


      • #4
        Ooops! I wrote my response post (below the '-----------') without considering the first line in your post:

        ... as long as the sequences are on single lines.
        I apologize for missing that statement. Although I do wonder why you wrote programs that do not work on multi-line input and then (and this is what probably tripped me up) tested them on multi-line sequences -- and thus generating non-valid output sequences that can not be read by any other colorspace-aware program.

        Just for reference I am keeping my original post below.

        ------------------


        Aside from, as Nils said, potential inefficiency, the code simply does not generate legal colorspace from basespace nor vice-versa. Both base space and colorspace are not per-line outputs. Instead they convert the entire sequence. In other words, using the canonical program 'encodeFasta' from ABI/Lifetech' initial 'corona' package then starting with the following basespace sequence:

        Code:
        >my sequence
        AAACGTGC
        CACCT
        One should get the colorspace sequence

        Code:
        >my sequence
        A001311301102
        And conversely, a color-space sequence that looks like:

        Code:
        >my cs seq
        A000123321
        000111222
        Looks like:
        Code:
        >my cs seq
        AAACTATCAAAACACTCT
        Perhaps obviously the number of characters on a line is not significant. But do treat the entire sequence as one.

        Also your code does not handle 'N's nor the corresponding 4's. Nor a period in either colorspace or basespace (at least in any sensible manner). I have also seen 5 and 6s in colorspace. Granted once a sequence goes into Ns or 4,5,6s then it becomes so much garbage but one should at least take care of such occurrences .
        Last edited by westerman; 07-13-2011, 10:09 AM. Reason: Missed first line in original post.

        Comment


        • #5
          Aside from, as Nils said, potential inefficiency, the code simply does not generate legal colorspace from basespace nor vice-versa. Both base space and colorspace are not per-line outputs. Instead they convert the entire sequence.
          This is the "stupid" bit of my code. It should work with fastq files (which are strictly per-line files), as long as the quality scores do not match ^[ACGT0-6NX\.]+$. Given the inaccuracy of colour-space to base-space conversions for long sequences, I can't think of why you'd want to translate more than one line into colour-space.

          However, just because I had a bit of time on my hands, I've done an update that is a bit more efficient (it now takes about ten times as long as gzip), and should be able to manage multi-line files.

          First for colour-space to base-space:
          Code:
          #!/usr/bin/perl
          # cs2base.pl -- stupidly convert from colour-space to base-space
          
          # Author: David Eccles (gringer) 2011 <[email protected]>
          
          use warnings;
          use strict;
          
          my $lastBase = "";
          
          while(<>){
              if(!(/^[ACGT0123456NX\.]+$/)){
                  print;
                  $lastBase = "";
                  next;
              }
              my $sequence = $lastBase.$_;
              my @chars = split(//, $sequence);
              for(my $i = 1; $i < scalar(@chars); $i++){
                  if(($chars[$i] eq ".") ||
                     ($chars[$i] eq "4") || 
                     ($chars[$i] eq "5") || 
                     ($chars[$i] eq "6")){
                      $chars[$i] = "N";
                  } elsif($chars[$i] eq "0"){
                      $chars[$i] = $chars[$i-1];
                  } elsif((($chars[$i-1] eq "C") && ($chars[$i] eq "1")) ||
                          (($chars[$i-1] eq "G") && ($chars[$i] eq "2")) ||
                          (($chars[$i-1] eq "T") && ($chars[$i] eq "3"))){
                      $chars[$i] = "A";
                  } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "1")) ||
                          (($chars[$i-1] eq "T") && ($chars[$i] eq "2")) ||
                          (($chars[$i-1] eq "G") && ($chars[$i] eq "3"))){
                      $chars[$i] = "C";
                  } elsif((($chars[$i-1] eq "T") && ($chars[$i] eq "1")) ||
                          (($chars[$i-1] eq "A") && ($chars[$i] eq "2")) ||
                          (($chars[$i-1] eq "C") && ($chars[$i] eq "3"))){
                      $chars[$i] = "G";
                  } elsif((($chars[$i-1] eq "G") && ($chars[$i] eq "1")) ||
                          (($chars[$i-1] eq "C") && ($chars[$i] eq "2")) ||
                          (($chars[$i-1] eq "A") && ($chars[$i] eq "3"))){
                      $chars[$i] = "T";
                  }
              }
              if($lastBase){
                  $chars[0] = "";
              }
              $lastBase = $chars[$#chars-1];
              print(join("",@chars));
          }
          And then base-space to colour-space:
          Code:
          #!/usr/bin/perl
          # base2cs.pl -- stupidly convert from base-space to colour-space
          
          # Author: David Eccles (gringer) 2011 <[email protected]>
          
          use warnings;
          use strict;
          
          my $sequenceContinue = 0; # false
          my $lastBase = "";
          
          while(<>){
              if(!(/^[ACGT0123456NX\.]+$/)){
                  print;
                  $sequenceContinue = 0; # false
                  $lastBase = "";
                  next;
              }
              my $sequence = $lastBase.$_;
              my @chars = split(//, $sequence);
              $lastBase = $chars[$#chars-1];
              for(my $i = $#chars; $i > 0; $i--){
                  if($chars[$i] eq $chars[$i-1]){
                      if(($chars[$i] eq ".") ||
                         ($chars[$i] eq "X") || 
                         ($chars[$i] eq "N")){
                          $chars[$i] = ".";
                      } else {
                          $chars[$i] = "0";
                      }
                  } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "C")) ||
                          (($chars[$i-1] eq "C") && ($chars[$i] eq "A")) ||
                          (($chars[$i-1] eq "G") && ($chars[$i] eq "T")) ||
                          (($chars[$i-1] eq "T") && ($chars[$i] eq "G"))){
                      $chars[$i] = "1";
                  } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "G")) ||
                          (($chars[$i-1] eq "C") && ($chars[$i] eq "T")) ||
                          (($chars[$i-1] eq "G") && ($chars[$i] eq "A")) ||
                          (($chars[$i-1] eq "T") && ($chars[$i] eq "C"))){
                      $chars[$i] = "2";
                  } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "T")) ||
                          (($chars[$i-1] eq "C") && ($chars[$i] eq "G")) ||
                          (($chars[$i-1] eq "G") && ($chars[$i] eq "C")) ||
                          (($chars[$i-1] eq "T") && ($chars[$i] eq "A"))){
                      $chars[$i] = "3";
                  } elsif(($chars[$i] eq "N") || 
                          ($chars[$i] eq "X") || 
                          ($chars[$i] eq ".")){
                      $chars[$i] = ".";
                  }
              }
              if($sequenceContinue){
                  $chars[0] = "";
              }
              $sequenceContinue = 1; # true
              print(join("",@chars));
          }
          It produces slightly odd output around the mis-reads, but this is to make it reversible:

          Code:
          $ ~/scripts/base2cs.pl 454_test.fasta
          >F6AJIXP02GO67R length=38 xy=2630_1077 region=2 run=R_2009_11_25_10_02_47_
          T1010213211122112101320323211013121220
          >F6AJIXP02GT62X length=48 xy=2687_0711 region=2 run=R_2009_11_25_10_02_47_
          T30131301202131101133221101010100332322301212.G0
          >F6AJIXP02FVOVL length=122 xy=2294_0543 region=2 run=R_2009_11_25_10_02_47_
          T30220331212110313033112201023322020111132131122321201311301
          20101220220223232211130211010011010.C2322.A33.A.A03301213332
          00
          >F6AJIXP02JSYRL length=43 xy=3903_1391 region=2 run=R_2009_11_25_10_02_47_
          T322330321330113123020110202310220011220032
          $ ~/scripts/base2cs.pl 454_test.fasta | ~/scripts/cs2base.pl | diff - 454_test.fasta 
          [no output, i.e. no differences]
          If I had used the proper 4/5/6 notation, then everything after a misread would be lost.
          Last edited by gringer; 07-19-2011, 02:47 AM. Reason: useless extra code

          Comment


          • #6
            Given the inaccuracy of colour-space to base-space conversions for long sequences, I can't think of why you'd want to translate more than one line into colour-space.
            To be more precise about this, there is no inaccuracy of converting CS to BS. The algorithm is exact. The problem occurs when converting inaccurate *data* from CS to BS. Garbage-in. Garbage out. The problem will most often occur when you try to convert a single read from CS to BS. Converting a correctly mapped and high quality consensus from CS to BS will not cause any inaccuracy.

            The best practice when working with CS reads is to do as much work (mapping, snp calling, etc.) within CS and only at the very end convert from CS to BS. Anything else will eliminate the power of CS (e.g., the ability to distinguish between a machine error and a true variant) and risk inaccurate bases.

            ........

            If I had used the proper 4/5/6 notation, then everything after a misread would be lost.
            Yeah, well it works for your two programs for inter-conversion. But it won't work for the rest of us. Let us know if and when you generate and read true color-space files. In the meantime you might as well call your programs 'conversion between base-space and gringer-space' because that is what you essentially are doing.

            Comment


            • #7
              Let us know if and when you generate and read true color-space files.
              Here you go. I've updated the scripts to work with strict colour-space. First colour-space to base-space:
              Code:
              #!/usr/bin/perl
              # cs2base.pl -- convert from colour-space to base-space
              
              # Author: David Eccles (gringer) 2011 <[email protected]>
              
              use warnings;
              use strict;
              
              my $lastBase = "";
              
              while(<>){
                  if(!(/^[ACGT0123456NX\.]+$/)){
                      print;
                      $lastBase = "";
                      next;
                  }
                  my $sequence = $lastBase.$_;
                  my @chars = split(//, $sequence);
                  for(my $i = 1; $i < scalar(@chars); $i++){
                      if(($chars[$i] eq ".") ||
                         ($chars[$i] eq "4") || 
                         ($chars[$i] eq "5") || 
                         ($chars[$i] eq "6")){
                          $chars[$i] = "N";
                      } elsif($chars[$i] eq "0"){
                          $chars[$i] = $chars[$i-1];
                      } elsif((($chars[$i-1] eq "C") && ($chars[$i] eq "1")) ||
                              (($chars[$i-1] eq "G") && ($chars[$i] eq "2")) ||
                              (($chars[$i-1] eq "T") && ($chars[$i] eq "3"))){
                          $chars[$i] = "A";
                      } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "1")) ||
                              (($chars[$i-1] eq "T") && ($chars[$i] eq "2")) ||
                              (($chars[$i-1] eq "G") && ($chars[$i] eq "3"))){
                          $chars[$i] = "C";
                      } elsif((($chars[$i-1] eq "T") && ($chars[$i] eq "1")) ||
                              (($chars[$i-1] eq "A") && ($chars[$i] eq "2")) ||
                              (($chars[$i-1] eq "C") && ($chars[$i] eq "3"))){
                          $chars[$i] = "G";
                      } elsif((($chars[$i-1] eq "G") && ($chars[$i] eq "1")) ||
                              (($chars[$i-1] eq "C") && ($chars[$i] eq "2")) ||
                              (($chars[$i-1] eq "A") && ($chars[$i] eq "3"))){
                          $chars[$i] = "T";
                      } elsif(($chars[$i-1] eq "N") && (($chars[$i] eq "0")||
                                                        ($chars[$i] eq "1")||
                                                        ($chars[$i] eq "2")||
                                                        ($chars[$i] eq "3"))){
                          $chars[$i] = "N";
                      }
                  }
                  if($lastBase){
                      $chars[0] = "";
                  }
                  $lastBase = $chars[$#chars-1];
                  print(join("",@chars));
              }
              And then base-space to colour-space:
              Code:
              #!/usr/bin/perl
              # base2cs.pl -- convert from base-space to colour-space
              
              # Author: David Eccles (gringer) 2011 <[email protected]>
              
              use warnings;
              use strict;
              
              my $sequenceContinue = 0; # false
              my $lastBase = "";
              
              my $strictColourSpace = 0; # false
              
              if(@ARGV && ($ARGV[0] eq "-strict")){
                  shift(@ARGV);
                  $strictColourSpace = 1;
              }
              
              while(<>){
                  if(!(/^[ACGT0123456NX\.]+$/)){
                      print;
                      $sequenceContinue = 0; # false
                      $lastBase = "";
                      next;
                  }
                  my $sequence = $lastBase.$_;
                  my @chars = split(//, $sequence);
                  $lastBase = $chars[$#chars-1];
                  for(my $i = $#chars; $i > 0; $i--){
                      if($chars[$i] eq $chars[$i-1]){
                          if(($chars[$i] eq ".") ||
                             ($chars[$i] eq "X") || 
                             ($chars[$i] eq "N")){
                              $chars[$i] = ".";
                          } else {
                              $chars[$i] = "0";
                          }
                      } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "C")) ||
                              (($chars[$i-1] eq "C") && ($chars[$i] eq "A")) ||
                              (($chars[$i-1] eq "G") && ($chars[$i] eq "T")) ||
                              (($chars[$i-1] eq "T") && ($chars[$i] eq "G"))){
                          $chars[$i] = "1";
                      } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "G")) ||
                              (($chars[$i-1] eq "C") && ($chars[$i] eq "T")) ||
                              (($chars[$i-1] eq "G") && ($chars[$i] eq "A")) ||
                              (($chars[$i-1] eq "T") && ($chars[$i] eq "C"))){
                          $chars[$i] = "2";
                      } elsif((($chars[$i-1] eq "A") && ($chars[$i] eq "T")) ||
                              (($chars[$i-1] eq "C") && ($chars[$i] eq "G")) ||
                              (($chars[$i-1] eq "G") && ($chars[$i] eq "C")) ||
                              (($chars[$i-1] eq "T") && ($chars[$i] eq "A"))){
                          $chars[$i] = "3";
                      } elsif(($chars[$i] eq "N") || 
                              ($chars[$i] eq "X") || 
                              ($chars[$i] eq ".")){
                          # from http://seqanswers.com/forums/showpost.php?p=8619&postcount=18
                          if(($chars[$i-1] eq "A") ||
                             ($chars[$i-1] eq "C") ||
                             ($chars[$i-1] eq "G") ||
                             ($chars[$i-1] eq "T")){
                              $chars[$i] = "4";
                          } elsif(($chars[$i-1] eq "N") ||
                                  ($chars[$i-1] eq "X") ||
                                  ($chars[$i-1] eq ".")){
                              $chars[$i] = "6";
                          }
                      } elsif($strictColourSpace && (($chars[$i-1] eq "N") || 
                                                     ($chars[$i-1] eq "X") || 
                                                     ($chars[$i-1] eq "."))){
                          if(($chars[$i] eq "A") ||
                             ($chars[$i] eq "C") ||
                             ($chars[$i] eq "G") ||
                             ($chars[$i] eq "T")){
                              $chars[$i] = "5";
                          }
                      }
                  }
                  if($sequenceContinue){
                      $chars[0] = "";
                  }
                  $sequenceContinue = 1; # true
                  print(join("",@chars));
              }
              You need to use the command line option '-strict' in order to output a strict colour-space sequence. Otherwise, when a base is known following a misread, that base is output in the sequence:
              Code:
              $ ~/scripts/base2cs.pl 454_test.fasta
              >F6AJIXP02GO67R length=38 xy=2630_1077 region=2 run=R_2009_11_25_10_02_47_
              T1010213211122112101320323211013121220
              >F6AJIXP02GT62X length=48 xy=2687_0711 region=2 run=R_2009_11_25_10_02_47_
              T301313012021311011332211010101003323223012124G0
              >F6AJIXP02FVOVL length=122 xy=2294_0543 region=2 run=R_2009_11_25_10_02_47_
              T30220331212110313033112201023322020111132131122321201311301
              201012202202232322111302110100110104C23224A334A4A03301213332
              00
              >F6AJIXP02JSYRL length=43 xy=3903_1391 region=2 run=R_2009_11_25_10_02_47_
              T322330321330113123020110202310220011220032
              $ ~/scripts/base2cs.pl 454_test.fasta | ~/scripts/cs2base.pl | diff - 454_test.fasta 
              $ ~/scripts/base2cs.pl -strict 454_test.fasta | ~/scripts/cs2base.pl | diff - 454_test.fasta 
              4c4
              < TAACGTAACTTCATGTTGTATCTGTTGGTTGGGCGATCTAACTGANNN
              ---
              > TAACGTAACTTCATGTTGTATCTGTTGGTTGGGCGATCTAACTGANGG
              7,8c7,8
              < TTGGTCTTCTTCTAGCTCACATTCACCAAACAACCNNNNNNNNNNNNNNNNNNNNNNNNN
              < NN
              ---
              > TTGGTCTTCTTCTAGCTCACATTCACCAAACAACCNCTAGANATANANAATAACTGCGCT
              > TT

              Comment


              • #8
                Looks good. I compared your output to both ABI's corona-based encodeFasta.py script and my own perl-based scripts. There are a couple of differences in the encoding of Ns from base-space to color-space (e.g., use of a '4' instead of a '5') but this does not make a big difference -- an unknown is an unknown no matter how it is represented.

                I know that your post is entitled 'stupid perl' scripts so I won't comment on the code much except to note that on a large multi-consensus file your script took 80-90 seconds to encode and decode while ABI's python script to 80 seconds to go from BS to CS and 25 seconds to do the opposite and my own Perl scripts were bi-directional at 45 seconds each. So there is room for speed improvements in your script. However I doubt if such improvements are worth the effort except as a way to learn other Perl techniques.

                There is one code comment I do want to make and that is the use of your command line parameter. While there is no "set in stone" standards, in the unix world a single dash indicates one or more single letter options while a double-dash indicates a word or partial word option. In other words, from a unix background, I would expect that your command line option would either be:

                -s

                or

                --strict

                but not what you are looking for which is

                -strict

                BTW, the double-dash could also be found (if unique as a parameter) as:

                --st
                --str
                --stri
                --stric

                I suggest the use of the Getopt::Long module to do your command line parsing.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                11 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                51 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                68 views
                0 likes
                Last Post seqadmin  
                Working...
                X