Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Parsing multi fasta sequence file using Perl

    Hello,
    I am very new to Perl scripting and scripting in general. I am trying to extract information from a multi fasta file to an output file. I have constructed a script but it isn't giving me the output that I want.

    For example (this would the current fasta file)
    >gi|546265522| SOX6
    acgtcaatccag
    cgattagcaaat
    gtcctgattttgg

    >gi|678457845| CMYC
    gttaccatgcgatg
    caatttgggacacc

    I want (notice the ">" is removed:
    gi|546265522| SOX6
    Seq length: 36
    gi|678457845| CMYC
    Seq length: 28

    I was able to put the lines of the sequences into one string but calculating the length isn't working. I removed any spaces within the new single string and attempted $length = length($line); but that doesn't work. The current method in my script also gives the same output like:::
    >gi|546265522| SOX6
    121212

    >gi|678457845| CMYC
    1414

    How do I solve this problem??
    Ultimately i want some like this but I am taking it in steps so that I actually understand what I'm doing and why.
    gi|678457845| CMYC (tab) seq length (tab) AT/GC content

    Here is my script

    #!/usr/bin/perl -w

    print "file: \n";
    $in = <STDIN>;
    chomp $in;

    print "output file: \n";
    $out = <STDIN>;
    chop $out;

    unless ( open(IN, $in) ) {
    die ("cant input file $in\n");}

    unless ( open(OUT, ">$out") ) {
    die("cant open output file $out\n");}

    my $line = <IN>;
    print OUT $line;

    while ($line = <IN>)
    {
    chomp $line;
    if ($line=~/^>(.+)/) {
    print OUT "\n",$line,"\n"; }
    else { $line =~ s/^\s*(.*)\s*$/$1/;

    $a=($line=~tr/aA//);
    $c=($line=~tr/cC//);
    $g=($line=~tr/gG//);
    $t=($line=~tr/tT//);
    $n=($line=~tr/nN//);
    $x=($line=~tr/xX//);

    $length = $a + $c + $g + $t + $n + $x;
    print OUT $length; }
    }

    print OUT "\n";

  • #2
    Try using

    use strict;
    use warnings;
    in your perl scripts, its a good practice and eliminates a lot of mistakes in your code.

    One possible issue I can see with this script is that the variables you use for counting need to be declared outside the while loop if you want to use them to count on multi-line fasta files. Also its safer to initialize variables to say 0 when you declare them.

    Also try getting familiar with Bio perl modules, its much easier to parse sequence format files with them.
    Last edited by vivek_; 11-06-2012, 08:03 AM.

    Comment


    • #3
      Originally posted by newbie2this View Post
      Hello,
      I am very new to Perl scripting and scripting in general. I am trying to extract information from a multi fasta file to an output file. I have constructed a script but it isn't giving me the output that I want.

      For example (this would the current fasta file)
      >gi|546265522| SOX6
      acgtcaatccag
      cgattagcaaat
      gtcctgattttgg

      >gi|678457845| CMYC
      gttaccatgcgatg
      caatttgggacacc

      I want (notice the ">" is removed:
      gi|546265522| SOX6
      Seq length: 36
      gi|678457845| CMYC
      Seq length: 28

      I was able to put the lines of the sequences into one string but calculating the length isn't working. I removed any spaces within the new single string and attempted $length = length($line); but that doesn't work. The current method in my script also gives the same output like:::
      >gi|546265522| SOX6
      121212

      >gi|678457845| CMYC
      1414



      How do I solve this problem??
      Ultimately i want some like this but I am taking it in steps so that I actually understand what I'm doing and why.
      gi|678457845| CMYC (tab) seq length (tab) AT/GC content

      Here is my script

      #!/usr/bin/perl -w

      print "file: \n";
      $in = <STDIN>;
      chomp $in;

      print "output file: \n";
      $out = <STDIN>;
      chop $out;

      unless ( open(IN, $in) ) {
      die ("cant input file $in\n");}

      unless ( open(OUT, ">$out") ) {
      die("cant open output file $out\n");}

      my $line = <IN>;
      print OUT $line;

      while ($line = <IN>)
      {
      chomp $line;
      if ($line=~/^>(.+)/) {
      print OUT "\n",$line,"\n"; }
      else { $line =~ s/^\s*(.*)\s*$/$1/;

      $a=($line=~tr/aA//);
      $c=($line=~tr/cC//);
      $g=($line=~tr/gG//);
      $t=($line=~tr/tT//);
      $n=($line=~tr/nN//);
      $x=($line=~tr/xX//);

      $length = $a + $c + $g + $t + $n + $x;
      print OUT $length; }
      }

      print OUT "\n";
      >gi|546265522| SOX6
      121212


      this numbers came from the length of each line but printed in the same line with any space then the number is like "121212" you should read "12-12-12" and if you add in the line marked in red the "-" you obtain this. A way to get the total length of every contig, you have to sum the length of the lines that form each contig and when you find another contig put the length equal to 0 and sum the lines of each contig. If you need more help I modify the code but I think you can modify it.

      Comment


      • #4
        thanks bunches for the help but I still wasn`t able to get my script to run the way i want it to.

        Comment


        • #5
          n2t,

          One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

          Here is a heavily commented bit of perl which will accomplish what you asked.

          Code:
          #!/usr/bin/perl
          
          use strict;
          use warnings;
          
          my $inFile = shift; 
          open (IN, "$inFile");
          
          # By default Perl pulls in chunks of text up to a newline (\n) character; newline is
          # the default Input Record Separator. You can change the Input Record Separator by
          # using the special variable "$/". When dealing with FASTA files I normally change the
          # Input Record Separator to ">" which allows your script to take in a full, multiline
          # FASTA record at once.
          
          $/ = ">";
          
          # At each input your script will now read text up to and including the first ">" it encounters.
          # This means you have to deal with the first ">" at the begining of the file as a special case.
          
          my $junk = <IN>; # Discard the ">" at the begining of the file
          
          # Now read through your input file one sequence record at a time. Each input record will be a
          # multiline FASTA entry.
          
          while ( my $record = <IN> ) {
          	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
          	
          # 	Now split up your record into its definition line and sequence lines using split at each newline.
          # 	The definition will be stored in a scalar variable and each sequence line as an
          # 	element of an array.
          	
          	my ($defLine, @seqLines) = split /\n/, $record;
          	
          #	Join the individual sequence lines into one single sequence and store in a scalar variable.
          	
          	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
          	
          	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
          	
          	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline
          
          }

          Comment


          • #6
            You'll do yourself a huge favor if you resist reinventing the wheel (not that I always follow this advice!).

            In particular, take a look at Bio::SeqIO -- this will let you read & write many sequence formats, including FASTA. Your particular program is roughly (taking the input file from the command line and outputting to STDOUT)

            Code:
            #!/usr/bin/perl
            use strict;
            use Bio::SeqIO;
            my $reader=new Bio::SeqIO(-format=>'fasta',-file=>shift);
            while (my $seqRec=$reader->next_seq)
            {
                print join("\t",$seqRec->id,$seqRec->length),"\n";
            }
            Of course, what you were attempting will get you up to speed on standard Perl, but you'll go far learning the higher-level idioms that BioPerl and other libraries give you, and still have plenty to do with the basics.

            Comment


            • #7
              Originally posted by kmcarr View Post
              n2t,

              One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

              Here is a heavily commented bit of perl which will accomplish what you asked.

              Code:
              #!/usr/bin/perl
              
              use strict;
              use warnings;
              
              my $inFile = shift; 
              open (IN, "$inFile");
              
              # By default Perl pulls in chunks of text up to a newline (\n) character; newline is
              # the default Input Record Separator. You can change the Input Record Separator by
              # using the special variable "$/". When dealing with FASTA files I normally change the
              # Input Record Separator to ">" which allows your script to take in a full, multiline
              # FASTA record at once.
              
              $/ = ">";
              
              # At each input your script will now read text up to and including the first ">" it encounters.
              # This means you have to deal with the first ">" at the begining of the file as a special case.
              
              my $junk = <IN>; # Discard the ">" at the begining of the file
              
              # Now read through your input file one sequence record at a time. Each input record will be a
              # multiline FASTA entry.
              
              while ( my $record = <IN> ) {
              	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
              	
              # 	Now split up your record into its definition line and sequence lines using split at each newline.
              # 	The definition will be stored in a scalar variable and each sequence line as an
              # 	element of an array.
              	
              	my ($defLine, @seqLines) = split /\n/, $record;
              	
              #	Join the individual sequence lines into one single sequence and store in a scalar variable.
              	
              	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
              	
              	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
              	
              	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline
              
              }
              with some minor changea and addition to suit my objective, this worked like a charm. Thanks for all the help.

              Comment


              • #8
                Thanks,
                that script was very useful and simplified my attempt a lot.
                From a perl script of view, a lot of things can be simplified.
                I was only interested in the e.g. length (for statistics):

                while(<>){
                chomp;
                @split = split /\n/;
                push(@size_list, length $split[1]);

                }

                shift @size_list;

                Still, I guess lot of ways to improve and do it another way

                Comment


                • #9
                  Originally posted by ebioman View Post
                  Thanks,
                  that script was very useful and simplified my attempt a lot.
                  From a perl script of view, a lot of things can be simplified.
                  I was only interested in the e.g. length (for statistics):

                  while(<>){
                  chomp;
                  @split = split /\n/;
                  push(@size_list, length $split[1]);

                  }

                  shift @size_list;

                  Still, I guess lot of ways to improve and do it another way
                  You really should follow the best practices and use modern pragmas. That means putting
                  Code:
                  use strict; 
                  use warnings;
                  and the top of your script and use lexical variables to limit their scope and avoid collisions. Writing in a very minimal style might mean a little less typing, but it will also mean that your script will fail silently, which won't save you any time in the long run.

                  Comment


                  • #10
                    My fault, I actually only copied a snippet, since my program is way larger.
                    But you are right:

                    Code:
                    #!/usr/bin/perl
                    use warnings;
                    use diagnostics;
                    use strict;
                    
                    $/ = ">";
                    my (@size_list,@split);
                    
                    while(<>){
                         chomp;
                         @split = split /\n/;
                         push(@size_list, length $split[1]);
                       
                    }
                    
                    shift @size_list;
                    Thx

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    9 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    67 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X