SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extract sequence from multi fasta file with PERL andreitudor Bioinformatics 27 07-07-2019 08:45 AM
Does Picard tool support multi-fasta file? ikarus97 Bioinformatics 2 05-13-2013 05:59 PM
perl : Remove redundant feature in fasta file StephaniePi83 Bioinformatics 9 12-15-2012 07:01 PM
Inquiry of Parsing tblastn result to get nucleotide sequence from the db file sunfuhui Bioinformatics 2 09-12-2012 06:39 AM
Perl: get specific base from FASTA file. njh_TO Bioinformatics 6 02-02-2012 06:34 AM

Reply
 
Thread Tools
Old 11-06-2012, 07:26 AM   #1
newbie2this
Junior Member
 
Location: VA

Join Date: Nov 2012
Posts: 4
Default Parsing multi fasta sequence file using Perl

Hello,
I am very new to Perl scripting and scripting in general. I am trying to extract information from a multi fasta file to an output file. I have constructed a script but it isn't giving me the output that I want.

For example (this would the current fasta file)
>gi|546265522| SOX6
acgtcaatccag
cgattagcaaat
gtcctgattttgg

>gi|678457845| CMYC
gttaccatgcgatg
caatttgggacacc

I want (notice the ">" is removed:
gi|546265522| SOX6
Seq length: 36
gi|678457845| CMYC
Seq length: 28

I was able to put the lines of the sequences into one string but calculating the length isn't working. I removed any spaces within the new single string and attempted $length = length($line); but that doesn't work. The current method in my script also gives the same output like:::
>gi|546265522| SOX6
121212

>gi|678457845| CMYC
1414

How do I solve this problem??
Ultimately i want some like this but I am taking it in steps so that I actually understand what I'm doing and why.
gi|678457845| CMYC (tab) seq length (tab) AT/GC content

Here is my script

#!/usr/bin/perl -w

print "file: \n";
$in = <STDIN>;
chomp $in;

print "output file: \n";
$out = <STDIN>;
chop $out;

unless ( open(IN, $in) ) {
die ("cant input file $in\n");}

unless ( open(OUT, ">$out") ) {
die("cant open output file $out\n");}

my $line = <IN>;
print OUT $line;

while ($line = <IN>)
{
chomp $line;
if ($line=~/^>(.+)/) {
print OUT "\n",$line,"\n"; }
else { $line =~ s/^\s*(.*)\s*$/$1/;

$a=($line=~tr/aA//);
$c=($line=~tr/cC//);
$g=($line=~tr/gG//);
$t=($line=~tr/tT//);
$n=($line=~tr/nN//);
$x=($line=~tr/xX//);

$length = $a + $c + $g + $t + $n + $x;
print OUT $length; }
}

print OUT "\n";
newbie2this is offline   Reply With Quote
Old 11-06-2012, 07:52 AM   #2
vivek_
PhD Student
 
Location: Denmark

Join Date: Jul 2012
Posts: 164
Default

Try using

Quote:
use strict;
use warnings;
in your perl scripts, its a good practice and eliminates a lot of mistakes in your code.

One possible issue I can see with this script is that the variables you use for counting need to be declared outside the while loop if you want to use them to count on multi-line fasta files. Also its safer to initialize variables to say 0 when you declare them.

Also try getting familiar with Bio perl modules, its much easier to parse sequence format files with them.

Last edited by vivek_; 11-06-2012 at 08:03 AM.
vivek_ is offline   Reply With Quote
Old 11-06-2012, 08:58 AM   #3
jesus garcia
Junior Member
 
Location: madrid

Join Date: Nov 2011
Posts: 7
Default

Quote:
Originally Posted by newbie2this View Post
Hello,
I am very new to Perl scripting and scripting in general. I am trying to extract information from a multi fasta file to an output file. I have constructed a script but it isn't giving me the output that I want.

For example (this would the current fasta file)
>gi|546265522| SOX6
acgtcaatccag
cgattagcaaat
gtcctgattttgg

>gi|678457845| CMYC
gttaccatgcgatg
caatttgggacacc

I want (notice the ">" is removed:
gi|546265522| SOX6
Seq length: 36
gi|678457845| CMYC
Seq length: 28

I was able to put the lines of the sequences into one string but calculating the length isn't working. I removed any spaces within the new single string and attempted $length = length($line); but that doesn't work. The current method in my script also gives the same output like:::
>gi|546265522| SOX6
121212

>gi|678457845| CMYC
1414



How do I solve this problem??
Ultimately i want some like this but I am taking it in steps so that I actually understand what I'm doing and why.
gi|678457845| CMYC (tab) seq length (tab) AT/GC content

Here is my script

#!/usr/bin/perl -w

print "file: \n";
$in = <STDIN>;
chomp $in;

print "output file: \n";
$out = <STDIN>;
chop $out;

unless ( open(IN, $in) ) {
die ("cant input file $in\n");}

unless ( open(OUT, ">$out") ) {
die("cant open output file $out\n");}

my $line = <IN>;
print OUT $line;

while ($line = <IN>)
{
chomp $line;
if ($line=~/^>(.+)/) {
print OUT "\n",$line,"\n"; }
else { $line =~ s/^\s*(.*)\s*$/$1/;

$a=($line=~tr/aA//);
$c=($line=~tr/cC//);
$g=($line=~tr/gG//);
$t=($line=~tr/tT//);
$n=($line=~tr/nN//);
$x=($line=~tr/xX//);

$length = $a + $c + $g + $t + $n + $x;
print OUT $length; }
}

print OUT "\n";
>gi|546265522| SOX6
121212


this numbers came from the length of each line but printed in the same line with any space then the number is like "121212" you should read "12-12-12" and if you add in the line marked in red the "-" you obtain this. A way to get the total length of every contig, you have to sum the length of the lines that form each contig and when you find another contig put the length equal to 0 and sum the lines of each contig. If you need more help I modify the code but I think you can modify it.
jesus garcia is offline   Reply With Quote
Old 11-06-2012, 01:42 PM   #4
newbie2this
Junior Member
 
Location: VA

Join Date: Nov 2012
Posts: 4
Default

thanks bunches for the help but I still wasn`t able to get my script to run the way i want it to.
newbie2this is offline   Reply With Quote
Old 11-06-2012, 02:01 PM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

n2t,

One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

Here is a heavily commented bit of perl which will accomplish what you asked.

Code:
#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift; 
open (IN, "$inFile");

# By default Perl pulls in chunks of text up to a newline (\n) character; newline is
# the default Input Record Separator. You can change the Input Record Separator by
# using the special variable "$/". When dealing with FASTA files I normally change the
# Input Record Separator to ">" which allows your script to take in a full, multiline
# FASTA record at once.

$/ = ">";

# At each input your script will now read text up to and including the first ">" it encounters.
# This means you have to deal with the first ">" at the begining of the file as a special case.

my $junk = <IN>; # Discard the ">" at the begining of the file

# Now read through your input file one sequence record at a time. Each input record will be a
# multiline FASTA entry.

while ( my $record = <IN> ) {
	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
	
# 	Now split up your record into its definition line and sequence lines using split at each newline.
# 	The definition will be stored in a scalar variable and each sequence line as an
# 	element of an array.
	
	my ($defLine, @seqLines) = split /\n/, $record;
	
#	Join the individual sequence lines into one single sequence and store in a scalar variable.
	
	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
	
	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
	
	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline

}
kmcarr is offline   Reply With Quote
Old 11-06-2012, 06:31 PM   #6
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

You'll do yourself a huge favor if you resist reinventing the wheel (not that I always follow this advice!).

In particular, take a look at Bio::SeqIO -- this will let you read & write many sequence formats, including FASTA. Your particular program is roughly (taking the input file from the command line and outputting to STDOUT)

Code:
#!/usr/bin/perl
use strict;
use Bio::SeqIO;
my $reader=new Bio::SeqIO(-format=>'fasta',-file=>shift);
while (my $seqRec=$reader->next_seq)
{
    print join("\t",$seqRec->id,$seqRec->length),"\n";
}
Of course, what you were attempting will get you up to speed on standard Perl, but you'll go far learning the higher-level idioms that BioPerl and other libraries give you, and still have plenty to do with the basics.
krobison is offline   Reply With Quote
Old 11-07-2012, 05:45 AM   #7
newbie2this
Junior Member
 
Location: VA

Join Date: Nov 2012
Posts: 4
Default

Quote:
Originally Posted by kmcarr View Post
n2t,

One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

Here is a heavily commented bit of perl which will accomplish what you asked.

Code:
#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift; 
open (IN, "$inFile");

# By default Perl pulls in chunks of text up to a newline (\n) character; newline is
# the default Input Record Separator. You can change the Input Record Separator by
# using the special variable "$/". When dealing with FASTA files I normally change the
# Input Record Separator to ">" which allows your script to take in a full, multiline
# FASTA record at once.

$/ = ">";

# At each input your script will now read text up to and including the first ">" it encounters.
# This means you have to deal with the first ">" at the begining of the file as a special case.

my $junk = <IN>; # Discard the ">" at the begining of the file

# Now read through your input file one sequence record at a time. Each input record will be a
# multiline FASTA entry.

while ( my $record = <IN> ) {
	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
	
# 	Now split up your record into its definition line and sequence lines using split at each newline.
# 	The definition will be stored in a scalar variable and each sequence line as an
# 	element of an array.
	
	my ($defLine, @seqLines) = split /\n/, $record;
	
#	Join the individual sequence lines into one single sequence and store in a scalar variable.
	
	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
	
	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
	
	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline

}
with some minor changea and addition to suit my objective, this worked like a charm. Thanks for all the help.
newbie2this is offline   Reply With Quote
Old 09-11-2013, 04:29 AM   #8
ebioman
Member
 
Location: Switzerland

Join Date: Aug 2013
Posts: 41
Talking

Thanks,
that script was very useful and simplified my attempt a lot.
From a perl script of view, a lot of things can be simplified.
I was only interested in the e.g. length (for statistics):

while(<>){
chomp;
@split = split /\n/;
push(@size_list, length $split[1]);

}

shift @size_list;

Still, I guess lot of ways to improve and do it another way
ebioman is offline   Reply With Quote
Old 09-11-2013, 05:29 AM   #9
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by ebioman View Post
Thanks,
that script was very useful and simplified my attempt a lot.
From a perl script of view, a lot of things can be simplified.
I was only interested in the e.g. length (for statistics):

while(<>){
chomp;
@split = split /\n/;
push(@size_list, length $split[1]);

}

shift @size_list;

Still, I guess lot of ways to improve and do it another way
You really should follow the best practices and use modern pragmas. That means putting
Code:
use strict; 
use warnings;
and the top of your script and use lexical variables to limit their scope and avoid collisions. Writing in a very minimal style might mean a little less typing, but it will also mean that your script will fail silently, which won't save you any time in the long run.
SES is offline   Reply With Quote
Old 09-11-2013, 05:48 AM   #10
ebioman
Member
 
Location: Switzerland

Join Date: Aug 2013
Posts: 41
Default

My fault, I actually only copied a snippet, since my program is way larger.
But you are right:

Code:
#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;

$/ = ">";
my (@size_list,@split);

while(<>){
     chomp;
     @split = split /\n/;
     push(@size_list, length $split[1]);
   
}

shift @size_list;
Thx
ebioman is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:21 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO