**zhidkov.ilia** · 11-05-2013, 03:55 AM

Hi,
you can change the alignment output format:

emboss/bin/water seq1 seq2 -gapopen 10.0 -gapextend 0.5 -aformat score -outfile aln.tab

Output will show only the names of the sequences, the length of the alignment and the score.

For additional output options, see: http://emboss.sourceforge.net/docs/t...gnFormats.html

Ilia

**gringer** · 11-05-2013, 03:56 AM

Holy specifics, batman! This looks nothing like the Perl I'm used to. You've got the following lines:

Code:

open (INFILE, "$ARGV[0]") or die "file $ARGV[0] not found";
	my @data	=	<INFILE>;
close INFILE;
# process @data

This assumes that the entire file fits into memory. Okay, that's probably true for your SW files, but it's not true more generally. You don't seem to be using any of the nice features of Perl, and are treating it mostly like an array-storage utility. What I'm more used to is something like this:

Code:

my %variables = ();
open (INFILE, "$ARGV[0]") or die "file $ARGV[0] not found";
while(<INFILE>){
  # process current line
  if(/pattern1(.*)pattern2/){
    $variables{pattern1} = $1;
  }
  if(/endPattern/){
    # process %variables and print to OUTFILE
  }
}
close INFILE;

There are plenty of optimisations and generalisations that can be done with your code [e.g. using join("\t",(varia,bles)) instead of print("varia\tbles")], but you are going to save yourself a whole lot of pain by going back to your tablet and rewriting this to be a line-wise pattern-based process.
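
A minimal sketch of that line-wise, pattern-based style, for illustration only. The sub name `summarise`, the choice of fields, and the assumption that the `# Score:` line is the last header line of each alignment are mine and not taken from the posted script. The line shapes (`# 1:`, `# 2:`, `# Length:`, `# Score:`) are those of the EMBOSS pairwise report header.

```perl
use strict;
use warnings;
use 5.010;    # \R and // need Perl 5.10 or newer

# Columns of the output table (illustrative selection)
my @fields = qw(id1 id2 length score);

# Read alignment headers from $in one line at a time and print one
# tab-separated row to $out per alignment.
sub summarise {
    my ($in, $out) = @_;
    my %record;
    while (my $line = <$in>) {
        $line =~ s/\R\z//;    # strip the line ending (\n or \r\n)
        if ($line =~ /^# ([12]):\s+(\S+)/) {
            $record{"id$1"} = $2;
        }
        elsif ($line =~ /^# Length:\s+(\d+)/) {
            $record{length} = $1;
        }
        elsif ($line =~ /^# Score:\s+(\S+)/) {
            $record{score} = $1;
            # Assumption: Score closes the header, so the record is complete
            print {$out} join("\t", map { $record{$_} // 'NA' } @fields), "\n";
            %record = ();
        }
    }
    return;
}
```

For example, `summarise(\*STDIN, \*STDOUT)` would filter a report piped in on standard input. Only one record is held in memory at a time, and adding a column means adding one pattern and one entry in `@fields`.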

**uloeber** · 11-05-2013, 04:02 AM

Thank you for your fast reply.
@zhidkov.ilia: thank you, but I already know the output formats. What I need to keep are the identities and the start and end points of the alignments.
@gringer: whether or not the read-in is memory efficient (thank you for the hint), how can I get rid of the newline problem?

**gringer** · 11-05-2013, 12:20 PM

Originally posted by uloeber:

how can I get rid of the newline problem?

You have constants all through your code where they don't need to be (e.g. 1220, as a very obvious one) -- that's the main cause of this specific problem. Your code should be changed considerably, and while I could suggest a change that could fix this specific problem, it would only be kicking the can further down the road.

**uloeber** · 11-05-2013, 02:05 PM

Ah, yes, excuse me, that was just for testing; in the original code it's $linenumber divided by 23.

**gringer** · 11-06-2013, 03:01 AM

Originally posted by gringer:

Your code should be changed considerably, and while I could suggest a change that could fix this specific problem, it would only be kicking the can further down the road.

Okay, I'll try saying that in a different way. Your code looks like it was written by someone who is trying out Perl as their first programming language and has only studied arrays and the very basics of regular expressions. There are other ways to do what you've done with considerably less frustration; please read up on good style for Perl code.

You can find a tutorial about how to write perl code here:

Beginner's Introduction to Perl - Part 2

http://www.perl.com/pub/2000/11/begperl2.html

Editor's note: this venerable series is undergoing updates. You might be interested in the newer versions, available at: A Beginner's Introduction to Perl 5.10 A Beginner's Introduction to Files and Strings with Perl 5.10 A Beginner's Introduction to Regular...

If you want a bit less hand-holding, look at the examples in the Perl documentation, for example here.

**uloeber** · 11-12-2013, 03:23 AM

Solved

I solved it; maybe this is helpful for people with the same problem.

Code:

#! /usr/bin/perl 
use strict;

# this script is written by Ulrike Löber (contact [email protected])
# function: parsing "multiple" EMBOSS pairwise (Smith-Waterman) sequence alignments
# output is a tabular summary of the results containing information of identity, start and end points,...
# this is missing in default outputs, complete information not contained in tabular output by EMBOSS

#start with read in of the data

open (INFILE, "$ARGV[0]") or die "no such file $ARGV[0]\n";

#initialize variables
my $bool=0;
my $check_odd_even;
my @splitstring;
my $id1;
my $id2;
my $length;
my $identity;
my $similarity;
my $gaps;
my $score;
my $start1;
my $end1;
my $seq1;
my $start2;
my $end2;
my $seq2;
my $pattern_id1;
my $pattern_id2;

# initialize outfile header 
my $outfile="$ARGV[0].tab";
open (OUTFILE, ">$outfile") or die "error creating file $outfile !\n";
print OUTFILE "id1\tid2\tlength\tidentity\tsimilarity\tgaps\tscore\tstart1\tseq1\tend1\tstart2\tseq2\tend2\n";
 while (defined(my $line=<INFILE>)) {
	chomp $line;
	$line=~s/\R//g;	#remove linebreaks
# first information to extract are the "method" information
	if ($line	=~	m/\#{10,}/g){	#there should be only two lines with at least 10 # in it, before and after the description at the beginning of the file
		$bool	=	$bool+1;
		$check_odd_even=$bool%2;
	}
	if	($check_odd_even  	==	1){
#		print "$line\n";
	}
# following the alignment information, written in the new outfile
	elsif	($check_odd_even	==	0){
		if($line	=~	m/\#\ 1\:/g){			#if line structure is like "# 1: seqname" print out third element of the line 
			@splitstring	=	split(/ +/,$line);	#split string by whitespace
			$id1		=	$splitstring[2];	
			print OUTFILE "$id1\t";
			$pattern_id1=substr($id1,0,13);			# attention! in the alignment lines, seqids are cropped to 13 characters!
		}	
		elsif($line	=~	m/\#\ 2\:/g){			#if line structure is like "# 2: seqname" print out third element of the line
			@splitstring	=	split(/ +/,$line);	
			$id2		=	$splitstring[2];
			print OUTFILE "$id2\t";
			$pattern_id2=substr($id2,0,13);			# attention! in the alignment lines, seqids are cropped to 13 characters!
		}
		elsif($line=~m/\#\ Length\:/ig){			#if line structure is like "# Length: 24" print out third element of the line
			@splitstring	=	split(/ +/,$line);
			$length		=	$splitstring[2];
			print OUTFILE "$length\t";
		}
		elsif($line=~m/\#\ Identity\:/i){			#if line structure is like "# Identity: 9/10 (90.0%)"
			($identity)	=	$line =~ m/\(\s*([\d.]+)%\)/;	#capture the percentage inside the braces; also works if it is padded, e.g. "( 5.0%)"
			print OUTFILE "$identity\t";
		}
		elsif($line=~m/\#\ Similarity\:/i){
			($similarity)	=	$line =~ m/\(\s*([\d.]+)%\)/;
			print OUTFILE "$similarity\t";
		}
		elsif($line=~m/\#\ Gaps\:/i){
			($gaps)	=	$line =~ m/\(\s*([\d.]+)%\)/;
			print OUTFILE "$gaps\t";
		}
		elsif($line=~m/\#\ Score\:/ig){
			@splitstring	=	split(/ +/,$line);
			$score		=	$splitstring[2];
			print OUTFILE "$score\t";
		}
		
		
		elsif($line=~m/^\Q$pattern_id1\E\ /){			#\Q...\E: match the id literally, even if it contains characters like | or .
			@splitstring	=	split(/ +/,$line);
			$start1		=	$splitstring[1];
			$seq1		=	$splitstring[2];
			$end1		=	$splitstring[3];
			print OUTFILE "$start1\t$seq1\t$end1\t";
		}
		
		
		elsif($line=~m/^\Q$pattern_id2\E\ /){			#\Q...\E: match the id literally, even if it contains characters like | or .
			@splitstring	=	split(/ +/,$line);
			$start2		=	$splitstring[1];
			$seq2		=	$splitstring[2];
			$end2		=	$splitstring[3];
			print OUTFILE "$start2\t$seq2\t$end2\t";
		}
		elsif($line=~m/Aligned_sequences/g){			#insert linebreak, when next alignment starts
			print OUTFILE "\n";
		}
	}
}
print OUTFILE "\n";	#terminate the last row
close INFILE;
close OUTFILE;
print "\nNow parsed to tabular format and written to $outfile\n";
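
As a possible alternative to splitting on whitespace and then stripping the braces, the statistics lines can be taken apart with a single regex capture. This is only a sketch: the sub name `parse_stat_line` is hypothetical, and the optional padding inside the parentheses (as in `( 0.0%)`) is an assumption about how EMBOSS prints percentages below 10.

```perl
use strict;
use warnings;

# Parse a statistics line such as "# Identity:       9/10 (90.0%)".
# Returns (name, count, total, percent), or an empty list if the line
# does not have that shape.
sub parse_stat_line {
    my ($line) = @_;
    if ($line =~ m{^#\s+(Identity|Similarity|Gaps):\s+(\d+)/(\d+)\s+\(\s*([\d.]+)%\)}) {
        return (lc $1, $2, $3, $4);
    }
    return;
}
```

Capturing the count and the total as well keeps the raw numbers available, so the percentage can be recomputed at a different precision if needed.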

**gringer** · 11-12-2013, 11:32 AM

Much better! Thanks for sharing your solution.

**SES** · 11-13-2013, 06:19 AM

Originally posted by gringer:

You can find a tutorial about how to write perl code here:

Beginner's Introduction to Perl - Part 2

http://www.perl.com/pub/2000/11/begperl2.html

Editor's note: this venerable series is undergoing updates. You might be interested in the newer versions, available at: A Beginner's Introduction to Perl 5.10 A Beginner's Introduction to Files and Strings with Perl 5.10 A Beginner's Introduction to Regular...

That article was published more than 13 years ago. The single best piece of advice I could give for selecting a Perl guide is to look at the publication date. Unfortunately, there are a lot of how-to guides on the internet that are really out of touch with modern Perl.
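
As one concrete illustration (my own pick, not something spelled out in the post above): newer guides generally recommend `use warnings`, lexical filehandles, and the three-argument form of `open`, instead of the bareword `INFILE` with two-argument `open` seen earlier in this thread. A minimal sketch, with a hypothetical sub name `count_lines`:

```perl
use strict;
use warnings;

# Count the lines of a file, reading it one line at a time.
sub count_lines {
    my ($path) = @_;
    # Three-argument open: the mode is separate from the file name, so a
    # name starting with ">" or ending with "|" cannot change what open does.
    open(my $in, '<', $path) or die "Cannot open '$path': $!\n";
    my $count = 0;
    while (my $line = <$in>) {
        $count++;
    }
    close($in);
    return $count;
}
```

The lexical handle `$in` is scoped to the sub, so it cannot collide with a filehandle of the same name elsewhere in the program.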

Parsing Smith-Waterman Alignment EMBOSS
