Hi all,
I'm using the Smith-Waterman algorithm of EMBOSS to automaticaly do pairwise sequence alignments.
Afterwards I want to filter them by an identity and length thereshold. To do that I would like to parse the output. All infos are contained in the default output format which looks like
I tried to parse the file with perl, but there is a newline shift somewhere when the alignment ends with a "." I think.
Did anyone allready wrote a script to parse the EMBOSS water output into a tabular format for "multiple pairwise alignments"? Or can anyone tell me how to solve this readin newline problem? My parsing code is the following:
What is not good practice at all: I have to make it a bit more flexible, for the detection of first alignment line, because that depends on the number of input options (e.g. adding an option results in a shift of the alignments (one extra line)).
Thank you for your reply!
I'm using the Smith-Waterman algorithm of EMBOSS to automaticaly do pairwise sequence alignments.
Afterwards I want to filter them by an identity and length thereshold. To do that I would like to parse the output. All infos are contained in the default output format which looks like
#=======================================
#
# Aligned_sequences: 2
# 1: 3end_LTR_oligo
# 2: Cap_S1.1
# Matrix: EDNAFULL
# Gap_penalty: 15.0
# Extend_penalty: 1.5
#
# Length: 9
# Identity: 8/9 (88.9%)
# Similarity: 8/9 (88.9%)
# Gaps: 0/9 ( 0.0%)
# Score: 36.0
#
#
#=======================================
3end_LTR_olig 19 GTTCGCGTT 27
|||||.|||
Cap_S1.1 2 GTTCGTGTT 10
#=======================================
#
# Aligned_sequences: 2
# 1: 3end_LTR_oligo
# 2: Cap_S1.1
# Matrix: EDNAFULL
# Gap_penalty: 15.0
# Extend_penalty: 1.5
#
# Length: 9
# Identity: 8/9 (88.9%)
# Similarity: 8/9 (88.9%)
# Gaps: 0/9 ( 0.0%)
# Score: 36.0
#
#
#=======================================
3end_LTR_olig 19 GTTCGCGTT 27
|||||.|||
Cap_S1.1 2 GTTCGTGTT 10
#=======================================
Did anyone allready wrote a script to parse the EMBOSS water output into a tabular format for "multiple pairwise alignments"? Or can anyone tell me how to solve this readin newline problem? My parsing code is the following:
Code:
#! /usr/bin/perl -w use strict; #Author: Ulrike Löber ([email protected]) #This script should parse default output of EMBOSS's Waterman algorithm #INFILE is the water outfile (srspair format) open (INFILE, "$ARGV[0]") or die "file $ARGV[0] not found"; my @data = <INFILE>; close INFILE; my $numberoflines=@data; my @programm = split(/ +/,$data[1]); my @rundate = split(/ +/,$data[2]); my @gapopenpenalty = split(/ +/,$data[8]); my @gapextensionpenalty = split(/ +/,$data[9]); # main informations are print out in terminal, e.g. programm name, date of the run and parameters print "programm: $programm[2]\n"; print "rundate: $rundate[2] $rundate[3] $rundate[4] $rundate[5] $rundate[6]\n"; print "gap open penalty: $gapopenpenalty[2]\ngap extension penalty: $gapextensionpenalty[2]\n"; #parsed information are written in a file with the same name as the infile with an additional ".tabular" open (OUTFILE,">$ARGV[0].tabular"); # build a table header print OUTFILE "id1\tid2\talignment length\tidentity\tsimilarity\tgaps\tscore\tstart1\tend1\tseq1\tstart2\tend2\tseq2\n"; for (my $i = 0; $i <= 1220; $i++) { my $nextalign=$i*23; my @id1 = split(/ +/,$data[17+$nextalign]); #splitting stings by whitespaces my @id2 = split(/ +/,$data[18+$nextalign]); my @length = split(/ +/,$data[23+$nextalign]); my @identity = split(/\(/,$data[24+$nextalign]); my @similarityscore = split(/\(/,$data[25+$nextalign]); my @gaps = split(/\(/,$data[26+$nextalign]); #whitespace problems if there is an one digit percentage an additional space is in the line my @alignmentscore = split(/ +/,$data[27+$nextalign]); #extraction by "(" instead of whitespace my @seq1 = split(/ +/,$data[32+$nextalign]); my $startseq1 = $seq1[1]; #simple renaming to make the "prints" more readable my $endseq1 = $seq1[3]; my @seq2 = split(/ +/,$data[34+$nextalign]); my $startseq2 = $seq2[1]; my $endseq2 = $seq2[3]; $identity[1]=~s/\%\)//g; #the percentage sign and the brace must be removed $similarityscore[1]=~s/\%\)//g;; $gaps[1]=~s/\%\)//g; # chomp every scalar to remove leading white spaces from the strings chomp $id1[2]; chomp $id2[2]; chomp $length[2]; chomp $identity[1]; chomp $similarityscore[1]; chomp $gaps[1]; chomp $alignmentscore[2]; chomp $startseq1; chomp $endseq1; chomp $seq1[2]; chomp $startseq2; chomp $endseq2; chomp $seq2[2]; # gives selected information for each pairwise alignment print OUTFILE "$id1[2]\t$id2[2]\t$length[2]\t$identity[1]\t$similarityscore[1]\t$gaps[1]\t$alignmentscore[2]\t$startseq1\t$endseq1\t$seq1[2]\t$startseq2\t$endseq2\t$seq2[2]\n"; } close OUTFILE; print "OUTPUT written in $ARGV[0].tabular\n";
Thank you for your reply!
Comment