**d1antho** · 03-07-2013, 08:36 AM

For the example given, are you looking for an output like:

contig00345 contig00216 77
contig00345 contig00004 16
etc

Could you give an example of the output you are looking for?

**AmyEllison** · 03-07-2013, 08:42 AM

Hi d1antho,

Yep, that's exactly what I'm after!

Thanks, Amy

**JackieBadger** · 03-07-2013, 08:49 AM

Just export it into Excel and use the MATCH and INDEX functions.

**d1antho** · 03-07-2013, 09:02 AM

Hi Amy

This code should do it.

Code:

#!/usr/bin/perl -w
use strict;

#open the contig file
open(CONTIG, "<", $ARGV[0]) or die "Error opening the input file with contig IDs: $!\n";

#hash to store contig IDs ie 1,2,3 and values ie contig00345,contig00216 etc
my %contigs;

#read through the contig file and read into memory
while(<CONTIG>){
	chomp $_;	#remove the trailing newline
	
	my @list = split("\t", $_); #split the current line on any tabs
	
	$contigs{$list[0]} = $list[1];	#place the contig ID and value into the HASH
	#note: if a value from your 1st column appears more than once in the file, the earlier value will get overwritten
	#if this is the case let me know and I'll write another script

}

#close filehandle
close(CONTIG);

#open output file
open(OUT, ">", $ARGV[2]) or die "Error opening the output file: $!\n";

#open sequence pairs file
open(SEQS, "<", $ARGV[1]) or die "Error opening the sequence pairs file: $!\n";

while(<SEQS>){
	chomp $_;
	
	my @array = split("\t", $_);
	
	#print contig name corresponding to the value in columns 1 and 2 of the seq pair file and the identity score
	print OUT "$contigs{$array[0]}\t$contigs{$array[1]}\t$array[2]\n";
	
}

#close remaining file handles
close(SEQS);

close(OUT);

This has not been tested, but it is commented for you. Because you are new to Perl, I've tried to keep the code simple. If you have tens or hundreds of thousands of lines in the file, Excel won't handle it as well as Perl would (and it might not be possible at all with an older version of Excel). Perl 'excels' at this type of file manipulation.

Also, just to note: if a value in the first column of your contigs file appears more than once, the script will only keep the last value. I suspect that won't be an issue for your data, as it doesn't make sense to label two different contigs with the same value. If it is an issue, another script will be needed.

Save this script as, for example, id_match.pl. The command to run it should then look like:

Code:

perl id_match.pl contig_file sequence_pair_file OUTPUT

If you have any problems, let me know.

Anthony
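
On the duplicate-ID caveat mentioned above: a minimal Python sketch of how one might check the contig file before running the replacement. It assumes a two-column, tab-separated layout (numeric ID, then contig name); the function name `find_duplicate_ids` and the sample lines are hypothetical and only there for illustration.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {id: [names]} for every ID that maps to more than one distinct name."""
    seen = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        contig_id, name = fields[0], fields[1]
        if name not in seen[contig_id]:
            seen[contig_id].append(name)
    return {cid: names for cid, names in seen.items() if len(names) > 1}


# Hypothetical sample data: ID "1" is used for two different contigs.
sample = ["1\tcontig00345\n", "2\tcontig00216\n", "1\tcontig00004\n"]
duplicates = find_duplicate_ids(sample)
print(duplicates)
```

If the result is empty, a simple hash or dictionary lookup loses no information. To check a real file, the lines could be passed in with something like `find_duplicate_ids(open("contig_file"))`.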

**AmyEllison** · 03-07-2013, 09:03 AM

Thanks for the suggestion, Jackie. I'm not familiar with these functions, so I had a quick Google...

I have each number in the 1st file multiple times - I just tried the MATCH function and it only returns the first cell it finds the value in.

Am I missing something really obvious? Could you be more specific in how I could use these functions?

Thank you for your time.

**AmyEllison** · 03-07-2013, 09:06 AM

Anthony,

Wow that was quick! Thank you so much - I'm going to try it out now!

Amy

**d1antho** · 03-07-2013, 09:06 AM

This won't be an issue with the Perl script. You could also use the VLOOKUP function if you want to stay with Excel.

**JackieBadger** · 03-07-2013, 09:11 AM

Hopefully the attachment makes sense.

I split the two functions up for clarity. You can embed MATCH within INDEX for brevity.

Attached Files

MatchIndexTemp.jpg (93.0 KB, 34 views)
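
For readers who cannot see the attachment, here is a rough Python analogy of the MATCH-inside-INDEX idea. It is a sketch, not a reproduction of the spreadsheet: the helper names `match`, `index` and `lookup` and the sample columns are assumptions. MATCH with exact matching returns the position of the first hit, which is enough as long as each ID appears only once in the lookup table; the pairs sheet can repeat IDs freely because the formula is evaluated once per row.

```python
# Hypothetical lookup table: one column of numeric IDs, one of contig names.
id_column = ["1", "2", "3"]
name_column = ["contig00345", "contig00216", "contig00004"]


def match(value, column):
    """Like MATCH(value, column, 0): 1-based position of the first exact match."""
    return column.index(value) + 1  # raises ValueError if the value is absent


def index(column, position):
    """Like INDEX(column, position): the value at a 1-based position."""
    return column[position - 1]


def lookup(value):
    """MATCH embedded within INDEX, as suggested in the post above."""
    return index(name_column, match(value, id_column))


# Rows of the pairs sheet: ID, ID, identity score (hypothetical values).
pair_rows = [("1", "2", "77"), ("1", "3", "16")]
converted = [(lookup(a), lookup(b), score) for a, b, score in pair_rows]
for row in converted:
    print("\t".join(row))
```

ID "1" occurs in both rows of the pairs data and is resolved correctly each time, since each lookup is independent.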

**AmyEllison** · 03-07-2013, 09:24 AM

Thanks Jackie - I'm going to try this too - good to learn!

Anthony - I'm getting an error:

Bareword "CONTIG" not allowed while "strict subs" in use at id_match.pl line 23.

**d1antho** · 03-07-2013, 09:35 AM

Just saw that myself. The error has been fixed in the code above. In the first while loop I forgot the angle brackets that read from the file handle. Basically

I had
while(CONTIG){

instead of
while(<CONTIG>){

Sorry about that. Should work now.

Anthony

**AmyEllison** · 03-07-2013, 09:39 AM

Brilliant - working perfectly now - thanks again!

I've now just got to go through it to make sure I know how it works...great learning exercise.

I blinking love this forum!

**tsully87** · 03-07-2013, 09:58 AM

Python solution:

Code:

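import sys

contig_dict = {}

# sys.argv[2] is the contig file: numeric ID <tab> contig name
with open(sys.argv[2]) as contig_file:
    for line in contig_file:
        fields = line.rstrip('\r\n').split('\t')
        contig_dict[fields[0]] = fields[1].strip()

# sys.argv[1] is the pairs file: ID <tab> ID <tab> identity score
with open(sys.argv[1]) as pair_file:
    for line in pair_file:
        sline = line.rstrip('\r\n').split('\t')
        print('\t'.join([contig_dict[sline[0]], contig_dict[sline[1]], sline[2]]))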

python combiner.py datafile.tab contigfile.tab
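
One thing neither script handles explicitly is an ID in the pairs file that is missing from the contig file: the Python dictionary lookup raises a `KeyError`, and the Perl version prints an empty field with a warning. The sketch below is one possible way to deal with that, working on in-memory lines so it can be tried without files. The function name `replace_ids`, the `"NA"` placeholder and the sample data are all assumptions made for illustration.

```python
def replace_ids(contig_lines, pair_lines, missing="NA"):
    """Swap numeric IDs in columns 1 and 2 for contig names, keeping the score."""
    contig_names = {}
    for line in contig_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 2:
            contig_names[fields[0]] = fields[1]

    output = []
    for line in pair_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        first = contig_names.get(fields[0], missing)
        second = contig_names.get(fields[1], missing)
        output.append("\t".join([first, second, fields[2]]))
    return output


# Hypothetical sample data; ID "9" has no entry in the contig table.
contigs = ["1\tcontig00345\n", "2\tcontig00216\n", "3\tcontig00004\n"]
pairs = ["1\t2\t77\n", "1\t3\t16\n", "1\t9\t50\n"]
result = replace_ids(contigs, pairs)
for row in result:
    print(row)
```

Writing a placeholder rather than stopping keeps every row of the pairs file in the output, so unmapped IDs can be found afterwards by searching for `NA`.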

**crazyhottommy** · 03-07-2013, 10:22 AM

Take a look here: http://stackoverflow.com/questions/6...-by-dictionary

It may give you an idea of how to do it.

Script help please! Replace numbers with sequence IDs