SEQanswers

Script help please! Replace numbers with sequence IDs

03-07-2013, 07:25 AM   #1
AmyEllison
Member
Location: Ithaca, NY
Join Date: Nov 2012
Posts: 16

I have a tab-delimited file with sequence pairs and their identity scores, e.g.:

1 2 77
1 3 16
1 4 23 etc

And a separate tab-delimited file with the actual sequence IDs, e.g.:

1 contig00345
2 contig00216
3 contig00004 etc

I want to replace the numbers (in the 1st and 2nd columns) in the first file with the corresponding sequence IDs. I'm only just starting to learn Perl scripting and I'm sure this is quite easy. Could someone please help me out here?

03-07-2013, 07:36 AM   #2
d1antho
Member
Location: Ireland
Join Date: Mar 2012
Posts: 15

For the example given, are you looking for an output like:

contig00345 contig00216 77
contig00345 contig00004 16
etc

If not, could you give an example of the output you are looking for?

03-07-2013, 07:42 AM   #3
AmyEllison
Member
Location: Ithaca, NY
Join Date: Nov 2012
Posts: 16

Hi d1antho,

Yep, that's exactly what I'm after!

Thanks, Amy

03-07-2013, 07:49 AM   #4
JackieBadger
Senior Member
Location: Halifax, Nova Scotia
Join Date: Mar 2009
Posts: 381

Just export into Excel and use the MATCH and INDEX functions.

03-07-2013, 08:02 AM   #5
d1antho
Member
Location: Ireland
Join Date: Mar 2012
Posts: 15

Hi Amy

This code should do it.
Code:
#!/usr/bin/perl -w
use strict;

# open the contig ID file (first argument)
open(CONTIG, "<", $ARGV[0]) or die "Error opening the input file with contig IDs: $!\n";

# hash mapping keys (1, 2, 3, ...) to contig IDs (contig00345, contig00216, ...)
my %contigs;

# read through the contig file and load it into memory
while(<CONTIG>){
	chomp $_;	# remove the trailing newline
	
	my @list = split("\t", $_);	# split the current line on tabs
	
	$contigs{$list[0]} = $list[1];	# store the key and its contig ID in the hash
	# note: if a value from the 1st column appears more than once in the file, the earlier entry gets overwritten
	# if this is the case let me know and I'll write another script

}

# close filehandle
close(CONTIG);

# open output file (third argument)
open(OUT, ">", $ARGV[2]) or die "Error opening the output file: $!\n";

# open sequence pairs file (second argument)
open(SEQS, "<", $ARGV[1]) or die "Error opening the sequence pairs file: $!\n";

while(<SEQS>){
	chomp $_;
	
	my @array = split("\t", $_);
	
	# print the contig names for the values in columns 1 and 2 of the seq pair file, followed by the identity score
	print OUT "$contigs{$array[0]}\t$contigs{$array[1]}\t$array[2]\n";
	
}

# close remaining filehandles
close(SEQS);

close(OUT);
This has not been tested, but it is commented for you. Because you are new to Perl, I've tried to keep the code simple. If you have tens to hundreds of thousands of lines in the file, Excel won't handle it as well as Perl would (and it might not be possible at all with an older version of Excel). Perl 'excels' at this type of file manipulation. One thing to note: if a value in the first column of your contigs file appears more than once, the script will only keep the last one. I suspect that won't be an issue for your data, as it doesn't make sense to label two different contigs with the same value, but if it is, another script will be needed.
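
To make the overwrite behaviour concrete, here is a minimal Python sketch (Python is used because it also appears later in this thread). The function name `load_id_map` and the sample lines are made up for illustration. It assumes a two-column, tab-delimited ID file like the one in the first post, and it raises an error on a repeated key rather than silently keeping the last value.

```python
def load_id_map(lines):
    """Build {number: contig ID} from tab-delimited lines, rejecting repeated keys."""
    id_map = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # take the first two tab-separated fields: the key and the contig ID
        key, contig_id = line.split("\t")[:2]
        if key in id_map:
            raise ValueError(
                f"line {line_no}: key {key!r} already maps to {id_map[key]!r}"
            )
        id_map[key] = contig_id
    return id_map


# Hypothetical sample data mirroring the example in the first post
sample = ["1\tcontig00345\n", "2\tcontig00216\n", "3\tcontig00004\n"]
print(load_id_map(sample))
```

Whether a duplicate should be an error or a warning depends on the data; failing loudly is just one option.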

Save the Perl script as, for example, id_match.pl; the command to run it should then look like:
Code:
perl id_match.pl contig_file sequence_pair_file OUTPUT
If you have any problems, let me know

Anthony

Last edited by d1antho; 03-07-2013 at 08:32 AM.

03-07-2013, 08:03 AM   #6
AmyEllison
Member
Location: Ithaca, NY
Join Date: Nov 2012
Posts: 16

Thanks for the suggestion Jackie, not familiar with these functions so had a quick google....

I have each number in the 1st file multiple times - I just tried the MATCH function and it only returns the first cell it finds the value in.

Am I missing something really obvious? Could you be more specific in how I could use these functions?

Thank you for your time.

03-07-2013, 08:06 AM   #7
AmyEllison
Member
Location: Ithaca, NY
Join Date: Nov 2012
Posts: 16

Anthony,

Wow that was quick! Thank you so much - I'm going to try it out now!

Amy

03-07-2013, 08:06 AM   #8
d1antho
Member
Location: Ireland
Join Date: Mar 2012
Posts: 15

This won't be an issue with the Perl script. You could also use the VLOOKUP function if you want to stay with Excel.

03-07-2013, 08:11 AM   #9
JackieBadger
Senior Member
Location: Halifax, Nova Scotia
Join Date: Mar 2009
Posts: 381

Hopefully the attachment makes sense.

I split the two functions up for clarity; you can nest MATCH within INDEX for brevity.
Attached Images
File Type: jpg MatchIndexTemp.jpg (93.0 KB, 10 views)
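
For readers without the attachment, the MATCH-then-INDEX idea can be sketched roughly in Python. This is an analogy only, not a reconstruction of the spreadsheet in the image: the helper names `match` and `index` and the two lists are invented here, and `match` mimics an exact-match lookup that returns the first 1-based position. The lookup runs against the ID table, where each number appears once, so repeated numbers in the pairs file are not a problem.

```python
def match(lookup_value, lookup_column):
    """Roughly like Excel MATCH(value, range, 0): 1-based position of the first exact match."""
    return lookup_column.index(lookup_value) + 1


def index(result_column, position):
    """Roughly like Excel INDEX(range, row): the value at a 1-based position."""
    return result_column[position - 1]


# Hypothetical data: the ID table from the first post, held as two columns
numbers = [1, 2, 3]
contigs = ["contig00345", "contig00216", "contig00004"]

# Two separate steps ...
pos = match(2, numbers)
print(index(contigs, pos))

# ... or nested, as suggested for brevity
print(index(contigs, match(3, numbers)))
```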

Last edited by JackieBadger; 03-07-2013 at 08:14 AM.

03-07-2013, 08:24 AM   #10
AmyEllison
Member
Location: Ithaca, NY
Join Date: Nov 2012
Posts: 16

Thanks Jackie - I'm going to try this too - good to learn!

Anthony - I'm getting an error:

Bareword "CONTIG" not allowed while "strict subs" in use at id_match.pl line 23.

03-07-2013, 08:35 AM   #11
d1antho
Member
Location: Ireland
Join Date: Mar 2012
Posts: 15

Just saw that myself. The error has been fixed in the code above. In the first while loop I forgot the angle brackets around the filehandle, which are what actually read a line from it. Basically

I had
while(CONTIG){

instead of
while(<CONTIG>){

Sorry about that. Should work now.

Anthony

03-07-2013, 08:39 AM   #12
AmyEllison
Member
Location: Ithaca, NY
Join Date: Nov 2012
Posts: 16

Brilliant - working perfectly now - thanks again!

I've now just got to go through it to make sure I know how it works...great learning exercise.

I blinking love this forum!

03-07-2013, 08:58 AM   #13
tsully87
Junior Member
Location: Boston, MA
Join Date: May 2011
Posts: 1

Python solution:
Code:
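import sys

contig_dict = {}

# second argument: table mapping each number to its contig ID
with open(sys.argv[2]) as contig_file:
    for line in contig_file:
        fields = line.split('\t')
        contig_dict[fields[0]] = fields[1].strip()

# first argument: sequence pairs with their identity scores
with open(sys.argv[1]) as pair_file:
    for line in pair_file:
        sline = line.rstrip('\n').split('\t')
        print('\t'.join([contig_dict[sline[0]], contig_dict[sline[1]], sline[2]]))
Run it as:
Code:
python combiner.py datafile.tab contigfile.tab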
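
The scripts in this thread assume every number in the pairs file has an entry in the ID file. Below is a hedged sketch of one way to cope when that is not true, using the standard `csv` module. The function name `translate_pairs`, the in-memory sample data, and the choice to pass unknown numbers through unchanged are all assumptions made for illustration.

```python
import csv
import io


def translate_pairs(pairs_file, id_map):
    """Yield (id_a, id_b, score) with columns 1 and 2 replaced via id_map.

    Numbers missing from id_map are passed through unchanged (an assumption;
    raising an error would be an equally reasonable choice).
    """
    for row in csv.reader(pairs_file, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or short lines
        a, b, score = row[:3]
        yield id_map.get(a, a), id_map.get(b, b), score


# Hypothetical in-memory stand-ins for the two files
id_map = {"1": "contig00345", "2": "contig00216", "3": "contig00004"}
pairs = io.StringIO("1\t2\t77\n1\t3\t16\n1\t4\t23\n")

for row in translate_pairs(pairs, id_map):
    print("\t".join(row))
```

With this sample, the third row keeps the bare number 4 in its second column, because no ID was supplied for it.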

03-07-2013, 09:22 AM   #14
crazyhottommy
Senior Member
Location: Gainesville
Join Date: Apr 2012
Posts: 140

Have a look here: http://stackoverflow.com/questions/6...-by-dictionary

It may give you an idea of how to do it.