  • Script help please! Replace numbers with sequence IDs

    I have a tab delimited file with sequence pairs and their identity score - eg:

    1 2 77
    1 3 16
    1 4 23 etc

    And a separate tab delimited file with the actual sequence IDs - eg:

    1 contig00345
    2 contig00216
    3 contig00004 etc

    I want to replace the numbers (in the 1st and 2nd columns) in the first file with the sequence IDs. I'm only just starting to learn Perl scripting, and I'm sure this is quite easy. Could someone please help me out here?

  • #2
    For the example given, are you looking for output like:

    contig00345 contig00216 77
    contig00345 contig00004 16
    etc

    If not, could you give an example of the output you are looking for?


    • #3
      Hi d1antho,

      Yep, that's exactly what I'm after!

      Thanks, Amy


      • #4
        Just export into Excel and use the MATCH and INDEX functions.


        • #5
          Hi Amy

          This code should do it.
          Code:
          #!/usr/bin/perl -w
          use strict;
          
          #open the contig file (first command-line argument) for reading
          open(CONTIG, "<", $ARGV[0]) or die "Error opening the input file with contig IDs: $!";
          
          #hash to store contig IDs ie 1,2,3 and values ie contig00345,contig00216 etc
          my %contigs;
          
          #read through the contig file and read into memory
          while(<CONTIG>){
          	chomp $_;	#get rid of the trailing newline
          	
          	my @list = split("\t", $_); #split the current line on any tabs
          	
          	$contigs{$list[0]} = $list[1];	#place the contig ID and value into the HASH
          	#note: if a value from your 1st column appears more than once in the file, its contig ID will get overwritten
          	#if this is the case let me know and I'll write another script
          
          }
          
          #close filehandle
          close(CONTIG);
          
          #open output file
          open(OUT, ">", $ARGV[2]) or die "Error opening the output file: $!";
          
          #open sequence pairs file
          open(SEQS, "<", $ARGV[1]) or die "Error opening the sequence pairs file: $!";
          
          while(<SEQS>){
          	chomp $_;
          	
          	my @array = split("\t", $_);
          	
          	#print contig name corresponding to the value in columns 1 and 2 of the seq pair file and the identity score
          	print OUT "$contigs{$array[0]}\t$contigs{$array[1]}\t$array[2]\n";
          	
          }
          
          #close remaining file handles
          close(SEQS);
          
          close(OUT);
          This has not been tested, but it is commented for you. Because you are new to Perl, I've tried to keep the code simple. If you have tens or hundreds of thousands of lines in the file, Excel won't handle it as well as Perl would (and it might not be possible at all with an older version of Excel). Perl 'excels' at this type of file manipulation.

          Also note that if a value in the first column of your contigs file appears more than once, the script will keep only the last contig ID for it. I suspect that won't be an issue for your data, as it doesn't make sense to label two different contigs with the same number. If it is an issue, another script will be needed.
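Before relying on the lookup, it may be worth checking whether any number really is listed more than once in the ID file. The following is a minimal sketch in Python rather than Perl; the function name `find_duplicate_keys` and the duplicated example data are made up for illustration and are not part of the script above.

```python
from collections import defaultdict


def find_duplicate_keys(lines):
    """Return {number: [contig IDs]} for every number listed more than once."""
    seen = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        number, contig_id = line.split("\t")[:2]
        seen[number].append(contig_id)
    return {number: ids for number, ids in seen.items() if len(ids) > 1}


# Hypothetical ID file contents in which the number 1 is used twice
example = ["1\tcontig00345\n", "2\tcontig00216\n", "1\tcontig00004\n"]
print(find_duplicate_keys(example))
```

An empty result means every number maps to exactly one contig ID, so the overwrite behaviour described above cannot lose anything.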

          Save this script as, for example, id_match.pl. The command to run it should then look like:
          Code:
          perl id_match.pl contig_file sequence_pair_file OUTPUT
          If you have any problems, let me know

          Anthony
          Last edited by d1antho; 03-07-2013, 09:32 AM.


          • #6
            Thanks for the suggestion Jackie, not familiar with these functions so had a quick google....

            I have each number in the 1st file multiple times - I just tried the MATCH function and it only returns the first cell it finds the value in.

            Am I missing something really obvious? Could you be more specific in how I could use these functions?

            Thank you for your time.


            • #7
              Anthony,

              Wow that was quick! Thank you so much - I'm going to try it out now!

              Amy


              • #8
                This won't be an issue with the Perl script. You could also use the VLOOKUP function if you want to stay with Excel.
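The reason repeated numbers are harmless with a lookup table is that each row of the pairs file is translated independently, so a number can appear any number of times. A rough sketch of the idea in Python, using the example data from the first post; `replace_ids` is a hypothetical name, and the sketch assumes every number in the pairs file has an entry in the ID file.

```python
def replace_ids(pair_lines, id_lines):
    """Swap the numbers in columns 1 and 2 for their contig IDs."""
    lookup = {}
    for line in id_lines:
        number, contig_id = line.rstrip("\n").split("\t")[:2]
        lookup[number] = contig_id

    translated = []
    for line in pair_lines:
        first, second, score = line.rstrip("\n").split("\t")[:3]
        # Each row is looked up on its own, so "1" can occur many times
        translated.append("\t".join([lookup[first], lookup[second], score]))
    return translated


ids = ["1\tcontig00345\n", "2\tcontig00216\n", "3\tcontig00004\n"]
pairs = ["1\t2\t77\n", "1\t3\t16\n"]
for row in replace_ids(pairs, ids):
    print(row)
```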


                • #9
                  Hopefully the attachment makes sense.

                  I split the two functions up for clarity. You can embed MATCH within INDEX for brevity.
                  Last edited by JackieBadger; 03-07-2013, 09:14 AM.


                  • #10
                    Thanks Jackie - I'm going to try this too - good to learn!

                    Anthony - I'm getting an error:

                    Bareword "CONTIG" not allowed while "strict subs" in use at id_match.pl line 23.


                    • #11
                      Just saw that myself. The error has been fixed in the code above. In the first while loop I forgot the angle brackets around the filehandle, which are what read it line by line. Basically

                      I had
                      while(CONTIG){


                      instead of
                      while(<CONTIG>){

                      Sorry about that. Should work now.

                      Anthony


                      • #12
                        Brilliant - working perfectly now - thanks again!

                        I've now just got to go through it to make sure I know how it works...great learning exercise.

                        I blinking love this forum!


                        • #13
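                          Python solution:
                          Code:
                          import sys
                          
                          contig_dict = {}
                          
                          # second argument: the file mapping each number to its contig ID
                          with open(sys.argv[2]) as contig_file:
                                  for line in contig_file:
                                          number, contig_id = line.rstrip('\n').split('\t')[:2]
                                          contig_dict[number] = contig_id
                          
                          # first argument: the sequence pairs file (number, number, identity score)
                          with open(sys.argv[1]) as pair_file:
                                  for line in pair_file:
                                          sline = line.rstrip('\n').split('\t')
                                          print('\t'.join([contig_dict[sline[0]], contig_dict[sline[1]], sline[2]]))
                          Run it with:
                          Code:
                          python combiner.py datafile.tab contigfile.tab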
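One case worth guarding against: a number in the pairs file that has no entry in the ID file. With a plain Python dictionary lookup that raises a KeyError. The sketch below shows one possible way to handle it; `translate` and the `UNKNOWN_` placeholder are hypothetical choices for illustration, not something from the posts above.

```python
def translate(number, lookup, missing="UNKNOWN_{}"):
    """Return the contig ID for a number, or a visible placeholder if absent."""
    return lookup.get(number, missing.format(number))


lookup = {"1": "contig00345", "2": "contig00216"}
print(translate("1", lookup))  # known number
print(translate("4", lookup))  # number with no entry in the ID file
```

Whether to substitute a placeholder, skip the row, or stop with an error depends on the data; a placeholder at least makes the gaps easy to search for in the output.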


                          • #14
                            Take a look here: http://stackoverflow.com/questions/6...-by-dictionary

                            It may give you an idea of how to do it.
