  • [PERL] Text files manipulation/reorganisation

    Hello everyone,
    Guess what, I have a question! Here it is:
    I have two files organised a bit like FASTA files (but they are txt files) that look like this:
    >1 count:272019
    TACCTGGTTGATCCTGCCAG
    >2 count:48613
    TTTGGATTGAAGGGAGCTCTA
    >3 count:15422
    TTTGGATTGAAGGGAGCTCT
    >4 count:9818
    TTGGACTGAAGGGAGCT
    >5 count:8783
    TTGGACTGAAGGGAGCTCCCT
    These two files share some sequences, and each also contains sequences found in only one of them. The sequences are not in the same order in the two files, and for an identical sequence the count value differs between the files. What I want to do is order the two files so that the sequences appear in the same order in each of them. I also want to change the name so that it consists only of the count value. For example, what I want to obtain is:
    272019
    TACCTGGTTGATCCTGCCAG
    48613
    TTTGGATTGAAGGGAGCTCTA
    15422
    TTTGGATTGAAGGGAGCTCT
    9818
    TTGGACTGAAGGGAGCT
    8783
    TTGGACTGAAGGGAGCTCCCT
    I also want a sequence that is only in file1, for example, to appear in file2 with a count value of 0. Likewise, if a sequence is only in file2, I want it to appear in file1 with a count value of 0.
    So I wrote a script that seems to work for sorting my sequences into the same order in the two files. I say "seems to work" because the two files now have the same number of sequence lines and they are in the same order.
    But I have a problem: I don't know how to extract the count value and keep it associated with its sequence, or how to set this value to 0 if my sequence is only in one file. If I understood the principle of hashes correctly, since my sequences have a different count value in each file, I can't do a

    Code:
    if (exists $results{$count})
    {
       $results{$count}++;
    }
    It only works for identical values, doesn't it?

    Well, all that to say I'm lost and stuck on this part of my script, so if anyone can help me, I will really appreciate it. Thank you very much.

    By the way, here is the code I wrote, which works for sorting my sequences (if anybody thinks this code is not really good and could be optimised, don't hesitate to tell me!):
    Code:
    use warnings;
    use strict;
    my $fast1="C:/Users/Moi/fichier1.fasta";
    open (my $IN1, "<", $fast1) or die "Cannot open file $fast1: $!";
    my $fast2="C:/Users/Moi/fichier2.fasta";
    open (my $IN2, "<", $fast2) or die "Cannot open file $fast2: $!";
    my $trie1="C:/Users/Moi/fichier1bis.fasta";
    open (my $OUT1, ">", $trie1) or die "Cannot open file $trie1: $!";
    my $trie2="C:/Users/Moi/fichier2bis.fasta";
    open (my $OUT2, ">", $trie2) or die "Cannot open file $trie2: $!";
    my %results;
    #my @tab;
    my ($name, $line);
    while($name = <$IN1> )
    {
        $line=<$IN1>;
        chomp $name;
        chomp $line;
        #@tab = split (/:/, $name);
        #$count=$tab[1];
        $results{$line}=1;
    }
    while($name = <$IN2>)
    {
        $line=<$IN2>;
        chomp $name;
        chomp $line;
        #@tab = split (/:/, $name);
        #$count=$tab[1];
        if (exists $results{$line})
        {
            $results{$line}++;
        }
    }
    foreach $line (keys %results)
    {
        if ($results{$line} == 1)
        {
            print $OUT1 "$line\n";
        }
        if ($results{$line} == 2)
        {
            print $OUT1 "$line\n";
            print $OUT2 "$line\n";
        }
        else
        {
            print $OUT2 "$line\n";
        }
    }
    close ($IN1);
    close ($IN2);
    close ($OUT1);
    close ($OUT2);
    The commented lines are an idea I had for extracting my count values, but I didn't see how to continue with it, so...
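One possible way to finish the idea in the commented lines is sketched below. It assumes every header looks like ">1 count:272019"; the helper name count_from_header is hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the number out of a header such as ">1 count:272019".
# Returns undef if the line does not contain "count:<digits>".
sub count_from_header {
    my ($header) = @_;
    return $header =~ /count:(\d+)/ ? $1 : undef;
}

# The split-based idea from the commented lines gives the same number here,
# because this header contains exactly one colon.
my @tab = split /:/, ">1 count:272019";

print count_from_header(">1 count:272019"), "\n";   # prints 272019
print "$tab[1]\n";                                  # prints 272019
```

The regex version is a little more defensive than split: it returns undef for a line that has no "count:" part instead of silently producing an empty value.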

    Once again, thanks for any help you can give me!
    Have a nice day!

  • #2
    [PERL] Text files manipulation/reorganisation

    Hi,

    I would use a hash of arrays.

    $results{$line}->[0] would be 1 for sequences that
    are found in file1 ($results{$line}->[0] = 1),

    and

    $results{$line}->[1] would be 1 for sequences that
    are found in file2 ($results{$line}->[1] = 1).

    When you read through the first file, you could set the count for that sequence in the second file to 0 ($results{$line}->[1] = 0).

    Then, when you are reading through the second file, check whether that sequence already exists; if it doesn't, set the count for file1 to 0 ($results{$line}->[0] = 0).

    When you print the contents of the hash, if $results{$line}->[0] is 0, print that line to output file 2 only, etc.
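The presence-flag idea described above could look roughly like this. It is only a sketch: the two arrays hold made-up sequences standing in for the input files, and the `//=` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Made-up sequences standing in for the two input files.
my @file1 = qw(AAAA CCCC);
my @file2 = qw(CCCC GGGG);

my %results;   # sequence => [ seen in file1, seen in file2 ]

for my $seq (@file1) {
    $results{$seq} = [1, 0];      # in file1; not yet seen in file2
}
for my $seq (@file2) {
    $results{$seq} //= [0, 0];    # new key: this sequence was not in file1
    $results{$seq}->[1] = 1;
}

# Prints one line per sequence: AAAA 1 0, CCCC 1 1, GGGG 0 1 (tab-separated).
for my $seq (sort keys %results) {
    print "$seq\t$results{$seq}->[0]\t$results{$seq}->[1]\n";
}
```

Because the hash is keyed by the sequence itself, each sequence gets exactly one entry no matter which file it came from.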

    Hope this helps,
    maria



    • #3
      Thanks for your answer, but I have some questions about it: the way I understand it, I will only have the unique sequences in my output file, no? Sorry, I'm a newbie in Perl, still learning, and sometimes even obvious things are not so obvious to me...
      In fact, I don't understand how the hash of arrays will allow me to save the count value and keep it associated with the corresponding sequence.



      • #4
        [PERL] Text files manipulation/reorganisation

        You're right, I forgot that you need to have the counts associated with your
        sequence.

        But the hash of arrays will still work:


        When you read through file 1,

        # $name is the whole header line; keep only the part after "count:" if you want just the number
        $results{$line}->[0] = $name;
        $results{$line}->[1] = 0;

        When you read through file 2,

        if (exists $results{$line}){
            $results{$line}->[1] = $name;
        }
        else{
            $results{$line}->[0] = 0;
            $results{$line}->[1] = $name;
        }

        When you go through the hash to print the two output files,
        do something like:

        foreach $line (keys %results){
            print $OUT1 "$line\t $results{$line}->[0]\n";
            print $OUT2 "$line\t $results{$line}->[1]\n";
        }


        The hash keys are unique, but the values associated with each key are (anonymous) arrays instead of single (scalar) values.

        So you store the counts for file1 at position [0] of each array, and the counts for file2 at position [1] of each array.

        hopefully this should work.
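Putting the pieces of this thread together, a complete run might look like the sketch below. It makes several assumptions: headers always contain "count:" followed by digits, every header is followed by one sequence line, and Perl is 5.10 or later (for `//`). The sub name read_counts and the in-memory test data are invented for illustration, and the output is collected in strings rather than written to the real output files.

```perl
use strict;
use warnings;

# Read ">id count:N" / sequence line pairs from $fh and store N in
# $results->{sequence}[$slot] (slot 0 = file1, slot 1 = file2).
sub read_counts {
    my ($fh, $slot, $results) = @_;
    while (defined(my $name = <$fh>)) {
        my $seq = <$fh>;
        last unless defined $seq;            # header without a sequence line
        chomp($name, $seq);
        my ($count) = $name =~ /count:(\d+)/;
        $results->{$seq} //= [0, 0];         # unseen sequences start at 0 in both files
        $results->{$seq}[$slot] = $count // 0;
    }
}

# Two small in-memory "files" with invented data, so the sketch runs as-is.
my $text1 = ">1 count:10\nAAAA\n>2 count:5\nCCCC\n";
my $text2 = ">1 count:7\nCCCC\n>2 count:3\nGGGG\n";
open(my $IN1, "<", \$text1) or die "Cannot open in-memory file 1: $!";
open(my $IN2, "<", \$text2) or die "Cannot open in-memory file 2: $!";

my %results;
read_counts($IN1, 0, \%results);
read_counts($IN2, 1, \%results);

# Both outputs walk the keys in the same sorted order, so the sequences line up.
my ($out1, $out2) = ("", "");
for my $seq (sort keys %results) {
    $out1 .= "$results{$seq}[0]\n$seq\n";
    $out2 .= "$results{$seq}[1]\n$seq\n";
}
print "file1:\n$out1", "file2:\n$out2";
```

To use it on real data, the in-memory handles would be replaced with handles opened on the actual input files, and the two strings printed to $OUT1 and $OUT2. Sorting the keys is a design choice that makes the order reproducible between runs; a plain `keys %results` loop also gives both files the same order within a single run.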



        • #5
          Oh OK! Thank you so much, I understand how it works now. I will try to write the script, thank you!

          Here I am again: I did as you told me and it works perfectly! So once again, thank you Maria, and thank you for the explanations, because now I know it works and I will be able to use it for something else if I need to.
          Last edited by Kawaccino; 04-07-2013, 11:36 AM.
