  • Perl Script

    Hi, I have 2 sets of 2 columns of data. In each set, one column holds gene names and the other holds the number of reads per gene. Each set came from a different program for mapping the genes and counting reads, so the gene columns don't line up row by row. I would like to plot the two sets of read counts against each other to see the correlation, but first I need to build one column of the genes that appear in both sets, with the 2 columns of reads next to it. Does anyone have a Perl script that can do this?

  • #2
    Hi,
    You can use the Unix 'join' command for this task. Please paste the first 5 lines of your input and output files so that someone can write the script. You may refer to this link: http://www.albany.edu/~ig4895/join.htm.
    Best wishes,
    Rahul
    Last edited by rahularjun86; 10-15-2012, 02:39 AM.
    Rahul Sharma,
    Ph.D
    Frankfurt am Main, Germany
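
For reference, a minimal sketch of the 'join' approach suggested above. It assumes each data set is saved as its own two-column, whitespace-separated file (gene name, read count) with no header line; the function name `join_gene_counts` is made up for illustration and is not from the thread.

```shell
# join_gene_counts FILE_A FILE_B
# Hypothetical helper: prints "gene count_A count_B" for genes found in both files.
# Assumes two-column, whitespace-separated inputs without a header line.
join_gene_counts() {
    local a b
    a=$(mktemp) && b=$(mktemp) || return 1
    # join requires both inputs to be sorted on the join field,
    # using the same collation it will use itself
    LC_ALL=C sort -k1,1 "$1" > "$a"
    LC_ALL=C sort -k1,1 "$2" > "$b"
    # default behaviour: key on field 1 of each file, keep only keys present in both
    LC_ALL=C join "$a" "$b"
    rm -f "$a" "$b"
}
```

Something like `join_gene_counts horn.txt degseq.txt > common.txt` would then write the shared genes with both read counts (the file names here are placeholders). Genes that occur in only one file are dropped, which is what a scatter plot of the two counts needs.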



    • #3
      Hi, thank you. Here are the first several lines of the file:

      Gene_Horn Uninduced_Horn Gene_DEGSeq Uninduced_DEGSeq
      Tb04.24M18.150 12 Tb04.24M18.150 172
      Tb04.3I12.100 21 Tb04.3I12.100 11
      Tb05.28F8.200 97 Tb05.5K5.100 52
      Tb05.30F7.410 43 Tb05.5K5.10 19
      Tb06.3A7.270 572 Tb05.5K5.110 5
      Tb06.3A7.960 74 Tb05.5K5.120 9
      Tb07.26A24.210 100 Tb05.5K5.130 24
      Tb09.142.0320 56 Tb05.5K5.140 63
      Tb09.142.0350 201 Tb05.5K5.150 12

      There are thousands of these lines. Basically, I want a script that looks at Gene_Horn and Gene_DEGSeq, finds only those genes that occur in both columns, and writes them as the first column of the output file along with the corresponding 2 columns of reads (Uninduced_Horn and Uninduced_DEGSeq).



      • #4
        You mean that where column 1 (Gene_Horn) and column 3 (Gene_DEGSeq) are the same, print column 1, column 2 and column 3?



        • #5
          I mean that where column 1 (Gene_Horn) and column 3 (Gene_DEGSeq) are the same, print three columns: the overlapping gene name, then Uninduced_Horn (the reads for that gene from Horn), then Uninduced_DEGSeq (the reads for that gene from DEGSeq). That way I can plot both sets of reads for each gene on a scatter plot to see how much variance there is between the two data sets.



          • #6
            OK, thanks. Please try the following Unix one-liner:
            awk '$1 ~ $3{print$1"\t"$2"\t"$4}' input.txt > output.txt
            Best,
            Rahul



            • #7
              Oops, sorry. Please use the following command instead. It tests the two fields for exact string equality rather than a regex match, so it will generate accurate results:
              awk '$1 == $3 {print $1"\t"$2"\t"$4}' demo.txt > out.txt
              Thanks



              • #8
                Rahul, thanks so much, but it only gave me 3 genes that lined up. I checked, and the problem is that the one-liner only finds lines where the two gene names match on the same row. Column 1 and column 3 don't line up row by row, because some genes are in one column and not in the other. So I need a script that looks at all of column 1 and all of column 3 and gives me every gene that is found in both, not just the ones that sit on the same line.
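
One possible way to get every gene shared by the two columns, regardless of row, is to read the file twice with awk: the first pass stores the Horn counts keyed by gene name, and the second pass prints each DEGSeq gene that was stored. This is only a sketch. It assumes the four-column, whitespace-separated layout shown in post #3, one header line, and all four fields present on every row; the function name `common_genes` is made up for illustration.

```shell
# common_genes FILE
# Hypothetical helper: prints gene, Horn count and DEGSeq count (tab-separated)
# for every gene that appears in both column 1 and column 3 of FILE.
common_genes() {
    awk '
        FNR == 1 { next }              # skip the header line in both passes
        NR == FNR {                    # pass 1: store the column 2 count under the column 1 gene
            horn[$1] = $2
            next
        }
        NF >= 4 && ($3 in horn) {      # pass 2: the column 3 gene was also seen in column 1
            print $3 "\t" horn[$3] "\t" $4
        }
    ' "$1" "$1"                        # the same file is passed twice, once per pass
}
```

Calling `common_genes input.txt > output.txt` would then give a three-column file ready for a scatter plot. If the two column pairs have different lengths, the shorter rows would shift fields to the left, so in that case it is safer to split the data into two separate two-column files first.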
