  • Perl Script

    Hi, I have two data sets of two columns each. In each set, one column holds gene names and the other holds the number of reads per gene. Each set came from a different program for mapping the genes and counting reads, so the rows don't line up. I would like to plot the two sets against each other to see the correlation, but first I need a single column of the genes that appear in both sets, with the two columns of reads next to it. Does anyone have a Perl script that can do this?

  • #2
    Hi,
    You can use the Unix 'join' command for this task. Please paste the first 5 lines of your input/output files so that someone can write the script. You may refer to this link: http://www.albany.edu/~ig4895/join.htm.
    Best wishes,
    Rahul
    Last edited by rahularjun86; 10-15-2012, 02:39 AM.
    Rahul Sharma,
    Ph.D
    Frankfurt am Main, Germany
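
As a rough sketch of the 'join' suggestion above: this assumes each data set has first been saved as its own two-column, whitespace-separated file (gene name, then reads) without a header line. The helper name `join_counts` and both file arguments are placeholders, not anything from the thread. `join` needs both inputs sorted on the key, so the sketch sorts them first.

```shell
# Sketch only: join two 2-column files (gene, reads) on the gene name.
# join_counts is a hypothetical helper; the two file arguments are placeholders.
join_counts() {
    horn_file=$1
    degseq_file=$2
    tmpdir=$(mktemp -d) || return 1
    # join expects both inputs sorted on the join field; LC_ALL=C makes
    # sort and join agree on a plain byte-wise ordering.
    LC_ALL=C sort -k1,1 "$horn_file" > "$tmpdir/a"
    LC_ALL=C sort -k1,1 "$degseq_file" > "$tmpdir/b"
    # Prints: gene, reads from file 1, reads from file 2.
    # Genes that occur in only one file are dropped.
    LC_ALL=C join "$tmpdir/a" "$tmpdir/b"
    rm -rf "$tmpdir"
}
```

It could be called as `join_counts horn.txt degseq.txt > overlap.txt`, where both input names are made up for illustration.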



    • #3
      Hi, thank you. Here are the first several lines of the file:

      Gene_Horn Uninduced_Horn Gene_DEGSeq Uninduced_DEGSeq
      Tb04.24M18.150 12 Tb04.24M18.150 172
      Tb04.3I12.100 21 Tb04.3I12.100 11
      Tb05.28F8.200 97 Tb05.5K5.100 52
      Tb05.30F7.410 43 Tb05.5K5.10 19
      Tb06.3A7.270 572 Tb05.5K5.110 5
      Tb06.3A7.960 74 Tb05.5K5.120 9
      Tb07.26A24.210 100 Tb05.5K5.130 24
      Tb09.142.0320 56 Tb05.5K5.140 63
      Tb09.142.0350 201 Tb05.5K5.150 12

      There are thousands of these lines. Basically, I want a script that looks at Gene_Horn and Gene_DEGSeq, finds only the genes that are present in both columns, and puts them in the first column of the output file along with the corresponding two columns of reads (Uninduced_Horn and Uninduced_DEGSeq).



      • #4
        You mean that where column1 (Gene_Horn) and column3 (Gene_DEGSeq) are the same, print column1, column2 and column3?
        Rahul Sharma,
        Ph.D
        Frankfurt am Main, Germany



        • #5
          I mean that where column1 (Gene_Horn) and column3 (Gene_DEGSeq) are the same, print a column containing the genes that overlap (called column1), along with column2 (Uninduced_Horn), which is the reads of that gene from Horn, and column3 (Uninduced_DEGSeq), which is the reads of that gene from DEGSeq. This way, I can plot both sets of reads for each gene on a scatter plot to see how much variance there is between the two data sets.



          • #6
            OK, thanks. Please try the following Unix one-liner:
            awk '$1 ~ $3{print$1"\t"$2"\t"$4}' input.txt > output.txt
            Best,
            Rahul
            Rahul Sharma,
            Ph.D
            Frankfurt am Main, Germany



            • #7
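              Oops, sorry. Please use the following command instead. It compares the two gene names as exact strings (in awk, "\b" inside a string is a backspace character, not a word boundary, and '~' would treat the dots in a gene name as wildcards), so it should give accurate results:
              awk '$1 == $3 {print $1"\t"$2"\t"$4}' demo.txt > out.txt
              Thanks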
              Rahul Sharma,
              Ph.D
              Frankfurt am Main, Germany



              • #8
                Rahul, thanks so much, but it only gave me 3 that lined up. I checked, and the problem is that the one-liner you gave me only finds lines where the two names match on the same row. Column1 and column3 don't line up, because there are genes that are in one and not in the other. So I need a script that will look at all of column 1 and all of column 3 and give me every gene that is found in both, not just the ones that sit on the same line.
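
One possible way to do the whole-column comparison described above, sketched in awk rather than Perl to stay with the one-liners already in the thread. It assumes the file is whitespace-separated, has a single header line, and has all four fields on every row. The helper name `overlap_counts` is hypothetical. The file is read twice: the first pass remembers the DEGSeq reads for every name in column 3, and the second pass prints each column-1 gene that was seen there.

```shell
# Sketch only: report genes present in both column 1 and column 3 of one file.
# overlap_counts is a hypothetical helper name. Assumes whitespace-separated
# input with one header line and four fields on every row.
overlap_counts() {
    # The same file is given twice so that awk reads it in two passes.
    awk '
        FNR == 1 { next }                    # skip the header line in each pass
        NR == FNR { degseq[$3] = $4; next }  # pass 1: store DEGSeq reads by gene name
        $1 in degseq {                       # pass 2: Horn gene also seen in column 3
            print $1 "\t" $2 "\t" degseq[$1]
        }
    ' "$1" "$1"
}
```

For example, `overlap_counts input.txt > overlap.txt` would write tab-separated lines of gene, Horn reads, DEGSeq reads. If a gene name occurs more than once in column 3, this sketch keeps only the last count it saw.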
