  • Compare two files with Awk?

    Hello everyone,

    I have two files of ncRNAs from two different samples. I would like to compare them to each other by creating a single file that contains all the found ncRNAs in a file format such as this:

    Code:
    ncRNA     sample1     sample2
    The files are currently in the format of:

    Code:
    ncRNA     sample1
    and

    Code:
    ncRNA     sample2
    To make a file similar to this:

    Code:
    miRNA	721Es	162Es
    ath-miR173	1	-
    ath-miR1886.1	1	-
    ath-miR1886.2	3	-
    ath-miR319a	1	-
    ath-miR390a	59	15
    ath-miR396a	1	1
    ath-miR822	1	2
    ath-miR824	4	5
    ath-miR825	-	1
    ath-miR837-3p	4	-

    Any help on this would be great. A command-line awk script or something similar would be preferred.

    Thanks,
    Brandon

  • #2
    If Perl is acceptable, this should work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Expect exactly two input files: sample one, then sample two.
    @ARGV == 2 or die "Usage: $0 file_one file_two\n";
    my ($file_one, $file_two) = @ARGV;

    my $data = {};

    read_file_fill_hash($file_one, 'first',  $data);
    read_file_fill_hash($file_two, 'second', $data);

    print_data($data);

    # Print one row per ncRNA with its count in each sample.
    sub print_data {
        my $hash = shift;
        print "miRNA\t721Es\t162Es\n";
        foreach my $mirna (sort keys %{$hash}) {
            print "$mirna\t$hash->{$mirna}{first}\t$hash->{$mirna}{second}\n";
        }
    }

    # Read "name count" lines from $file into the shared hash.
    # $which is 'first' or 'second'; the other sample defaults to '-'.
    sub read_file_fill_hash {
        my ($file, $which, $reference) = @_;
        open(my $han, '<', $file) or die "Cannot open $file: $!\n";
        while (my $line = <$han>) {
            # split ' ' ignores leading whitespace and the trailing newline
            my ($mirna, $result) = split ' ', $line;
            next unless defined $result;    # skip blank or one-column lines
            if ($which eq 'first') {
                $reference->{$mirna}{first}  = $result;
                $reference->{$mirna}{second} = '-';
            } else {
                $reference->{$mirna}{first} = '-'
                    if !exists $reference->{$mirna}{first};
                $reference->{$mirna}{second} = $result;
            }
        }
        close $han;
    }
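Since the original question asked for awk, here is a rough awk equivalent of the same idea: remember the counts from the first file, then the second, and print a dash wherever a name is missing. This is only a sketch. The function name `merge_counts` and the sample data are made up for illustration, and it assumes two whitespace-separated columns with no header line.

```shell
#!/bin/sh
# merge_counts FILE1 FILE2 (hypothetical helper name)
# Prints "name<TAB>count1<TAB>count2", with "-" for a missing count.
merge_counts() {
    awk '
        BEGIN { OFS = "\t" }
        NF < 2 { next }                                  # skip blank or short lines
        FILENAME == ARGV[1] { first[$1] = $2; next }     # counts from the first file
        { second[$1] = $2 }                              # counts from the second file
        END {
            for (name in first)
                print name, first[name], ((name in second) ? second[name] : "-")
            for (name in second)
                if (!(name in first))
                    print name, "-", second[name]
        }
    ' "$1" "$2" | sort -k1,1                             # awk array order is arbitrary
}

# Small demo with made-up files (values taken from the example table above).
tmp=$(mktemp -d)
printf 'ath-miR173\t1\nath-miR390a\t59\n' > "$tmp/721Es.txt"
printf 'ath-miR390a\t15\nath-miR825\t1\n' > "$tmp/162Es.txt"
merge_counts "$tmp/721Es.txt" "$tmp/162Es.txt"
rm -r "$tmp"
```

Unlike join, this does not need sorted input, but it holds both files in memory.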

    • #3
      or you can use join
      Code:
      join file1 file2 -a1 -a2 -o 0 -o1.2 -o2.2
      pipe into sed if you really want the dash for the blanks

      Code:
      tr " " "\t" |sed "s/\t\t/\t-\t/" | sed "s/\t$/\t-/"
      Last edited by adamdeluca; 08-02-2010, 12:34 PM.
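As a possible alternative to the sed step, join can fill the gaps itself: the `-e` option substitutes a string for missing output fields when it is used together with `-o`. A minimal sketch, assuming both inputs are already sorted on the first column; the helper name `join_counts` and the sample data are illustrative only.

```shell
#!/bin/sh
# join_counts SORTED1 SORTED2 (hypothetical helper name)
# Outer join on column 1; "-e -" fills whichever count is missing.
join_counts() {
    join -a1 -a2 -e '-' -o 0,1.2,2.2 "$1" "$2" | tr ' ' '\t'
}

# Small demo with made-up, already sorted files.
tmp=$(mktemp -d)
printf 'ath-miR173\t1\nath-miR390a\t59\n' > "$tmp/721Es.txt"
printf 'ath-miR390a\t15\nath-miR825\t1\n' > "$tmp/162Es.txt"
join_counts "$tmp/721Es.txt" "$tmp/162Es.txt"
rm -r "$tmp"
```

Putting the options before the file names and writing the output list as one comma-separated `-o` argument keeps the command within what POSIX join describes, so it should not depend on GNU-specific option handling.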

      • #4
        Wow, that is cool

        • #5
          Thank you both.

          @adamdeluca:
          Code:
          join 162Es/162Es.dsap.rfam.txt 721Es/721Es.dsap.rfam.txt -a1 -a2 -o 0 -o1.2 -o2.2 > 721Es.172Es.rfam
          join: file 1 is not in sorted order
          join: file 2 is not in sorted order
          Any ideas?

          • #6
            join needs both input files sorted on the join field (here, the first column):

            Code:
            sort -k1,1 file1 > file1.sorted
            sort -k1,1 file2 > file2.sorted
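A sketch of the whole sort-then-join sequence follows. It restricts the sort key to the first field with `-k1,1` (a bare `-k1` extends the key to the end of the line) and pins `LC_ALL=C` for both tools, since sort and join must agree on collation order for the "not in sorted order" message to go away. The helper name `sort_by_key` and the file contents are made up for illustration.

```shell
#!/bin/sh
# sort_by_key FILE (hypothetical helper name)
# Sort on the first field only, using byte order so join sees the same order.
sort_by_key() {
    LC_ALL=C sort -k1,1 "$1"
}

# Small demo with made-up, deliberately unsorted files.
tmp=$(mktemp -d)
printf 'ath-miR390a\t59\nath-miR173\t1\n' > "$tmp/721Es.txt"
printf 'ath-miR825\t1\nath-miR390a\t15\n' > "$tmp/162Es.txt"
sort_by_key "$tmp/721Es.txt" > "$tmp/721Es.sorted"
sort_by_key "$tmp/162Es.txt" > "$tmp/162Es.sorted"
# Same locale as the sort step, same join options as earlier in the thread.
LC_ALL=C join -a1 -a2 -o 0,1.2,2.2 "$tmp/721Es.sorted" "$tmp/162Es.sorted"
rm -r "$tmp"
```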

            • #7
              I guess I forgot to mention that not all ncRNAs are found in both files.

              Some ncRNAs are in one file and not the other, which is what caused the '-' in the combined file. That file was created using DSAP's Comparative miRNAomics.

              Any idea how I should process the files given that problem?

              • #8
                That's fine.
                The -a1 option keeps unmatched lines from the first file, and -a2 keeps unmatched lines from the second.

                If you want dashes instead of the blank columns, use the sed commands above.

                • #9


                  That works. Thank you so much.

                  • #10
                    Adam,

                    I was wondering if you could help me out? I'm trying to do the exact same thing, but with larger files and by matching multiple columns.

                    File 1:
                    Code:
                    Chr5	1522433	1522454	721	1	+	AGGAGAAGGAACAGAATCCAA	.	-1	-1	.	-1	.	.	0
                    Chr2	1526280	1526301	721	1	-	TGCGCCGCCGCTCACCTTCTC	.	-1	-1	.	-1	.	.	0
                    Chr2	1526352	1526373	721	1	+	CGAGAGCTCGAAGACGAGGCA	.	-1	-1	.	-1	.	.	0
                    Chr4	1528147	1528168	721	6	-	AATACTACAATTTCTTCCATA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
                    Chr4	1528149	1528169	721	2	-	TACTACAATTTCTTCCATAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	20
                    Chr4	1528168	1528189	721	5	-	AAGCCCCTTCTTATATCGAGT	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
                    Chr4	1528189	1528210	721	3	-	CAACAAAACATCTCGTCCCCA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	21
                    Chr4	1528189	1528211	721	4	-	CAACAAAACATCTCGTCCCCAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	22
                    Chr4	1528191	1528211	721	2	-	ACAAAACATCTCGTCCCCAA	Chr4	1528134	1528370	.	miRNA	-	ACC="MI0002407";	20

                    File 2:
                    Code:
                    chloroplast	1375	1402	721	1	-	GCTAGTTATCCAGTTACAGAAGCGACC	.	-1	-1	.	-1	.	.	0
                    chloroplast	1376	1394	721	1	-	CTAGTTATCCAGTTACAG	.	-1	-1	.	-1	.	.	0
                    Chr2	1379	1401	721	1	+	CGACCAGGACGATGAATGGGCG	Chr2	1378	1400	ASRP	ncRNA_Carrington	+	Name=ASRP27130;Note=small	21
                    Chr2	1379	1401	721	1	+	CGACCAGGACGATGAATGGGCG	Chr2	1380	1402	ASRP	ncRNA_Carrington	+	Name=ASRP150295;Note=small	21
                    Chr2	1379	1402	721	1	+	CGACCAGGACGATGAATGGGCGA	Chr2	1380	1402	ASRP	ncRNA_Carrington	+	Name=ASRP150295;Note=small	22
                    chloroplast	1379	1404	721	1	-	GTTATCCAGTTACAGAAGCGACCCC	.	-1	-1	.	-1	.	.	0
                    These two files contain smRNA data from a sample in the first 7 columns; the last 7 columns contain annotations from different databases.

                    What I would like to do is match the first seven columns from both files and then have the last seven columns from each file added to the matching sequences.

                    So basically it would be in the format:

                    [sample smRNAs (7 columns)] [database 1 (7 columns)] [database 2 (7 columns)]

                    I've been trying to adapt the previous strategy to this problem, but thus far I've been unsuccessful.

                    Any help would be greatly appreciated. Thanks.

                    • #11
                      Code:
                      awk '{print $1"_"$2"_"$3"_"$4"_"$5"_"$6"_"$7"\t"$0}' file1
                      This concatenates the first 7 columns into a single field, giving you a key to use for join.
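One way the composite-key idea might be carried through to the final table is sketched below. Several things here are assumptions rather than part of the thread: the helper names `keyed` and `join_annotations`, the `|` key separator (it must not occur in the first 7 columns), the `NA` filler for rows found in only one file, and `EXTRA=8`, because the sample rows shown above appear to carry 8 tab-separated fields after the first 7. Rows that share the same first 7 columns within one file will be paired in every combination by join.

```shell
#!/bin/sh
TAB=$(printf '\t')
KEYCOLS=7    # columns that identify a read (from the post)
EXTRA=8      # annotation columns after the key; assumed from the sample rows

# keyed FILE: print "c1|...|c7<TAB>remaining columns", sorted on that key.
keyed() {
    awk -F '\t' -v OFS='\t' -v k="$KEYCOLS" '{
        key = $1
        for (i = 2; i <= k; i++) key = key "|" $i
        rest = $(k + 1)
        for (i = k + 2; i <= NF; i++) rest = rest OFS $i
        print key, rest
    }' "$1" | LC_ALL=C sort -t "$TAB" -k1,1
}

# join_annotations FILE1 FILE2: key columns, FILE1 annotations, FILE2 annotations.
join_annotations() {
    a=$(mktemp) && b=$(mktemp) || return 1
    keyed "$1" > "$a"
    keyed "$2" > "$b"
    # Output list: the key, then fields 2..EXTRA+1 of each keyed file.
    list=$(awk -v n="$EXTRA" 'BEGIN {
        s = "0"
        for (f = 1; f <= 2; f++)
            for (i = 2; i <= n + 1; i++) s = s "," f "." i
        print s
    }')
    LC_ALL=C join -t "$TAB" -a1 -a2 -e 'NA' -o "$list" "$a" "$b" |
        awk -F '\t' -v OFS='\t' '{ gsub(/[|]/, OFS, $1); print }'   # split the key back into columns
    rm -f "$a" "$b"
}

# Small demo with made-up, shortened rows in the same layout as the post.
tmp=$(mktemp -d)
printf 'Chr4\t10\t31\t721\t6\t-\tAAT\tChr4\t5\t40\t.\tmiRNA\t-\tACC\t21\n' > "$tmp/db1.txt"
printf 'Chr4\t10\t31\t721\t6\t-\tAAT\tChr4\t8\t30\tASRP\tncRNA\t+\tName=X\t21\n' > "$tmp/db2.txt"
join_annotations "$tmp/db1.txt" "$tmp/db2.txt"
rm -r "$tmp"
```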

                      • #12
                        Originally posted by adamdeluca
                        or you can use join
                        Code:
                        join file1 file2 -a1 -a2 -o 0 -o1.2 -o2.2
                        pipe into sed if you really want the dash for the blanks

                        Code:
                        tr " " "\t" |sed "s/\t\t/\t-\t/" | sed "s/\t$/\t-/"

                        Wonderful!
                        This is convenient for two files, but what about three or more files?

                        • #13
                          Originally posted by lix
                          Wonderful!
                          This is convenient for two files, but what about three or more files?
                          Just repeat the same process, join the output of the first command to file3 etc.

                          (((f1+f2)+f3)+f4)...
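The repeated join can be written as a loop. The sketch below assumes each input is a two-column name/count file; the helper name `join_all` is made up, and `-e '-'` is used to fill gaps instead of the sed step. The one detail that changes on each pass is the `-o` list, which has to name every column accumulated so far plus the new file's count.

```shell
#!/bin/sh
# join_all FILE... (hypothetical helper name)
# Outer-join any number of two-column files on column 1: ((f1+f2)+f3)+...
join_all() {
    acc=$(mktemp) && next=$(mktemp) || return 1
    LC_ALL=C sort -k1,1 "$1" > "$acc"
    shift
    cols=1                                   # count columns accumulated so far
    for f in "$@"; do
        # Output list: key, every accumulated column, then the new file's count.
        list=0
        i=2
        while [ "$i" -le $((cols + 1)) ]; do
            list="$list,1.$i"
            i=$((i + 1))
        done
        list="$list,2.2"
        LC_ALL=C sort -k1,1 "$f" |
            LC_ALL=C join -a1 -a2 -e '-' -o "$list" "$acc" - > "$next"
        mv "$next" "$acc"
        cols=$((cols + 1))
    done
    tr ' ' '\t' < "$acc"
    rm -f "$acc"
}

# Small demo with three made-up files.
tmp=$(mktemp -d)
printf 'ath-miR173\t1\nath-miR390a\t59\n' > "$tmp/s1.txt"
printf 'ath-miR390a\t15\nath-miR825\t1\n' > "$tmp/s2.txt"
printf 'ath-miR173\t7\nath-miR825\t2\n' > "$tmp/s3.txt"
join_all "$tmp/s1.txt" "$tmp/s2.txt" "$tmp/s3.txt"
rm -r "$tmp"
```

Each pass keeps the intermediate result sorted on the key, so only the newly added file needs sorting.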

                          • #14
                            Adam,

                            Thanks for that solution.

                            • #15
                              How to find SNPs by comparing two FASTA files using Perl code?

                              Hello,
                              I am just a beginner in Perl.
                              I have two FASTA files of different lengths.
                              I would like to align them to find differences in nucleotide position.

                              The output should look like this:
                              Total length of FASTA files
                              First reference file: 1253630 bp
                              Second file: 4523366 bp
                              If they match: the 2nd file is the same as the 1st reference file.
                              If they do not match, the output should look like this:
                              Mismatch position of base pair
                              A-C 100025
                              C-T 600045
