  • How to exclude some IDs from a file with grep or any other command

    Hi everyone,

    I have a problem, please help me out. I have one file with 5k IDs, and another file that contains IDs together with some information about each ID.
    I want to exclude those 5k IDs from the second file.
    How can I do this?

    Thank you,

    Mohit Verma

  • #2
    You could use gawk:
    (FNR==1){file++}
    (file==1){id[$1]=1} #assume id is first item on each line
    (file==2){if(!($1 in id))print} #if id in 1st file,ignore line in 2nd file,else print line
    Bill
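The three rules above are meant to be saved as a script and run with the ID list as the first argument and the annotated file as the second. A minimal sketch of one way to run them, assuming made-up sample data and a placeholder script name (exclude.awk); plain awk is used here because the rules need nothing specific to gawk:

```shell
# Hypothetical sample data; file names and contents are placeholders.
tmp=$(mktemp -d)
printf 'Pc_TC00002\nPc_TC00004\n' > "$tmp/file1"
printf 'Pc_TC00002 >protein A\nPc_TC00004 >protein B\nPc_TC00005 >protein C\n' > "$tmp/file2"

# Save the three rules as a script file (the name exclude.awk is an assumption).
cat > "$tmp/exclude.awk" <<'EOF'
(FNR==1){file++}
(file==1){id[$1]=1}
(file==2){if(!($1 in id))print}
EOF

# Order matters: the ID list first, the annotated file second.
result=$(awk -f "$tmp/exclude.awk" "$tmp/file1" "$tmp/file2")
echo "$result"
```

Only the line whose first field is missing from the ID list should be printed. Swapping the two file arguments gives a different result, because the script treats whichever file comes first as the list of IDs to drop.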



    • #3
      Hi wlangdon,

      Thanks for replying, but it isn't working on my file. It prints everything that is present in the second file.

      Thanks



      • #4
        Hi there,

        You can first join the two files on the ID column. Something like:
        Code:
        join -j your_id_column file1 file2 > file3
        And then subtract this result from the original with grep. Something like:
        Code:
        grep -F -x -v -f file3 file1
        Hope it helps!

        Pablo.



        • #5
          Hi priesgo,

          It isn't working. It gives me only the IDs that I want to exclude, and only the IDs, without the information.
          I am pasting a small portion of each file.
          file 1:
          Pc_TC00002
          Pc_TC00004
          Pc_TC51641
          Pc_TC00009
          Pc_TC51668
          Pc_TC00045
          Pc_TC51688

          file 2:
          Pc_TC00002 >gi|218187330|gb|EEC69757.1| hypothetical protein OsI_00003 [Oryza sativa Indica Group]^Agi|222617557|gb|EEE53689.1| hypothetical protein OsJ_00002 [Oryza sativa Japonica Group]
          Pc_TC00004 >gi|115433956|ref|NP_001041736.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|15128436|dbj|BAB62620.1| P0402A09.1 [Oryza sativa Japonica Group]^Agi|15408844|dbj|BAB64233.1| unknown protein [Oryza sativa Japonica Group]^Agi|88193759|dbj|BAE79749.1| unknown protein [Oryza sativa Japonica Group]^Agi|113531267|dbj|BAF03650.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|125524044|gb|EAY72158.1| hypothetical protein OsI_00006 [Oryza sativa Indica Group]^Agi|125568664|gb|EAZ10179.1| hypothetical protein OsJ_00005 [Oryza sativa Japonica Group]

          So file 1 contains Pc_TC00002 and file 2 contains this ID as well. I want to exclude from file 2 the IDs that are in file 1. I have 6k IDs in file 1 and 17k IDs in file 2, and all 6k IDs are present in file 2.

          Thank you,

          Mohit Verma



          • #6
            I modified your data a little bit to have some output.
            File1:
            Code:
            Pc_TC00002
            Pc_TC00004
            Pc_TC51641
            Pc_TC00009
            Pc_TC51668
            Pc_TC00045
            Pc_TC51688
            File2:
            Code:
            Pc_TC00002 >gi|218187330|gb|EEC69757.1| hypothetical protein OsI_00003 [Oryza sativa Indica Group]^Agi|222617557|gb|EEE53689.1| hypothetical protein OsJ_00002 [Oryza sativa Japonica Group]
            Pc_TC00004 >gi|115433956|ref|NP_001041736.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|15128436|dbj|BAB62620.1| P0402A09.1 [Oryza sativa Japonica Group]^Agi|15408844|dbj|BAB64233.1| unknown protein [Oryza sativa Japonica Group]^Agi|88193759|dbj|BAE79749.1| unknown protein [Oryza sativa Japonica Group]^Agi|113531267|dbj|BAF03650.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|125524044|gb|EAY72158.1| hypothetical protein OsI_00006 [Oryza sativa Indica Group]^Agi|125568664|gb|EAZ10179.1| hypothetical protein OsJ_00005 [Oryza sativa Japonica Group]
            Pc_TC00005 >Hello world!
            Now:
            Code:
            join -j 1 file1 file2 > file3
            And finally:
            Code:
            grep -F -x -v -f file3 file2
            There you go!
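One caveat with the join step: join expects both inputs to be sorted on the join field, and the IDs in File1 above are not in sorted order. A rough sketch of the same two steps with an explicit sort added, assuming made-up sample data and fields separated by single spaces (join writes its output with single spaces between fields, so the exact-line match with -x would miss lines that use tabs or repeated spaces):

```shell
# Hypothetical sample data, deliberately unsorted.
tmp=$(mktemp -d)
printf 'Pc_TC00002\nPc_TC51641\nPc_TC00004\n' > "$tmp/file1"
printf 'Pc_TC00005 >Hello world!\nPc_TC00002 >protein A\nPc_TC00004 >protein B\n' > "$tmp/file2"

# join expects both inputs sorted on the join field, using the same collation.
LC_ALL=C sort -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Lines of file2 whose first field also appears in file1.
LC_ALL=C join -j 1 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"

# Drop exactly those lines from file2 and keep the rest.
kept=$(grep -F -x -v -f "$tmp/file3" "$tmp/file2.sorted")
echo "$kept"
```

With this data only the Pc_TC00005 line should remain, since it is the only ID in file2 that is absent from file1.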



            • #7
              I'm wondering why the join step is needed... wouldn't grep -v -f file1 file2 be sufficient (as long as no line in file2 mentions the ID of a protein other than its own, I think)?
              Last edited by EGrassi; 01-16-2013, 04:06 AM.



              • #8
                You are right, I didn't know it would match partial lines like that. Nice!



                • #9
                  Hi priesgo,

                  Now it gives me all the IDs that are present in file 2; it isn't excluding the 6k IDs from file 2.

                  Thank you



                  • #10
                    You have the fishing rod now! Go fish!



                    • #11
                      1. the awk way

                      Originally posted by wlangdon
                      (FNR==1){file++}
                      (file==1){id[$1]=1} #assume id is first item on each line
                      (file==2){if(!($1 in id))print} #if id in 1st file,ignore line in 2nd file,else print line
                      This should work:
                      Code:
                      awk '(FNR==1){f++}(f==1){id[$1]=1}(f==2)&&!id[$1]' file1 file2
                      2. the grep way

                      Originally posted by EGrassi
                      grep -v -f file1 file2
                      I would add the -w option in case you have things like "ID1" in file1 but you do not want to remove "ID10" from file2.
                      You can also speed up the search by using "fgrep" (equivalent to "grep -F") instead of "grep", assuming these are exact patterns and not regexps.

                      Code:
                      fgrep -vwf file1 file2
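To see what -w changes, here is a small sketch using made-up IDs (ID1 and ID10 are placeholders, as are the file contents). It uses grep -F, which behaves the same as fgrep:

```shell
# Hypothetical sample data: ID1 is to be excluded, ID10 is to be kept.
tmp=$(mktemp -d)
printf 'ID1\n' > "$tmp/file1"
printf 'ID1 first protein\nID10 second protein\n' > "$tmp/file2"

# Without -w, the pattern ID1 also matches inside ID10, so both lines are dropped.
# (|| true: grep exits non-zero when it prints nothing.)
without_w=$(grep -F -v -f "$tmp/file1" "$tmp/file2" || true)

# With -w, ID1 has to match as a whole word, so the ID10 line is kept.
with_w=$(grep -F -v -w -f "$tmp/file1" "$tmp/file2" || true)

echo "without -w: [$without_w]"
echo "with -w:    [$with_w]"
```

The first result should be empty and the second should contain only the ID10 line.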
                      3. note

                      The awk command ensures you only compare the first "columns" (it works whether the separator is a space, a tab or even a variable combination of both), so that a line starting with a valid ID in file2 won't be removed if a forbidden ID is present somewhere in the description.
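The difference described in this note can be sketched with a made-up file2 in which a line to keep mentions an excluded ID in its description (the data and the tab-plus-spaces separator are assumptions for illustration):

```shell
# Hypothetical sample data: Pc_TC00002 is excluded, Pc_TC00005 should be kept,
# but the description of Pc_TC00005 happens to mention Pc_TC00002.
tmp=$(mktemp -d)
printf 'Pc_TC00002\n' > "$tmp/file1"
printf 'Pc_TC00002 >protein A\nPc_TC00005\t  >similar to Pc_TC00002\n' > "$tmp/file2"

# grep looks at the whole line, so the Pc_TC00005 line is removed as well.
grep_out=$(grep -F -v -w -f "$tmp/file1" "$tmp/file2" || true)

# awk compares only the first field, so the Pc_TC00005 line survives unchanged.
awk_out=$(awk '(FNR==1){f++}(f==1){id[$1]=1}(f==2)&&!id[$1]' "$tmp/file1" "$tmp/file2")

echo "grep: [$grep_out]"
echo "awk:  [$awk_out]"
```

Here grep should print nothing, while awk should print the Pc_TC00005 line exactly as it appears in file2, mixed whitespace included.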



                      • #12
                        Thanks syfo, it works... :-)
