
**wlangdon** · 01-16-2013, 12:37 AM

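You could use gawk. Put the following in a script file and run it as gawk -f script file1 file2:

Code:

(FNR==1){file++}
(file==1){id[$1]=1} # assume the ID is the first item on each line
(file==2){if(!($1 in id))print} # if the ID was in the 1st file, ignore the line in the 2nd file; otherwise print it
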
Bill

http://www.cs.ucl.ac.uk/staff/W.Langdon/
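
For anyone unsure how to run the script above, here is a minimal sketch. It assumes the three lines are saved to a file and passed to awk with -f, with the ID list first and the annotation file second. The script name exclude_ids.awk and the sample data are made up for illustration, and plain awk is used on the assumption that the script needs nothing gawk-specific.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Made-up sample inputs: file1 holds the IDs to exclude,
# file2 holds "ID description" lines.
printf '%s\n' 'Pc_TC00002' 'Pc_TC00004' > "$tmp/file1"
printf '%s\n' 'Pc_TC00002 >first protein' \
              'Pc_TC00004 >second protein' \
              'Pc_TC00005 >third protein' > "$tmp/file2"

# The script from the post, saved to a file (the file name is an assumption).
cat > "$tmp/exclude_ids.awk" <<'EOF'
(FNR==1){file++}                  # FNR is 1 on the first line of each input file
(file==1){id[$1]=1}               # first file: remember each ID
(file==2){if(!($1 in id))print}   # second file: print lines whose ID was not seen
EOF

# Order matters: the ID list comes first, the file to filter second.
kept=$(awk -f "$tmp/exclude_ids.awk" "$tmp/file1" "$tmp/file2")
printf '%s\n' "$kept"

rm -rf "$tmp"
```

With these inputs only the Pc_TC00005 line should be printed. Swapping the two file names on the command line would load the IDs from the wrong file, so it is worth checking the order first if the output looks wrong.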

**M.Verma** · 01-16-2013, 12:52 AM

Hi wlangdon,

Thanks for replying, but it isn't working on my file. It's printing everything that is present in the second file.

thanks

**priesgo** · 01-16-2013, 01:56 AM

Hi there,

You can first join the two files on the ID column. Something like:

Code:

join -j your_id_column file1 file2 > file3

And then subtract this result from the original with grep. Something like:

Code:

grep -F -x -v -f file3 file1

Hope it helps!

Pablo.

**M.Verma** · 01-16-2013, 03:44 AM

Hi priesgo,

It isn't working. It gives me only the IDs that I want to exclude, and nothing but the IDs.
I am pasting a small portion of the files.
file 1:
Pc_TC00002
Pc_TC00004
Pc_TC51641
Pc_TC00009
Pc_TC51668
Pc_TC00045
Pc_TC51688

file 2:
Pc_TC00002 >gi|218187330|gb|EEC69757.1| hypothetical protein OsI_00003 [Oryza sativa Indica Group]^Agi|222617557|gb|EEE53689.1| hypothetical protein OsJ_00002 [Oryza sativa Japonica Group]
Pc_TC00004 >gi|115433956|ref|NP_001041736.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|15128436|dbj|BAB62620.1| P0402A09.1 [Oryza sativa Japonica Group]^Agi|15408844|dbj|BAB64233.1| unknown protein [Oryza sativa Japonica Group]^Agi|88193759|dbj|BAE79749.1| unknown protein [Oryza sativa Japonica Group]^Agi|113531267|dbj|BAF03650.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|125524044|gb|EAY72158.1| hypothetical protein OsI_00006 [Oryza sativa Indica Group]^Agi|125568664|gb|EAZ10179.1| hypothetical protein OsJ_00005 [Oryza sativa Japonica Group]

So file 1 contains Pc_TC00002 and file 2 also contains this ID. I want to exclude from file 2 the IDs that are in file 1. I have 6k IDs in file 1 and 17k IDs in file 2, and all 6k IDs are present in file 2.

thank you

Mohit Verma

**priesgo** · 01-16-2013, 03:58 AM

I modified your data a little bit to have some output.
File1:

Code:

Pc_TC00002
Pc_TC00004
Pc_TC51641
Pc_TC00009
Pc_TC51668
Pc_TC00045
Pc_TC51688

File2:

Code:

Pc_TC00002 >gi|218187330|gb|EEC69757.1| hypothetical protein OsI_00003 [Oryza sativa Indica Group]^Agi|222617557|gb|EEE53689.1| hypothetical protein OsJ_00002 [Oryza sativa Japonica Group]
Pc_TC00004 >gi|115433956|ref|NP_001041736.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|15128436|dbj|BAB62620.1| P0402A09.1 [Oryza sativa Japonica Group]^Agi|15408844|dbj|BAB64233.1| unknown protein [Oryza sativa Japonica Group]^Agi|88193759|dbj|BAE79749.1| unknown protein [Oryza sativa Japonica Group]^Agi|113531267|dbj|BAF03650.1| Os01g0100500 [Oryza sativa Japonica Group]^Agi|125524044|gb|EAY72158.1| hypothetical protein OsI_00006 [Oryza sativa Indica Group]^Agi|125568664|gb|EAZ10179.1| hypothetical protein OsJ_00005 [Oryza sativa Japonica Group]
Pc_TC00005 >Hello world!

Now:

Code:

join -j 1 file1 file2 > file3

And finally:

Code:

grep -F -x -v -f file3 file2

There you go!
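
A small sketch of the two steps above on made-up data, in case it helps. Two assumptions are worth flagging. First, join expects both inputs to be sorted on the join field, so the sketch sorts copies of them first. Second, join writes its output fields separated by single spaces, so the exact-line match with grep -x only works if the lines in file2 are also single-space separated. The file names under the scratch directory are hypothetical.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Made-up sample data; file1 is deliberately not sorted.
printf '%s\n' 'Pc_TC00004' 'Pc_TC00002' > "$tmp/file1"
printf '%s\n' 'Pc_TC00002 >first protein' \
              'Pc_TC00004 >second protein' \
              'Pc_TC00005 >third protein' > "$tmp/file2"

# join needs both inputs sorted on the join field (field 1 here).
LC_ALL=C sort -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# file3 = lines of file2 whose ID is also in file1.
LC_ALL=C join -j 1 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"

# Drop from file2 every line that appears verbatim in file3.
remaining=$(grep -F -x -v -f "$tmp/file3" "$tmp/file2")
printf '%s\n' "$remaining"

rm -rf "$tmp"
```

If file3 comes out empty, grep -v has nothing to exclude and prints all of file2, so looking at file3 is a reasonable first check when no lines seem to be removed.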

**EGrassi** · 01-16-2013, 04:04 AM

I'm wondering why the join step is needed. Wouldn't grep -v -f file1 file2 be sufficient (as long as no ID from file1 shows up in file2 on a line belonging to a different protein, I think)?

**priesgo** · 01-16-2013, 04:13 AM

You are right, I didn't know it would match partial lines like that. Nice!

**M.Verma** · 01-16-2013, 04:15 AM

hi priesgo,

Now it's giving me all the IDs that are present in file 2. It isn't excluding the 6k IDs from file 2.

thank you

**priesgo** · 01-16-2013, 04:21 AM

You have the fishing rod now! Just go fish!

**syfo** · 01-16-2013, 08:30 AM

1. the awk way

Originally posted by wlangdon:

(FNR==1){file++}
(file==1){id[$1]=1} #assume id is first item on each line
(file==2){if(!($1 in id))print} #if id in 1st file,ignore line in 2nd file,else print line

This should work:

Code:

awk '(FNR==1){f++}(f==1){id[$1]=1}(f==2)&&!id[$1]' file1 file2

2. the grep way

Originally posted by EGrassi:

grep -v -f file1 file2

I would add the -w option in case you have something like "ID1" in file1 but do not want to remove "ID10" from file2.
You can also speed up the search by using "fgrep" (the same as "grep -F") instead of "grep", assuming these are exact strings and not regexps.

Code:

fgrep -vwf file1 file2
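
To make the "ID1" versus "ID10" point concrete, here is a minimal sketch on made-up data. It uses grep -F, the option form of fgrep, and the two sample files are purely illustrative.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Made-up sample data: only ID1 should be excluded.
printf '%s\n' 'ID1' > "$tmp/file1"
printf '%s\n' 'ID1 >protein to drop' \
              'ID10 >protein to keep' > "$tmp/file2"

# Without -w, "ID1" also matches inside "ID10", so both lines are removed.
# grep exits non-zero when it prints nothing, hence the "|| true".
without_w=$(grep -F -v -f "$tmp/file1" "$tmp/file2" || true)

# With -w, the pattern has to match a whole word, so the ID10 line survives.
with_w=$(grep -F -v -w -f "$tmp/file1" "$tmp/file2" || true)

printf 'without -w: [%s]\n' "$without_w"
printf 'with -w:    [%s]\n' "$with_w"

rm -rf "$tmp"
```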

3. note

The awk command ensures you only compare the first column (it works whether the separator is a space, a tab, or any mix of both), so a line in file2 that starts with a valid ID won't be removed just because a forbidden ID appears somewhere in its description.
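
A short sketch of that difference, on made-up data where a line to keep happens to mention an excluded ID in its description. One line uses a tab after the ID and the other uses several spaces, to show that awk's default field splitting handles both.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Made-up sample data: Pc_TC00002 is the only ID to exclude.
printf '%s\n' 'Pc_TC00002' > "$tmp/file1"
printf 'Pc_TC00002\t>excluded protein\n' > "$tmp/file2"
printf 'Pc_TC00009   >similar to Pc_TC00002\n' >> "$tmp/file2"

# grep looks at the whole line, so the second line is dropped as well
# because its description mentions Pc_TC00002.
by_grep=$(grep -F -v -w -f "$tmp/file1" "$tmp/file2" || true)

# awk compares only the first field, so the second line is kept.
by_awk=$(awk '(FNR==1){f++}(f==1){id[$1]=1}(f==2)&&!id[$1]' "$tmp/file1" "$tmp/file2")

printf 'grep: [%s]\n' "$by_grep"
printf 'awk:  [%s]\n' "$by_awk"

rm -rf "$tmp"
```

Here grep prints nothing, while awk keeps the Pc_TC00009 line with its original spacing.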

**M.Verma** · 01-16-2013, 07:40 PM

Thanks syfo it works....:-)

How to exclude some id's from the file by grep or any other command