SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Common elements within two set of miRNAs Giorgio C Bioinformatics 1 01-05-2012 09:35 AM
Scripting help to identify adaptors in reads Giorgio C Bioinformatics 7 11-10-2011 06:01 AM
How to extract Common Genes from 2 spreadsheets byou678 Bioinformatics 4 10-04-2011 07:42 PM
Fastq sorting and common part? stoker Bioinformatics 0 07-07-2011 12:53 AM
Help please, regulatory elements johnshembb Bioinformatics 1 04-29-2010 09:57 AM

Reply
 
Thread Tools
Old 07-17-2012, 01:38 AM   #1
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Default Scripting Help - Common Elements

Hi all,

I have a question hoping someone can help me.

I have a file with two columns A and B with a list of gene coordinates in each of them:

I'd like to write in another file every element that is present in column A but not in B.
(I mean I'd like to write in output the NON common elements of the two columns).


Do you have any suggestions ( little script in perl or python) not exel cuz the list is huge.


Thanks you in advance,
Giorgio
Giorgio C is offline   Reply With Quote
Old 07-17-2012, 05:15 AM   #2
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Hi Giorgio,

See if this helps:

say you have test.txt like this (tab-separated):
Code:
chr1    chr2
chr2    chr1
chr3    chr4
chr5    chr4
You want elements in column B (2nd) not in A (1st). So the output would be "chr4" outputted twice

This python script should do it. It assumes that all the unique elements in column A can fit in memory:

Code:
python -c "
fin= open('test.txt')

col_a= set()
for line in fin:
    ## Unique elements in column A
    a= line.strip().split()[0]
    col_a.add(a)

fin.seek(0)
for line in fin:
    ## Print out elements in column B not in set A
    b= line.strip().split()[1]
    if b not in col_a:
        print(b)
fin.close()
"
Output goes to stdin so use > to send it to a file

Hope it helps and I haven't made any mistake!

Dario
dariober is offline   Reply With Quote
Old 07-17-2012, 05:31 AM   #3
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Default

Thank you so much for your answer I've tried but it give me an error:

Traceback (most recent call last):
File "script.py", line 9, in <module>
a= line.strip().split()[0]
IndexError: list index out of range

Do you know what may be the problem?
Giorgio C is offline   Reply With Quote
Old 07-17-2012, 06:00 AM   #4
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by Giorgio C View Post
Thank you so much for your answer I've tried but it give me an error:

Traceback (most recent call last):
File "script.py", line 9, in <module>
a= line.strip().split()[0]
IndexError: list index out of range

Do you know what may be the problem?
Can you post a sample of the first few lines from your input file? It could be that the first line(s) are emtpy hence the error above.

In fact, this version will skip blank lines:

Code:
python -c "
fin= open('test.txt')

col_a= set()
for line in fin:
    if line.strip() == '':
        continue
    ## Unique elements in column A
    a= line.strip().split()[0]
    col_a.add(a)

fin.seek(0)
for line in fin:
    if line.strip() == '':
        continue
    ## Print out elements in column B not in set A
    b= line.strip().split()[1]
    if b not in col_a:
        print(b)
fin.close()
"
By the way, this script pulls out elements in B not in A, but it doesn't pull out elements in A not in B. Is this ok?
dariober is offline   Reply With Quote
Old 07-17-2012, 06:53 AM   #5
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

"comm" command is useful for this stuff ..

Code:
-bash-3.00$ cat junk
chr1    chr2
chr2    chr1
chr3    chr4
chr5    chr4
-bash-3.00$ awk '{print $1}' < junk | sort | uniq > junkcol1
-bash-3.00$ awk '{print $2}' < junk | sort | uniq > junkcol2
-bash-3.00$ comm -3 junkcol1 junkcol2
chr3
        chr4
chr5
----
check out the "comm" command with "man comm" or via search engine.
Richard Finney is offline   Reply With Quote
Old 07-17-2012, 08:19 AM   #6
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Rather than write your own tool, BEDTools intersectBed can do exactly this, if you convert your lists to .bed format
swbarnes2 is offline   Reply With Quote
Old 07-18-2012, 01:04 AM   #7
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Default

Thank you Dariober,

as you said it was a problem of some empty lines in the middle of the list that I did not see at first. Now it works good and however the 'skip blank script' version works greatly.

Richard Finney thank you for your other suggestion, it works good too.
Giorgio C is offline   Reply With Quote
Old 07-18-2012, 01:52 AM   #8
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by Giorgio C View Post
Thank you Dariober,

as you said it was a problem of some empty lines in the middle of the list that I did not see at first. Now it works good and however the 'skip blank script' version works greatly.

Richard Finney thank you for your other suggestion, it works good too.
Glad to hear it worked. By the way, I realized that if your input file is tab-separated you should replace in the script "split()" with "split('\t')". The script I posted splits lines into columns at every occurance of a blank space, so including, but not restricing to, tab characters.

Good luck!
Dario
dariober is offline   Reply With Quote
Old 07-18-2012, 03:35 AM   #9
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Default

Yes infact, that was one of the thing I've noticed, in the tab separeted file needs to add (\t) to the script.

By the way it works good !

Thank you again for your precious help.

Cheers,
Giorgio
Giorgio C is offline   Reply With Quote
Old 10-05-2013, 08:58 AM   #10
LudesMeyers
Junior Member
 
Location: Austin, tx

Join Date: Aug 2013
Posts: 2
Default

This is a great discussion for a novice scriptor of Python and bash/C shells. Thank you Giorgio C for all of your questions. They provide food for my thoughts. And thanks to Richard Finney for pointing out the use of the "comm" and "awk" commands.

This code is a combo of my first thoughts on how to approach the problem with subsequent introduction of the "comm" command.

Code:
bash-3.2$ cat junk
chr1	chr2
chr2	chr1
chr3	chr4
chr5	chr4
bash-3.2$ cut -f1 junk | sort | uniq > junkcol1
bash-3.2$ cut -f2 junk | sort | uniq > junkcol2
bash-3.2$ comm -3 junkcol1 junkcol2 > commonjunk
bash-3.2$ cat commonjunk
chr3
	chr4
chr5
Thanks everyone,
John
LudesMeyers is offline   Reply With Quote
Old 10-05-2013, 10:53 AM   #11
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Smile

Sure John,

this is a very useful forum
Giorgio C is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:46 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO