View Single Post
Old 02-09-2017, 12:16 PM   #6
tomc
Member
 
Location: Oregon

Join Date: Feb 2011
Posts: 29
Default

Q. Should I use unix commands or python scripts?

A. Yes.


A less misunderstandable answer is; it all depends on your current situation.

A ready made script in any language with extra bells and whistles is nice if it does what you want and runs in your environment; but they often do not age gracefully.

Writing your own script in any language is a great skill to have and may well be critical at some point so practice is always good. But learning takes time, mistakes will be made and an entire universe of details not directly related to the biological problem at hand enter consideration; time / space trade off algorithmic complexity yada ya.

In my opinion, by default, the shell is the option to beat. In other words it pays to demonstrate that assembling a shell script is insufficient before moving on to writing your own solution in a general purpose language.

In your case you say you have blast results from A-B and B-A.
I have no idea what that means.
Blast can be output in many different formats and any sort of filtering.

But lets say you do coerce each of the two datasets you have into
a list of pairs of sequence identifiers where the second in each pair
is the best hit of the first. i.e.


<File_A_B>
seqida1 seqidbx
seqida2 seqidby
seqida3 seqidbz
...

<File_B_A>
seqidbx seqida3
seqidby seqida2
seqidbz seqida1
...


the task of finding the reciprocal best hits boils down
to finding a pair of identifiers in the first file (a row)
exactly matching a reversed pair of identifiers in the second file.

e.g.
seqida2 seqidby == reverse(seqidby seqida2)


Again I have no idea what size your lists are but assuming they fit comfortably in your machine's memory. one approach may be to:

1) sort the first file because when a list is sorted you can be certain a match could not be behind where you are in the list

sort -u File_B_A > File_A_B_sorted

2) reverse the second list

awk '{print $2,$1}' File_B_A > File_B_A_rev

3) sort the second file

sort -u File_B_A_rev > File_B_A_rev_sorted

4) see what the files have in common

comm -12 File_A_B_sorted File_B_A_rev_sorted > File_AB_RBH

The list produces would be the set of RBHs according to your blast.

Hopefully that gives you a rough outline

edit: add in the `-u` (unique) on sort to remove duplicate from each list

Last edited by tomc; 02-09-2017 at 12:56 PM. Reason: splain stuff
tomc is offline   Reply With Quote