Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • reciprocal blast

    How to do reciprocal blast for two datasets?

  • #2
    for two datasets A & B
    where records a are in dataset A and records b are in dataset B

    Blast query a against target dataset B to obtain best hit b'
    Blast query b' against target dataset A to obtain best hit a'

    if a == a' then b' is the reciprocal best hit of a (rbh)
    otherwise a and b' are not reciprocal best hits

    this is done for all a in A.
    (and possible for any b in B not a rbh of something in A)

    sorry if this too abstract. but the question is not so specific.
    Last edited by tomc; 04-27-2012, 02:23 PM. Reason: typo

    Comment


    • #3
      Hi,
      firstly i'm sorry for reviving the thread but i have a more specific question on the subject.

      Further i will use the notation used by TOMC in previous post.

      So i blast query a against set B and get the best hit b'. But then should i blast with b' or the full sequence of b'? I ask cos i found cases where if i blast with b' then a == a' but if blast with the full seq of b' then a != a' (they don't equal). Logically one should blast with the full seq of b' at least in my opinion but im not sure.

      Any help is appreciated.
      Thank you

      Comment


      • #4
        Hi kenietz

        Since each data set can be one or many sequences of any lenght,
        the question remains less than fully specified,
        but under the assumption that both datasets are composed of many shorter sequences each which represent some logical unit (transcript, illumima read) then yes
        you blast the entire representation of b identified by the hit b' against dataset A.

        Under different assumptions this could make no sense and you may need to decide what logical unit containing b' to blast against A

        Comment


        • #5
          And how can I do this comparison?

          I mean I already have the output results from blast (axb and bxa). So how can I compare them and get the matches between the 2 files?

          Should I use unix commands or python scripts?

          Comment


          • #6
            Q. Should I use unix commands or python scripts?

            A. Yes.


            A less misunderstandable answer is; it all depends on your current situation.

            A ready made script in any language with extra bells and whistles is nice if it does what you want and runs in your environment; but they often do not age gracefully.

            Writing your own script in any language is a great skill to have and may well be critical at some point so practice is always good. But learning takes time, mistakes will be made and an entire universe of details not directly related to the biological problem at hand enter consideration; time / space trade off algorithmic complexity yada ya.

            In my opinion, by default, the shell is the option to beat. In other words it pays to demonstrate that assembling a shell script is insufficient before moving on to writing your own solution in a general purpose language.

            In your case you say you have blast results from A-B and B-A.
            I have no idea what that means.
            Blast can be output in many different formats and any sort of filtering.

            But lets say you do coerce each of the two datasets you have into
            a list of pairs of sequence identifiers where the second in each pair
            is the best hit of the first. i.e.


            <File_A_B>
            seqida1 seqidbx
            seqida2 seqidby
            seqida3 seqidbz
            ...

            <File_B_A>
            seqidbx seqida3
            seqidby seqida2
            seqidbz seqida1
            ...


            the task of finding the reciprocal best hits boils down
            to finding a pair of identifiers in the first file (a row)
            exactly matching a reversed pair of identifiers in the second file.

            e.g.
            seqida2 seqidby == reverse(seqidby seqida2)


            Again I have no idea what size your lists are but assuming they fit comfortably in your machine's memory. one approach may be to:

            1) sort the first file because when a list is sorted you can be certain a match could not be behind where you are in the list

            sort -u File_B_A > File_A_B_sorted

            2) reverse the second list

            awk '{print $2,$1}' File_B_A > File_B_A_rev

            3) sort the second file

            sort -u File_B_A_rev > File_B_A_rev_sorted

            4) see what the files have in common

            comm -12 File_A_B_sorted File_B_A_rev_sorted > File_AB_RBH

            The list produces would be the set of RBHs according to your blast.

            Hopefully that gives you a rough outline

            edit: add in the `-u` (unique) on sort to remove duplicate from each list
            Last edited by tomc; 02-09-2017, 01:56 PM. Reason: splain stuff

            Comment


            • #7
              Thanks tomc for your detailed explanation and time to answer it.

              This was exactly what I did. I wrote a shell script in different steps. I was not able to write a python script because I am learning...
              I found a few python scripts to do it but did not work for me.

              But my shell script is working well. In the last step I got the duplicates of my combined file instead of the common ones.

              Comment


              • #8
                Here's a BLAST RBH tool I wrote earlier in Python, which does consider duplicates and gives warnings about them:


                It has a Galaxy wrapper but you can ignore that, other than perhaps reading the help text included it it - which suggests thinking about setting minimum identity and minimum alignment lengths and reading this paper:

                Punta and Ofran (2008) The Rough Guide to In Silico Function Prediction,
                or How To Use Sequence and Structure Information To Predict Protein
                Function. PLoS Comput Biol 4(10): e1000160.
                Protein structure prediction,Protein structure,Protein structure comparison,Protein structure databases,Sequence motif analysis,Structural genomics,Protein domains,Sequence alignment

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Advancing Precision Medicine for Rare Diseases in Children
                  by seqadmin




                  Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                  12-16-2024, 07:57 AM
                • seqadmin
                  Recent Advances in Sequencing Technologies
                  by seqadmin



                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                  Long-Read Sequencing
                  Long-read sequencing has seen remarkable advancements,...
                  12-02-2024, 01:49 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 12-17-2024, 10:28 AM
                0 responses
                33 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-13-2024, 08:24 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-12-2024, 07:41 AM
                0 responses
                34 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-11-2024, 07:45 AM
                0 responses
                46 views
                0 likes
                Last Post seqadmin  
                Working...
                X