  • Reciprocal blast help

    I have 2 datasets (Dataset A and Dataset B) that I have BLASTx-ed reciprocally against one another. However, I am having difficulty identifying the contigs that are top hits of one another in both datasets. I have been able to export the BLAST results into sequence tables, but I'm not sure how to identify the reciprocal top hits for a large number of contigs (160,000). Here is an example of how it looks in an Excel spreadsheet:

    Dataset A                Dataset B
    Contig123=Contig789_1    Contig789=111Contig123_1
    Contig456=72Contig221    Contig221=Contig456_3
    Contig777=43Contig954    Contig954=3Contig1561_1


    In the example above you can see that the hit in each file has extra characters at the beginning, and sometimes at the end, of the corresponding contig name, making it hard to compare using Excel formulas. The first two rows are the ones I'm interested in extracting, as the contigs hit each other in both datasets, unlike row 3, which does not match.

    Any help would be greatly appreciated!
    Last edited by Shorash; 07-15-2014, 08:40 PM.
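
For the spreadsheet layout shown above, a minimal Python sketch of one possible approach follows. It assumes every contig ID has the form `Contig` followed by digits, that each cell is `query=hit`, and that the extra characters are only ever digits before the name or a `_N` suffix after it. The names `CONTIG_RE`, `parse_pairs` and `reciprocal_pairs` are made up for illustration.

```python
import re

# Assumption: a contig ID is the literal "Contig" followed by digits.
CONTIG_RE = re.compile(r"Contig\d+")

def parse_pairs(lines):
    """Map each query contig to the contig ID embedded in its hit string."""
    hits = {}
    for line in lines:
        query, _, hit = line.strip().partition("=")
        match = CONTIG_RE.search(hit)
        if query and match:
            # Keep only the first hit seen for a query (assumed to be the top hit).
            hits.setdefault(query, match.group(0))
    return hits

def reciprocal_pairs(a_lines, b_lines):
    """Return (A contig, B contig) pairs that hit each other."""
    a_hits = parse_pairs(a_lines)
    b_hits = parse_pairs(b_lines)
    return sorted((a, b) for a, b in a_hits.items() if b_hits.get(b) == a)

# The three example rows from the post, one list per spreadsheet column.
dataset_a = ["Contig123=Contig789_1", "Contig456=72Contig221", "Contig777=43Contig954"]
dataset_b = ["Contig789=111Contig123_1", "Contig221=Contig456_3", "Contig954=3Contig1561_1"]

pairs = reciprocal_pairs(dataset_a, dataset_b)
print(pairs)  # [('Contig123', 'Contig789'), ('Contig456', 'Contig221')]
```

Dictionary lookups keep this fast enough for 160,000 contigs. If the real IDs follow a different pattern, the regular expression is the only part that should need changing.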

  • #2
    What have you come up with so far? Do you have any scripts to use or code that you have tried to create?

    Some resources


    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc



    • #3
      Not really a direct answer to your question, but a tip for a tool that probably already does what you want:
      I'm mostly using the tool proteinortho (current version proteinortho5) for reciprocal BLAST analyses.
      A strong advantage of this tool is that it not only detects orthologs via direct reciprocal BLAST, but can also list the respective paralogs and group them into "orthologous groups".
      links:

      Background Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practise. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. Results The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Conclusions Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.



      • #4
        Originally posted by bio_boris
        What have you come up with so far? Do you have any scripts to use or code that you have tried to create?

        Some resources


        http://seqanswers.com/forums/showthread.php?t=20652
        I haven't managed to create any scripts or code. I've been manually looking at specific genes of interest, but it would be great to be able to do all of them at once.



        • #5
          Originally posted by someperson
          Not really a direct answer to your question, but a tip for a tool that probably already does what you want:
          I'm mostly using the tool proteinortho (current version proteinortho5) for reciprocal BLAST analyses.
          A strong advantage of this tool is that it not only detects orthologs via direct reciprocal BLAST, but can also list the respective paralogs and group them into "orthologous groups".
          links:

          http://www.biomedcentral.com/1471-2105/12/124
          Great, thanks for that. I'll give this a try.



          • #6
            Originally posted by Shorash
            how I can identify the reciprocal top hits of one another for a large number of contigs (160,000).
            I would go for tabular BLAST output and first sort for best hits. Then, depending on how you did your BLASTs, you can have e.g. two best-hit sorted output files with the query in the first column and the subject in the second. One option would be to cut columns 1-2, switch the column order in one file, and then cat it with the other file. Then you'd sort based on column 1 and only output the lines where uniq -c is 2. I'm sure there's an awk one-liner for this too.
            savetherhino.org
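
The cut / swap / cat / count idea from post #6 can be sketched in Python as follows. This is only an illustration under stated assumptions: both searches were written as tab-separated tabular output with the default 12 columns (query in column 1, subject in column 2, bit score in column 12), and "best hit" is taken to mean the highest bit score per query. The function names and the made-up demo rows are hypothetical.

```python
import csv
from collections import Counter

def read_tabular(path):
    """Read a tab-separated BLAST table into a list of rows."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))

def best_hits(rows):
    """Keep, for each query, the subject with the highest bit score."""
    best = {}
    for row in rows:
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b_rows, b_vs_a_rows):
    """Return (A, B) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b_rows)
    b_best = best_hits(b_vs_a_rows)
    # Count A->B pairs, then add the B->A pairs with the columns swapped.
    counts = Counter(a_best.items())
    counts.update((subject, query) for query, subject in b_best.items())
    # A pair seen twice is reciprocal (the "uniq -c is 2" step).
    return sorted(pair for pair, n in counts.items() if n == 2)

def demo_row(query, subject, bitscore):
    """Build a 12-column row with placeholder values in the unused columns."""
    return [query, subject] + ["0"] * 9 + [str(bitscore)]

# Made-up rows purely for demonstration.
a_vs_b = [demo_row("A1", "B1", 200), demo_row("A1", "B2", 150), demo_row("A2", "B2", 90)]
b_vs_a = [demo_row("B1", "A1", 210), demo_row("B2", "A1", 140)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('A1', 'B1')]
```

With real data, the two row lists would come from `read_tabular` called on the two result files. Ties in bit score are resolved here by keeping the first row seen, which may or may not be what you want.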
