  • Reciprocal blast help

    I have 2 datasets (Dataset A and Dataset B) that I have BLASTed reciprocally against one another using BLASTx. However, I am having difficulty identifying the contigs that hit one another in both datasets. I have been able to export the BLAST results into sequence tables, but I'm not sure how to identify the reciprocal top hits for a large number of contigs (160,000). Here is an example of how it looks in an Excel spreadsheet:

    Dataset A                Dataset B
    Contig123=Contig789_1    Contig789=111Contig123_1
    Contig456=72Contig221    Contig221=Contig456_3
    Contig777=43Contig954    Contig954=3Contig1561_1


    In the example above you can see that the hits from each file have extra characters at the beginning, and sometimes at the end, of each corresponding hit, which makes them hard to compare using Excel formulas. In the example, the first two rows are the ones I'm interested in extracting, as they have hit the same contig in both datasets, unlike row 3, which does not match.

    Any help would be greatly appreciated!
    Last edited by Shorash; 07-15-2014, 08:40 PM.
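
The extra characters described in the first post can be stripped with a regular expression before the two columns are compared. The following is only a minimal sketch: it assumes every contig identifier has the form `Contig` followed by digits, and that each spreadsheet cell is a `query=hit` string as in the example. The function names `parse_cell` and `reciprocal_hits` are made up for illustration.

```python
import re

# Assumption: a contig ID is the literal text "Contig" followed by digits.
CONTIG_ID = re.compile(r"Contig\d+")


def parse_cell(cell):
    """Turn a 'query=hit' cell into a (query_id, hit_id) tuple.

    Extra characters around the IDs (e.g. '111' or '_1') are dropped.
    Returns None if the cell does not contain two recognisable IDs.
    """
    query, sep, hit = cell.partition("=")
    query_match = CONTIG_ID.search(query)
    hit_match = CONTIG_ID.search(hit)
    if not sep or query_match is None or hit_match is None:
        return None
    return query_match.group(), hit_match.group()


def reciprocal_hits(cells_a, cells_b):
    """Return (a_contig, b_contig) pairs that hit each other in both columns."""
    hits_a = dict(p for p in map(parse_cell, cells_a) if p)
    hits_b = dict(p for p in map(parse_cell, cells_b) if p)
    return [(q, h) for q, h in hits_a.items() if hits_b.get(h) == q]


# The three example rows from the first post.
dataset_a = ["Contig123=Contig789_1", "Contig456=72Contig221", "Contig777=43Contig954"]
dataset_b = ["Contig789=111Contig123_1", "Contig221=Contig456_3", "Contig954=3Contig1561_1"]

pairs = reciprocal_hits(dataset_a, dataset_b)
for a_contig, b_contig in pairs:
    print(a_contig, b_contig)
```

For the example rows this keeps rows 1 and 2 and drops row 3, because Contig954 hits Contig1561 rather than Contig777. If the real identifiers follow a different pattern, the regular expression would need to be adjusted.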

  • #2
    What have you come up with so far? Do you have any scripts to use or code that you have tried to create?

    Some resources


    http://seqanswers.com/forums/showthread.php?t=20652


    • #3
      Not really a direct answer to your question, but a tip for a tool that probably already does what you want:
      I mostly use the tool Proteinortho (current version: proteinortho5) for reciprocal BLAST analyses.
      A strong advantage of this tool is that it not only detects orthologs via direct reciprocal BLAST, but can also list the respective paralogs and group them into "orthologous groups".
      links:

      Background Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practise. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. Results The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Conclusions Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.


      • #4
        Originally posted by bio_boris View Post
        What have you come up with so far? Do you have any scripts to use or code that you have tried to create?

        Some resources


        http://seqanswers.com/forums/showthread.php?t=20652
        I haven't managed to create any scripts or code. I've been manually looking at specific genes of interest, but it would be great to be able to do all of them at once.


        • #5
          Originally posted by someperson View Post
          Not really a direct answer to your question, but a tip for a tool that probably already does what you want:
          I mostly use the tool Proteinortho (current version: proteinortho5) for reciprocal BLAST analyses.
          A strong advantage of this tool is that it not only detects orthologs via direct reciprocal BLAST, but can also list the respective paralogs and group them into "orthologous groups".
          links:

          http://www.biomedcentral.com/1471-2105/12/124
          Great, thanks for that. I'll give this a try.


          • #6
            Originally posted by Shorash View Post
            how I can identify the reciprocal top hits of one another for a large number of contigs (160,000).
            I would go for tabular BLAST output and first sort for best hits. Then, depending on how you did your BLASTs, you can have e.g. two best-hit-sorted output files with the query in the first column and the subject in the second. One option would be to cut columns 1-2, switch the order in one file, and then cat it with the other file. Then you'd sort on column 1 and only output the lines where the uniq -c count is 2. I'm sure there's an awk one-liner for this too.
            savetherhino.org
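
The approach in post #6 (keep the best hit per query, swap the columns of one file, combine, and keep pairs seen twice) can be sketched in Python instead of shell tools. This is a rough illustration, not a tested pipeline: it assumes standard 12-column tabular BLAST output (`-outfmt 6`) with the query in column 1, the subject in column 2 and the bit score in column 12, and it assumes the IDs in both files are already directly comparable. "Best hit" is taken here to mean the highest bit score, which is one possible choice. The function names and the file names in the comments are hypothetical.

```python
import csv
from collections import Counter


def read_tabular(path):
    """Read a tab-separated BLAST file into a list of rows, skipping comments."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [row for row in reader if row and not row[0].startswith("#")]


def best_hits(rows):
    """Return {query: subject}, keeping the highest bit score per query."""
    best = {}
    for row in rows:
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(rows_a_vs_b, rows_b_vs_a):
    """Return sorted (a_id, b_id) pairs that are each other's best hit."""
    pairs_ab = list(best_hits(rows_a_vs_b).items())
    # Swap the columns of the second search so both lists read (a_id, b_id).
    pairs_ba = [(subj, query) for query, subj in best_hits(rows_b_vs_a).items()]
    # Equivalent of "cat | sort | uniq -c" and keeping the lines counted twice.
    counts = Counter(pairs_ab + pairs_ba)
    return sorted(pair for pair, n in counts.items() if n == 2)


# Small in-memory example; real data would come from e.g.
# read_tabular("a_vs_b.tsv") and read_tabular("b_vs_a.tsv") (hypothetical names).
a_vs_b = [
    ["Contig123", "Contig789", "98.0", "100", "2", "0", "1", "100", "1", "100", "1e-50", "200"],
    ["Contig123", "Contig221", "80.0", "100", "20", "0", "1", "100", "1", "100", "1e-10", "60"],
    ["Contig777", "Contig954", "95.0", "100", "5", "0", "1", "100", "1", "100", "1e-40", "150"],
]
b_vs_a = [
    ["Contig789", "Contig123", "98.0", "100", "2", "0", "1", "100", "1", "100", "1e-50", "200"],
    ["Contig954", "Contig1561", "97.0", "100", "3", "0", "1", "100", "1", "100", "1e-45", "180"],
    ["Contig954", "Contig777", "95.0", "100", "5", "0", "1", "100", "1", "100", "1e-40", "150"],
]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
for a_id, b_id in rbh:
    print(a_id, b_id, sep="\t")
```

In this toy input only Contig123 and Contig789 are reported, because the best hit of Contig954 is Contig1561 rather than Contig777. Holding one best hit per query in a dictionary should be manageable for 160,000 contigs on ordinary hardware.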
