Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reciproca blast using Python

    HI
    I am running local Blast (version 2.2.6) on a mac machine, and I want to find the homologs between two protein database(named A and B) by reciprocal blast methods under python command lines.
    I have succeeded in blast A against B, and B against A, which resulted in two csv files which contained the best hits from each blast. Next I would like to use these two files to do Reciprocal blast, namely to find the homologs between A and B, according to the blast e-value and bit scores.

    The codes for reciprocal blast was downloaded from http://ged.msu.edu/angus/tutorials/r...cal-blast.html
    and it is like this
    ******************************************************************************************************

    #! /usr/bin/env python
    import sys, csv

    # number below which to discard results
    E_CUTOFF = 1e-3

    def load_csv_to_dict(filename):
    """
    Load the CSV file into a dictionary, tying query sequences to subject
    sequences.
    """
    d = {}

    for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
    # fix the names that start with a 'gi|XXXXXXX|'
    query_name = demangle_name(query_name)
    subject_name = demangle_name(subject_name)

    # convert the e-value into a floating point number
    expect = float(expect)
    if query_name not in d and expect < E_CUTOFF:
    # if we don't already have an entry for this, put it in.
    d[query_name] = subject_name

    return d

    def demangle_name(name):
    """
    This functions strips off the 'gi|XXXXX|' name encoding that NCBI uses.

    Note that NCBI does this automatically for subject sequences.
    """
    if name.startswith('gi|'):
    name = name.split('|')
    name = name[2:]
    name = "|".join(name)

    return name

    ###

    # This is the code that's run when you actually type 'find-reciprocal.py'
    # at the command line; the above are function definitions, that define
    # reusable blocks or chunks of code.

    in_file_1 = sys.argv[1]
    in_file_2 = sys.argv[2]

    d1 = load_csv_to_dict(in_file_1)
    d2 = load_csv_to_dict(in_file_2)

    output = csv.writer(sys.stdout)

    for seqname in d1:
    seqmatch1 = d1[seqname]
    seqmatch2 = d2.get(seqmatch1)

    if seqmatch2 == seqname:
    output.writerow([seqname, seqmatch1])

    *******************************************************************************************************
    However, when I used this command, an error came up like

    *******************************************************************************************************

    Traceback (most recent call last):
    File "ngs-course/blast/find-reciprocal.py", line 49, in <module>
    d1 = load_csv_to_dict(in_file_1)
    File "ngs-course/blast/find-reciprocal.py", line 14, in load_csv_to_dict
    for (query_name, subject_name, score,expect) in csv.reader(open(filename)):
    _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

    *******************************************************************************************************

    Would anyone please help me to modify the command lines to analyse the data? It would be a great appreciation for discussion here. Thanks!
    3
    Always
    66.67%
    2
    Sometimes
    33.33%
    1
    Never
    0.00%
    0

    The poll is expired.

    Last edited by Tsuyoshi; 09-07-2012, 03:53 AM.

  • #2
    Hello,

    I don't have a file to test this but looking at the Pyhton documentation for open() (http://docs.python.org/library/functions.html#open) and at this post found by Google (http://selfsolved.com/problems/pytho...l-newline-mode), you might fix your problem by replacing csv.reader(open(filename)) with csv.reader(open(filename, "rU")) in function load_csv_to_dict(filename). That is, the new function would be:

    Code:
    def load_csv_to_dict(filename):
        """
        Load the CSV file into a dictionary, tying query sequences to subject
        sequences.
        """
        d = {}
    
        for (query_name, subject_name, score,expect) in csv.reader(open(filename, "rU")):
            # fix the names that start with a 'gi|XXXXXXX|'
            query_name = demangle_name(query_name)
            subject_name = demangle_name(subject_name)
    
            # convert the e-value into a floating point number
            expect = float(expect)
            if query_name not in d and expect < E_CUTOFF:
                # if we don't already have an entry for this, put it in.
                d[query_name] = subject_name
    
        return d
    Post a sample file to reproduce the error would help if you can.

    Best
    Dario

    Comment


    • #3
      Dear Dario,
      Thank you for your kind answer!
      Now it works well!
      Thanks!

      Comment


      • #4
        Dear Dario,
        Thank you so much!
        It works!

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Essential Discoveries and Tools in Epitranscriptomics
          by seqadmin


          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
          Yesterday, 07:01 AM
        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        39 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        41 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        35 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        55 views
        0 likes
        Last Post seqadmin  
        Working...
        X