  • python script running slowly, can't figure out why

    Hey all,

    If there is a better forum for this type of question, let me know; this is the only one I frequent currently but I can expand if necessary.

    Anyway, the question itself is basic. To illustrate the problem, here's a short script that reads in two 3-column BED files (both sorted), checks whether any of their lines match, and prints a progress counter to the screen at regular intervals. Every row in the sample file is present in the data file, but the data file contains extra lines:

    Code:
    #!/usr/bin/env python
    
    import csv
    import sys
    
    datafile = open(sys.argv[1], "rb")
    datareader = csv.reader(datafile)
    data = []
    
    for row in datareader:
        data.append(row)
    
    samplefile = open(sys.argv[2], "rb")
    samplereader = csv.reader(samplefile)
    sample = []
    
    for row in samplereader:
        sample.append(row)
    
    begin = 0
    temp = 0
    for i in range(0,len(sample)):
            for j in range(begin,len(data)):
                    temp = temp + 1
                    if ((temp%100) == 0):
                            print("temp: ", temp)
                    if (sample[i][1] == data[j][1]):
                            begin=(j+1)
                            break
    The code works, but it runs slowly. On my personal laptop (Windows), it is plenty fast: with files of 800,000+ and 280,000+ lines respectively, it completes in <10 seconds. On my work computer (Linux), though, it takes >30 minutes. The files are read in very quickly, so the problem lies after the for loop begins. Importantly, when the data file (sorted_bed.1) is smaller, the script runs much faster (judging by the temp output), but that is not the case with the sorted_bed.2 file (I haven't checked this on my Windows laptop, only on the work computer).

    Any ideas would be welcome.

  • #2
    I wrote an equivalent program in C++ and it works in <5 seconds on my work computer, lol. Still have no idea what's going wrong with the python script. Sometimes I hate computers.

    • #3
      Hi Heisman,

      When you are not sure what is causing the slowdown, use a profiler:
      Code:
      python -mcProfile <yourScript.py>
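The same profiler can also be driven from inside a script, which helps when only one function is under suspicion. The following is a minimal sketch, assuming Python 3; `slow_scan` and `profile_call` are hypothetical names, and `slow_scan` is only a stand-in for the nested loop in the question, not the original script.

```python
import cProfile
import io
import pstats


def slow_scan(sample, data):
    """Hypothetical stand-in for the nested loop in the question."""
    begin = 0
    matches = 0
    for s in sample:
        for j in range(begin, len(data)):
            if s == data[j]:
                begin = j + 1
                matches += 1
                break
    return matches


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    data = list(range(2000))
    sample = list(range(0, 2000, 3))
    result, report = profile_call(slow_scan, sample, data)
    print("matches:", result)
    print(report)
```

Sorting by cumulative time tends to surface the function that dominates the run; the per-call columns then show whether the cost comes from many cheap calls or a few expensive ones.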
      Anyhow, I will venture a few guesses here. I don't know which version of Python you are using, but the first thing that stands out is your use of range. If you are using Python 2.x, you probably want xrange instead, since range allocates a new list every time the inner loop starts. The modulo operation can be expensive too. Another thing that comes to mind is that you are not using slicing to move around the array. With slicing, this is how the top of your loop would look:

      Code:
      dataIndex = range(len(data))
      for i in range(0,len(sample)):
              for j in dataIndex[begin:]:
                      temp = temp + 1
                      if ((temp%100) == 0):
                              print("temp: ", temp)
                      if (sample[i][1] == data[j][1]):
                              begin=(j+1)
                              break
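One way to sidestep the per-iteration list allocation altogether is to carry a plain integer index through the data list, since in Python 2 both `range(begin, len(data))` and a slice such as `dataIndex[begin:]` build a new list each time the inner loop starts. Below is a minimal sketch of that idea; `count_matches` is a hypothetical helper, and it assumes the keys to compare have already been pulled out of each row (column 1 in the question's script).

```python
def count_matches(sample_keys, data_keys):
    """Count sample keys found in data_keys, resuming after the last match.

    Mirrors the nested loop in the question: the search for each sample key
    starts just past the previous match, and a key that is not found leaves
    the starting position unchanged.
    """
    matches = 0
    begin = 0
    n = len(data_keys)
    for key in sample_keys:
        j = begin
        # Advance a plain index; no list of indices is ever created.
        while j < n and data_keys[j] != key:
            j += 1
        if j < n:
            matches += 1
            begin = j + 1
    return matches


if __name__ == "__main__":
    data_keys = ["1", "5", "7", "9", "11"]
    sample_keys = ["5", "9"]
    print(count_matches(sample_keys, data_keys))
```

When every sample key is present in the data and both lists are in the same order, each data element is visited at most once, so the scan is linear in the combined length of the two lists.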
      There are many things that can be done, but I am not 100% sure what you are trying to do. If you are trying to intersect two BED files, I would store one of them (preferably the smaller one) in a dictionary, then check whether each line of the second file is present in it while that file is being read. A quick way to implement this:
      Code:
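      import csv
      import sys
      from collections import OrderedDict
      
      # "rb" suits the csv module on Python 2; on Python 3 use open(path, newline="")
      datafile = open(sys.argv[1], "rb")
      datareader = csv.reader(datafile)
      data = OrderedDict()
      
      # Map each row to its line number; rows are lists, so convert them to hashable tuples
      for lineNo,row in enumerate(datareader, 1):
          data[tuple(row)]=lineNo
      
      samplefile = open(sys.argv[2], "rb")
      samplereader = csv.reader(samplefile)
      
      # Report every sample row that also occurs in the data file
      for lineNo,row in enumerate(samplereader, 1):
          key = tuple(row)
          if key in data:
              print('MATCH LINE %s:%d == %s:%d'%(sys.argv[2],lineNo, sys.argv[1],data[key]))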
      I have not tested this, but this is the general idea.
      Last edited by fpr; 08-04-2013, 06:40 PM. Reason: typo missing code tags

      • #4
        Thank you, I'm a new python user. School is starting up for me tomorrow so it'll be a bit before I can look at this in more detail but I will do so within a week. Thank you very much for the feedback.

        • #5
          Originally posted by Heisman
          Thank you, I'm a new python user. School is starting up for me tomorrow so it'll be a bit before I can look at this in more detail but I will do so within a week. Thank you very much for the feedback.
          No problem! BTW, my code is just for illustration. Usually I would build a key from selected columns of each row to use with the dictionary (instead of the whole row), e.g. chrom+pos. It all depends on the problem.

          Python is a cool language; explore dictionaries and sets, they are really useful.
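As a rough illustration of the chrom+pos key idea with a set, here is a minimal sketch assuming Python 3 and tab-delimited BED rows whose first two columns are chromosome and start position. The helper names `bed_keys` and `matching_rows` are hypothetical, and the delimiter is an assumption (the script in the question uses the csv default).

```python
import csv


def bed_keys(lines, delimiter="\t"):
    """Return a set of (chrom, start) keys taken from BED-like rows."""
    reader = csv.reader(lines, delimiter=delimiter)
    return {(row[0], row[1]) for row in reader if len(row) >= 2}


def matching_rows(lines, keys, delimiter="\t"):
    """Yield (line number, row) for every row whose key is in keys."""
    reader = csv.reader(lines, delimiter=delimiter)
    for line_no, row in enumerate(reader, 1):
        if len(row) >= 2 and (row[0], row[1]) in keys:
            yield line_no, row


if __name__ == "__main__":
    # Small in-memory stand-ins for the two files; open file objects work too.
    sample = ["chr1\t100\t200", "chr2\t50\t80"]
    data = ["chr1\t10\t20", "chr1\t100\t200", "chr2\t50\t80", "chr3\t5\t9"]
    keys = bed_keys(sample)
    for line_no, row in matching_rows(data, keys):
        print(line_no, row)
```

Set membership is a hash lookup, so each row of the second file is checked in roughly constant time regardless of how many keys the first file contributed.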

          Good luck.
