Hey all,
If there is a better forum for this type of question, let me know; this is the only one I frequent currently but I can expand if necessary.
Anyway, the question itself is basic. To illustrate the problem, here's a short script that reads in two 3-column BED files (both sorted), checks whether any of their lines match, and periodically prints a progress counter to the screen. Every row in the sample file is present in the data file, but the data file contains extra lines.
The code works, but it works slowly. On my personal laptop (running Windows), it is plenty fast: with files of 800,000+ and 280,000+ lines respectively, it takes <10 seconds to complete. On my work computer (running Linux), though, it takes >30 minutes. Both files are read in very quickly, so the slowdown occurs once the nested for loop begins. Importantly, when the data file (sorted_bed.1) is smaller, it runs much faster (judging by the temp output), but that is not the case with the sorted_bed.2 file (I haven't checked this on my Windows laptop, just the work computer).
Any ideas would be welcome.
Code:
#!/usr/bin/env python
import csv
import sys

datafile = open(sys.argv[1], "rb")
datareader = csv.reader(datafile)
data = []
for row in datareader:
    data.append(row)

samplefile = open(sys.argv[2], "rb")
samplereader = csv.reader(samplefile)
sample = []
for row in samplereader:
    sample.append(row)

begin = 0
temp = 0
for i in range(0,len(sample)):
    for j in range(begin,len(data)):
        temp = temp + 1
        if ((temp%100) == 0):
            print("temp: ", temp)
        if (sample[i][1] == data[j][1]):
            begin=(j+1)
            break
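One thing I could try, as a rough sketch only: if the work machine runs Python 2 (an assumption, I haven't confirmed it), then range(begin, len(data)) builds a full list on every pass of the outer loop, which might explain the slowdown with a large data file. The version below does the same forward scan with a plain index instead of range(). The function name count_matches and its column parameter are made up for illustration, and it returns a match count rather than printing progress.

```python
def count_matches(sample, data, column=1):
    """Count sample rows whose value in `column` is found in data.

    Mirrors the nested loop above: the search for each sample row
    starts just after the previous match, and stays put on a miss.
    """
    matched = 0
    begin = 0          # first data index still eligible for a match
    n = len(data)
    for row in sample:
        key = row[column]
        j = begin
        # Walk forward with an index instead of building a range() list.
        while j < n and data[j][column] != key:
            j += 1
        if j < n:      # found a match at position j
            matched += 1
            begin = j + 1
    return matched
```

Called as count_matches(sample, data) on the two lists read in above, this should make the same comparisons as the original loop without allocating a new list for each sample row.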