Seqanswers Leaderboard Ad

**ucpete** · 12-10-2012, 01:37 PM

Here's some Python code to do it for you, assuming your file is called data.txt:

Code:

inputFile = open("data.txt")
inputFile.next() # skip that first header line
frequencyDict = {}
for line in inputFile:
    curLength = int(line.strip().split(',')[2])
    if curLength in frequencyDict:
        frequencyDict[curLength] += 1
    else:
        frequencyDict[curLength] = 1
input.close()

outputFile = open("data_transformed.txt",'w')
outputFile.write("Length, Quantity\n")
minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
for i in range(minLength,maxLength+1):
    if i in frequencyDict:
        outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
outputFile.close()

You could write it more efficiently with try-except statements, but that's about as clear of an implementation there is IMO and it'll be fast anyway.

**dacotahm** · 12-10-2012, 04:18 PM

Thanks for the response- I'm getting this error when I run it on my data:

Traceback (most recent call last):
File "frequency.py", line 5, in <module>
curLength = int(line.strip().split(',')[2])

IndexError: list index out of range

Any idea what's causing that? I'm not very good at interpereting this kind of code

**ucpete** · 12-10-2012, 04:22 PM

It means that it can't find the third element in the list after splitting the line on commas. Do all of your lines have 2 commas or does your data file look different?

**dacotahm** · 12-10-2012, 04:25 PM

I saw the seperator was a comma, ...I wasn't sure what it was choking on so I ran it on just this fragment and got the same error:

5,b,787
6,b,493
7,b,1344
8,b,906
9,b,960
10,b,544
11,b,4109
12,b,513
13,b,578
14,b,687
15,b,551
16,b,563
17,b,1075
18,b,932
19,b,543
20,b,429
21,b,761
22,b,834
23,b,572
24,b,300
25,b,4790
26,b,667
27,b,656
28,b,1086
29,b,544
30,b,1293

**ucpete** · 12-10-2012, 04:31 PM

I just ran the same code on the same fragment of data and got the following output:

300, 1
429, 1
493, 1
513, 1
543, 1
544, 2
551, 1
563, 1
572, 1
578, 1
656, 1
667, 1
687, 1
761, 1
834, 1
906, 1
932, 1
960, 1
1075, 1
1086, 1
1293, 1
1344, 1
4109, 1
4790, 1

Note that in the script that should read inputFile.close() and not input.close(). Are you running this on a plain text file? Or are you running this based on something that has been in Microsoft Word or Excel? MS products add strange characters to files, but when I copy your test data to a plain text file and save it as 'data.txt', I get no errors...

**ucpete** · 12-10-2012, 04:33 PM

I should also add that the above output skipped the first line of your file, assuming it had a header line in it-- that's why 787 is missing.

**dacotahm** · 12-10-2012, 04:37 PM

I copied it right out of my terminal into Kate and saved it as plain text. I did change the input.close() to inputFile.close(). It still returns the same error.

**ucpete** · 12-10-2012, 04:43 PM

Do you have any empty lines in the file? Otherwise, I don't know what to tell you-- it totally worked as is for me. Assuming you don't have some underlying problem with your input file, this will catch that error and proceed anyway:

Code:

inputFile = open("data.txt")
inputFile.next() # skip that first header line
frequencyDict = {}
for line in inputFile:
    try:
        curLength = int(line.strip().split(',')[2])
    except:
        continue
    if curLength in frequencyDict:
        frequencyDict[curLength] += 1
    else:
        frequencyDict[curLength] = 1
inputFile.close()

outputFile = open("data_transformed.txt",'w')
outputFile.write("Length, Quantity\n")
minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
for i in range(minLength,maxLength+1):
    if i in frequencyDict:
        outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
outputFile.close()

**dacotahm** · 12-10-2012, 04:43 PM

Hm....it worked when I copied a fragment into a new text file using nano but not when I saved the same fragment into a plain text with gedit or kate...

But I think I may have it- thank you for your help and patience.

**ucpete** · 12-10-2012, 04:44 PM

Ah cool, good luck with the rest of your analysis! I used emacs FWIW.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Transcript length/frequncy plot

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News