dacotahm 12-10-2012 12:49 PM

Transcript length/frequncy plot

I've created a few different assemblies, and I'd like to make a dotplot of the number of transcripts at a certain length. I'm using R or python. Right now I've got to the point where my data looks like this (each individual transcript is listed):

Line number, Data set, Length
"42","b",1258
"43","b",517
"44","b",529
"45","b",593
"46","b",1075
"47","b",772

I want to count the number of times a certain length transcript occurs so I can plot length vs. frequency and show which assemblies are generating longer transcripts. I can do this with R once I have the frequencies, but I can't figure out how to do that. Ultimately, I'd like to turn that file into something like:

Length, Quantity
1258, 38
517, 79

Etc...any recommendations on how to transform this?

 ucpete 12-10-2012 01:37 PM

Here's some Python code to do it for you, assuming your file is called data.txt:

Code:

```inputFile = open("data.txt") inputFile.next() # skip that first header line frequencyDict = {} for line in inputFile:     curLength = int(line.strip().split(',')[2])     if curLength in frequencyDict:         frequencyDict[curLength] += 1     else:         frequencyDict[curLength] = 1 input.close() outputFile = open("data_transformed.txt",'w') outputFile.write("Length, Quantity\n") minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys()) for i in range(minLength,maxLength+1):     if i in frequencyDict:         outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i]))) outputFile.close()```
You could write it more efficiently with try-except statements, but that's about as clear of an implementation there is IMO and it'll be fast anyway.

 dacotahm 12-10-2012 04:18 PM

Thanks for the response- I'm getting this error when I run it on my data:

Traceback (most recent call last):
File "frequency.py", line 5, in <module>
curLength = int(line.strip().split(',')[2])

IndexError: list index out of range

Any idea what's causing that? I'm not very good at interpereting this kind of code

 ucpete 12-10-2012 04:22 PM

It means that it can't find the third element in the list after splitting the line on commas. Do all of your lines have 2 commas or does your data file look different?

 dacotahm 12-10-2012 04:25 PM

I saw the seperator was a comma, ...I wasn't sure what it was choking on so I ran it on just this fragment and got the same error:

5,b,787
6,b,493
7,b,1344
8,b,906
9,b,960
10,b,544
11,b,4109
12,b,513
13,b,578
14,b,687
15,b,551
16,b,563
17,b,1075
18,b,932
19,b,543
20,b,429
21,b,761
22,b,834
23,b,572
24,b,300
25,b,4790
26,b,667
27,b,656
28,b,1086
29,b,544
30,b,1293

 ucpete 12-10-2012 04:31 PM

I just ran the same code on the same fragment of data and got the following output:

300, 1
429, 1
493, 1
513, 1
543, 1
544, 2
551, 1
563, 1
572, 1
578, 1
656, 1
667, 1
687, 1
761, 1
834, 1
906, 1
932, 1
960, 1
1075, 1
1086, 1
1293, 1
1344, 1
4109, 1
4790, 1

Note that in the script that should read inputFile.close() and not input.close(). Are you running this on a plain text file? Or are you running this based on something that has been in Microsoft Word or Excel? MS products add strange characters to files, but when I copy your test data to a plain text file and save it as 'data.txt', I get no errors...

 ucpete 12-10-2012 04:33 PM

I should also add that the above output skipped the first line of your file, assuming it had a header line in it-- that's why 787 is missing.

 dacotahm 12-10-2012 04:37 PM

I copied it right out of my terminal into Kate and saved it as plain text. I did change the input.close() to inputFile.close(). It still returns the same error.

 ucpete 12-10-2012 04:43 PM

Do you have any empty lines in the file? Otherwise, I don't know what to tell you-- it totally worked as is for me. Assuming you don't have some underlying problem with your input file, this will catch that error and proceed anyway:

Code:

```inputFile = open("data.txt") inputFile.next() # skip that first header line frequencyDict = {} for line in inputFile:     try:         curLength = int(line.strip().split(',')[2])     except:         continue     if curLength in frequencyDict:         frequencyDict[curLength] += 1     else:         frequencyDict[curLength] = 1 inputFile.close() outputFile = open("data_transformed.txt",'w') outputFile.write("Length, Quantity\n") minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys()) for i in range(minLength,maxLength+1):     if i in frequencyDict:         outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i]))) outputFile.close()```

 dacotahm 12-10-2012 04:43 PM

Hm....it worked when I copied a fragment into a new text file using nano but not when I saved the same fragment into a plain text with gedit or kate...

But I think I may have it- thank you for your help and patience.

 ucpete 12-10-2012 04:44 PM

Ah cool, good luck with the rest of your analysis! I used emacs FWIW.

 All times are GMT -8. The time now is 07:24 PM.