![]() |
Transcript length/frequncy plot
I've created a few different assemblies, and I'd like to make a dotplot of the number of transcripts at a certain length. I'm using R or python. Right now I've got to the point where my data looks like this (each individual transcript is listed):
Line number, Data set, Length "42","b",1258 "43","b",517 "44","b",529 "45","b",593 "46","b",1075 "47","b",772 I want to count the number of times a certain length transcript occurs so I can plot length vs. frequency and show which assemblies are generating longer transcripts. I can do this with R once I have the frequencies, but I can't figure out how to do that. Ultimately, I'd like to turn that file into something like: Length, Quantity 1258, 38 517, 79 Etc...any recommendations on how to transform this? |
Here's some Python code to do it for you, assuming your file is called data.txt:
Code:
inputFile = open("data.txt") |
Thanks for the response- I'm getting this error when I run it on my data:
Traceback (most recent call last): File "frequency.py", line 5, in <module> curLength = int(line.strip().split(',')[2]) IndexError: list index out of range Any idea what's causing that? I'm not very good at interpereting this kind of code |
It means that it can't find the third element in the list after splitting the line on commas. Do all of your lines have 2 commas or does your data file look different?
|
I saw the seperator was a comma, ...I wasn't sure what it was choking on so I ran it on just this fragment and got the same error:
5,b,787 6,b,493 7,b,1344 8,b,906 9,b,960 10,b,544 11,b,4109 12,b,513 13,b,578 14,b,687 15,b,551 16,b,563 17,b,1075 18,b,932 19,b,543 20,b,429 21,b,761 22,b,834 23,b,572 24,b,300 25,b,4790 26,b,667 27,b,656 28,b,1086 29,b,544 30,b,1293 |
I just ran the same code on the same fragment of data and got the following output:
300, 1 429, 1 493, 1 513, 1 543, 1 544, 2 551, 1 563, 1 572, 1 578, 1 656, 1 667, 1 687, 1 761, 1 834, 1 906, 1 932, 1 960, 1 1075, 1 1086, 1 1293, 1 1344, 1 4109, 1 4790, 1 Note that in the script that should read inputFile.close() and not input.close(). Are you running this on a plain text file? Or are you running this based on something that has been in Microsoft Word or Excel? MS products add strange characters to files, but when I copy your test data to a plain text file and save it as 'data.txt', I get no errors... |
I should also add that the above output skipped the first line of your file, assuming it had a header line in it-- that's why 787 is missing.
|
I copied it right out of my terminal into Kate and saved it as plain text. I did change the input.close() to inputFile.close(). It still returns the same error.
|
Do you have any empty lines in the file? Otherwise, I don't know what to tell you-- it totally worked as is for me. Assuming you don't have some underlying problem with your input file, this will catch that error and proceed anyway:
Code:
inputFile = open("data.txt") |
Hm....it worked when I copied a fragment into a new text file using nano but not when I saved the same fragment into a plain text with gedit or kate...
But I think I may have it- thank you for your help and patience. |
Ah cool, good luck with the rest of your analysis! I used emacs FWIW.
|
All times are GMT -8. The time now is 07:24 PM. |
Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.