SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to obtain full length RNA transcript sequence Annibal RNA Sequencing 9 03-14-2012 07:51 AM
length of transcript in cufflinks output papori Bioinformatics 0 08-02-2011 06:06 AM
transcript length bias in enrichment analysis and RPKM PFS RNA Sequencing 1 12-12-2010 06:32 PM
Nonuniformity of reads across transcript length Daniel RNA Sequencing 7 10-07-2010 06:06 AM
normalizing RNA-seq data to "unique transcript length" instead of "transcript length" lmc Bioinformatics 2 06-23-2010 11:45 AM

Reply
 
Thread Tools
Old 12-10-2012, 12:49 PM   #1
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 28
Default Transcript length/frequncy plot

I've created a few different assemblies, and I'd like to make a dotplot of the number of transcripts at a certain length. I'm using R or python. Right now I've got to the point where my data looks like this (each individual transcript is listed):

Line number, Data set, Length
"42","b",1258
"43","b",517
"44","b",529
"45","b",593
"46","b",1075
"47","b",772

I want to count the number of times a certain length transcript occurs so I can plot length vs. frequency and show which assemblies are generating longer transcripts. I can do this with R once I have the frequencies, but I can't figure out how to do that. Ultimately, I'd like to turn that file into something like:

Length, Quantity
1258, 38
517, 79

Etc...any recommendations on how to transform this?
dacotahm is offline   Reply With Quote
Old 12-10-2012, 01:37 PM   #2
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35
Default

Here's some Python code to do it for you, assuming your file is called data.txt:

Code:
inputFile = open("data.txt")
inputFile.next() # skip that first header line
frequencyDict = {}
for line in inputFile:
    curLength = int(line.strip().split(',')[2])
    if curLength in frequencyDict:
        frequencyDict[curLength] += 1
    else:
        frequencyDict[curLength] = 1
input.close()

outputFile = open("data_transformed.txt",'w')
outputFile.write("Length, Quantity\n")
minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
for i in range(minLength,maxLength+1):
    if i in frequencyDict:
        outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
outputFile.close()
You could write it more efficiently with try-except statements, but that's about as clear of an implementation there is IMO and it'll be fast anyway.
ucpete is offline   Reply With Quote
Old 12-10-2012, 04:18 PM   #3
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 28
Default

Thanks for the response- I'm getting this error when I run it on my data:

Traceback (most recent call last):
File "frequency.py", line 5, in <module>
curLength = int(line.strip().split(',')[2])

IndexError: list index out of range

Any idea what's causing that? I'm not very good at interpereting this kind of code

Last edited by dacotahm; 12-10-2012 at 04:20 PM.
dacotahm is offline   Reply With Quote
Old 12-10-2012, 04:22 PM   #4
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35
Default

It means that it can't find the third element in the list after splitting the line on commas. Do all of your lines have 2 commas or does your data file look different?
ucpete is offline   Reply With Quote
Old 12-10-2012, 04:25 PM   #5
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 28
Default

I saw the seperator was a comma, ...I wasn't sure what it was choking on so I ran it on just this fragment and got the same error:

5,b,787
6,b,493
7,b,1344
8,b,906
9,b,960
10,b,544
11,b,4109
12,b,513
13,b,578
14,b,687
15,b,551
16,b,563
17,b,1075
18,b,932
19,b,543
20,b,429
21,b,761
22,b,834
23,b,572
24,b,300
25,b,4790
26,b,667
27,b,656
28,b,1086
29,b,544
30,b,1293
dacotahm is offline   Reply With Quote
Old 12-10-2012, 04:31 PM   #6
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35
Default

I just ran the same code on the same fragment of data and got the following output:

300, 1
429, 1
493, 1
513, 1
543, 1
544, 2
551, 1
563, 1
572, 1
578, 1
656, 1
667, 1
687, 1
761, 1
834, 1
906, 1
932, 1
960, 1
1075, 1
1086, 1
1293, 1
1344, 1
4109, 1
4790, 1

Note that in the script that should read inputFile.close() and not input.close(). Are you running this on a plain text file? Or are you running this based on something that has been in Microsoft Word or Excel? MS products add strange characters to files, but when I copy your test data to a plain text file and save it as 'data.txt', I get no errors...
ucpete is offline   Reply With Quote
Old 12-10-2012, 04:33 PM   #7
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35
Default

I should also add that the above output skipped the first line of your file, assuming it had a header line in it-- that's why 787 is missing.
ucpete is offline   Reply With Quote
Old 12-10-2012, 04:37 PM   #8
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 28
Default

I copied it right out of my terminal into Kate and saved it as plain text. I did change the input.close() to inputFile.close(). It still returns the same error.
dacotahm is offline   Reply With Quote
Old 12-10-2012, 04:43 PM   #9
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35
Default

Do you have any empty lines in the file? Otherwise, I don't know what to tell you-- it totally worked as is for me. Assuming you don't have some underlying problem with your input file, this will catch that error and proceed anyway:

Code:
inputFile = open("data.txt")
inputFile.next() # skip that first header line
frequencyDict = {}
for line in inputFile:
    try:
        curLength = int(line.strip().split(',')[2])
    except:
        continue
    if curLength in frequencyDict:
        frequencyDict[curLength] += 1
    else:
        frequencyDict[curLength] = 1
inputFile.close()

outputFile = open("data_transformed.txt",'w')
outputFile.write("Length, Quantity\n")
minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
for i in range(minLength,maxLength+1):
    if i in frequencyDict:
        outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
outputFile.close()
ucpete is offline   Reply With Quote
Old 12-10-2012, 04:43 PM   #10
dacotahm
Member
 
Location: ND, USA

Join Date: Oct 2011
Posts: 28
Default

Hm....it worked when I copied a fragment into a new text file using nano but not when I saved the same fragment into a plain text with gedit or kate...

But I think I may have it- thank you for your help and patience.
dacotahm is offline   Reply With Quote
Old 12-10-2012, 04:44 PM   #11
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35
Default

Ah cool, good luck with the rest of your analysis! I used emacs FWIW.
ucpete is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:00 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO