Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Transcript length/frequncy plot

    I've created a few different assemblies, and I'd like to make a dotplot of the number of transcripts at a certain length. I'm using R or python. Right now I've got to the point where my data looks like this (each individual transcript is listed):

    Line number, Data set, Length
    "42","b",1258
    "43","b",517
    "44","b",529
    "45","b",593
    "46","b",1075
    "47","b",772

    I want to count the number of times a certain length transcript occurs so I can plot length vs. frequency and show which assemblies are generating longer transcripts. I can do this with R once I have the frequencies, but I can't figure out how to do that. Ultimately, I'd like to turn that file into something like:

    Length, Quantity
    1258, 38
    517, 79

    Etc...any recommendations on how to transform this?

  • #2
    Here's some Python code to do it for you, assuming your file is called data.txt:

    Code:
    inputFile = open("data.txt")
    inputFile.next() # skip that first header line
    frequencyDict = {}
    for line in inputFile:
        curLength = int(line.strip().split(',')[2])
        if curLength in frequencyDict:
            frequencyDict[curLength] += 1
        else:
            frequencyDict[curLength] = 1
    input.close()
    
    outputFile = open("data_transformed.txt",'w')
    outputFile.write("Length, Quantity\n")
    minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
    for i in range(minLength,maxLength+1):
        if i in frequencyDict:
            outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
    outputFile.close()
    You could write it more efficiently with try-except statements, but that's about as clear of an implementation there is IMO and it'll be fast anyway.

    Comment


    • #3
      Thanks for the response- I'm getting this error when I run it on my data:

      Traceback (most recent call last):
      File "frequency.py", line 5, in <module>
      curLength = int(line.strip().split(',')[2])

      IndexError: list index out of range

      Any idea what's causing that? I'm not very good at interpereting this kind of code
      Last edited by dacotahm; 12-10-2012, 04:20 PM.

      Comment


      • #4
        It means that it can't find the third element in the list after splitting the line on commas. Do all of your lines have 2 commas or does your data file look different?

        Comment


        • #5
          I saw the seperator was a comma, ...I wasn't sure what it was choking on so I ran it on just this fragment and got the same error:

          5,b,787
          6,b,493
          7,b,1344
          8,b,906
          9,b,960
          10,b,544
          11,b,4109
          12,b,513
          13,b,578
          14,b,687
          15,b,551
          16,b,563
          17,b,1075
          18,b,932
          19,b,543
          20,b,429
          21,b,761
          22,b,834
          23,b,572
          24,b,300
          25,b,4790
          26,b,667
          27,b,656
          28,b,1086
          29,b,544
          30,b,1293

          Comment


          • #6
            I just ran the same code on the same fragment of data and got the following output:

            300, 1
            429, 1
            493, 1
            513, 1
            543, 1
            544, 2
            551, 1
            563, 1
            572, 1
            578, 1
            656, 1
            667, 1
            687, 1
            761, 1
            834, 1
            906, 1
            932, 1
            960, 1
            1075, 1
            1086, 1
            1293, 1
            1344, 1
            4109, 1
            4790, 1

            Note that in the script that should read inputFile.close() and not input.close(). Are you running this on a plain text file? Or are you running this based on something that has been in Microsoft Word or Excel? MS products add strange characters to files, but when I copy your test data to a plain text file and save it as 'data.txt', I get no errors...

            Comment


            • #7
              I should also add that the above output skipped the first line of your file, assuming it had a header line in it-- that's why 787 is missing.

              Comment


              • #8
                I copied it right out of my terminal into Kate and saved it as plain text. I did change the input.close() to inputFile.close(). It still returns the same error.

                Comment


                • #9
                  Do you have any empty lines in the file? Otherwise, I don't know what to tell you-- it totally worked as is for me. Assuming you don't have some underlying problem with your input file, this will catch that error and proceed anyway:

                  Code:
                  inputFile = open("data.txt")
                  inputFile.next() # skip that first header line
                  frequencyDict = {}
                  for line in inputFile:
                      try:
                          curLength = int(line.strip().split(',')[2])
                      except:
                          continue
                      if curLength in frequencyDict:
                          frequencyDict[curLength] += 1
                      else:
                          frequencyDict[curLength] = 1
                  inputFile.close()
                  
                  outputFile = open("data_transformed.txt",'w')
                  outputFile.write("Length, Quantity\n")
                  minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
                  for i in range(minLength,maxLength+1):
                      if i in frequencyDict:
                          outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
                  outputFile.close()

                  Comment


                  • #10
                    Hm....it worked when I copied a fragment into a new text file using nano but not when I saved the same fragment into a plain text with gedit or kate...

                    But I think I may have it- thank you for your help and patience.

                    Comment


                    • #11
                      Ah cool, good luck with the rest of your analysis! I used emacs FWIW.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin


                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                        Yesterday, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      55 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      52 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      45 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      55 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X