Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Transcript length/frequncy plot

    I've created a few different assemblies, and I'd like to make a dotplot of the number of transcripts at a certain length. I'm using R or python. Right now I've got to the point where my data looks like this (each individual transcript is listed):

    Line number, Data set, Length
    "42","b",1258
    "43","b",517
    "44","b",529
    "45","b",593
    "46","b",1075
    "47","b",772

    I want to count the number of times a certain length transcript occurs so I can plot length vs. frequency and show which assemblies are generating longer transcripts. I can do this with R once I have the frequencies, but I can't figure out how to do that. Ultimately, I'd like to turn that file into something like:

    Length, Quantity
    1258, 38
    517, 79

    Etc...any recommendations on how to transform this?

  • #2
    Here's some Python code to do it for you, assuming your file is called data.txt:

    Code:
    inputFile = open("data.txt")
    inputFile.next() # skip that first header line
    frequencyDict = {}
    for line in inputFile:
        curLength = int(line.strip().split(',')[2])
        if curLength in frequencyDict:
            frequencyDict[curLength] += 1
        else:
            frequencyDict[curLength] = 1
    input.close()
    
    outputFile = open("data_transformed.txt",'w')
    outputFile.write("Length, Quantity\n")
    minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
    for i in range(minLength,maxLength+1):
        if i in frequencyDict:
            outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
    outputFile.close()
    You could write it more efficiently with try-except statements, but that's about as clear of an implementation there is IMO and it'll be fast anyway.

    Comment


    • #3
      Thanks for the response- I'm getting this error when I run it on my data:

      Traceback (most recent call last):
      File "frequency.py", line 5, in <module>
      curLength = int(line.strip().split(',')[2])

      IndexError: list index out of range

      Any idea what's causing that? I'm not very good at interpereting this kind of code
      Last edited by dacotahm; 12-10-2012, 04:20 PM.

      Comment


      • #4
        It means that it can't find the third element in the list after splitting the line on commas. Do all of your lines have 2 commas or does your data file look different?

        Comment


        • #5
          I saw the seperator was a comma, ...I wasn't sure what it was choking on so I ran it on just this fragment and got the same error:

          5,b,787
          6,b,493
          7,b,1344
          8,b,906
          9,b,960
          10,b,544
          11,b,4109
          12,b,513
          13,b,578
          14,b,687
          15,b,551
          16,b,563
          17,b,1075
          18,b,932
          19,b,543
          20,b,429
          21,b,761
          22,b,834
          23,b,572
          24,b,300
          25,b,4790
          26,b,667
          27,b,656
          28,b,1086
          29,b,544
          30,b,1293

          Comment


          • #6
            I just ran the same code on the same fragment of data and got the following output:

            300, 1
            429, 1
            493, 1
            513, 1
            543, 1
            544, 2
            551, 1
            563, 1
            572, 1
            578, 1
            656, 1
            667, 1
            687, 1
            761, 1
            834, 1
            906, 1
            932, 1
            960, 1
            1075, 1
            1086, 1
            1293, 1
            1344, 1
            4109, 1
            4790, 1

            Note that in the script that should read inputFile.close() and not input.close(). Are you running this on a plain text file? Or are you running this based on something that has been in Microsoft Word or Excel? MS products add strange characters to files, but when I copy your test data to a plain text file and save it as 'data.txt', I get no errors...

            Comment


            • #7
              I should also add that the above output skipped the first line of your file, assuming it had a header line in it-- that's why 787 is missing.

              Comment


              • #8
                I copied it right out of my terminal into Kate and saved it as plain text. I did change the input.close() to inputFile.close(). It still returns the same error.

                Comment


                • #9
                  Do you have any empty lines in the file? Otherwise, I don't know what to tell you-- it totally worked as is for me. Assuming you don't have some underlying problem with your input file, this will catch that error and proceed anyway:

                  Code:
                  inputFile = open("data.txt")
                  inputFile.next() # skip that first header line
                  frequencyDict = {}
                  for line in inputFile:
                      try:
                          curLength = int(line.strip().split(',')[2])
                      except:
                          continue
                      if curLength in frequencyDict:
                          frequencyDict[curLength] += 1
                      else:
                          frequencyDict[curLength] = 1
                  inputFile.close()
                  
                  outputFile = open("data_transformed.txt",'w')
                  outputFile.write("Length, Quantity\n")
                  minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
                  for i in range(minLength,maxLength+1):
                      if i in frequencyDict:
                          outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
                  outputFile.close()

                  Comment


                  • #10
                    Hm....it worked when I copied a fragment into a new text file using nano but not when I saved the same fragment into a plain text with gedit or kate...

                    But I think I may have it- thank you for your help and patience.

                    Comment


                    • #11
                      Ah cool, good luck with the rest of your analysis! I used emacs FWIW.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM
                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      24 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      25 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      22 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      52 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X