Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dacotahm
    Member
    • Oct 2011
    • 28

    Transcript length/frequncy plot

    I've created a few different assemblies, and I'd like to make a dotplot of the number of transcripts at a certain length. I'm using R or python. Right now I've got to the point where my data looks like this (each individual transcript is listed):

    Line number, Data set, Length
    "42","b",1258
    "43","b",517
    "44","b",529
    "45","b",593
    "46","b",1075
    "47","b",772

    I want to count the number of times a certain length transcript occurs so I can plot length vs. frequency and show which assemblies are generating longer transcripts. I can do this with R once I have the frequencies, but I can't figure out how to do that. Ultimately, I'd like to turn that file into something like:

    Length, Quantity
    1258, 38
    517, 79

    Etc...any recommendations on how to transform this?
  • ucpete
    Member
    • Dec 2008
    • 35

    #2
    Here's some Python code to do it for you, assuming your file is called data.txt:

    Code:
    inputFile = open("data.txt")
    inputFile.next() # skip that first header line
    frequencyDict = {}
    for line in inputFile:
        curLength = int(line.strip().split(',')[2])
        if curLength in frequencyDict:
            frequencyDict[curLength] += 1
        else:
            frequencyDict[curLength] = 1
    input.close()
    
    outputFile = open("data_transformed.txt",'w')
    outputFile.write("Length, Quantity\n")
    minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
    for i in range(minLength,maxLength+1):
        if i in frequencyDict:
            outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
    outputFile.close()
    You could write it more efficiently with try-except statements, but that's about as clear of an implementation there is IMO and it'll be fast anyway.

    Comment

    • dacotahm
      Member
      • Oct 2011
      • 28

      #3
      Thanks for the response- I'm getting this error when I run it on my data:

      Traceback (most recent call last):
      File "frequency.py", line 5, in <module>
      curLength = int(line.strip().split(',')[2])

      IndexError: list index out of range

      Any idea what's causing that? I'm not very good at interpereting this kind of code
      Last edited by dacotahm; 12-10-2012, 04:20 PM.

      Comment

      • ucpete
        Member
        • Dec 2008
        • 35

        #4
        It means that it can't find the third element in the list after splitting the line on commas. Do all of your lines have 2 commas or does your data file look different?

        Comment

        • dacotahm
          Member
          • Oct 2011
          • 28

          #5
          I saw the seperator was a comma, ...I wasn't sure what it was choking on so I ran it on just this fragment and got the same error:

          5,b,787
          6,b,493
          7,b,1344
          8,b,906
          9,b,960
          10,b,544
          11,b,4109
          12,b,513
          13,b,578
          14,b,687
          15,b,551
          16,b,563
          17,b,1075
          18,b,932
          19,b,543
          20,b,429
          21,b,761
          22,b,834
          23,b,572
          24,b,300
          25,b,4790
          26,b,667
          27,b,656
          28,b,1086
          29,b,544
          30,b,1293

          Comment

          • ucpete
            Member
            • Dec 2008
            • 35

            #6
            I just ran the same code on the same fragment of data and got the following output:

            300, 1
            429, 1
            493, 1
            513, 1
            543, 1
            544, 2
            551, 1
            563, 1
            572, 1
            578, 1
            656, 1
            667, 1
            687, 1
            761, 1
            834, 1
            906, 1
            932, 1
            960, 1
            1075, 1
            1086, 1
            1293, 1
            1344, 1
            4109, 1
            4790, 1

            Note that in the script that should read inputFile.close() and not input.close(). Are you running this on a plain text file? Or are you running this based on something that has been in Microsoft Word or Excel? MS products add strange characters to files, but when I copy your test data to a plain text file and save it as 'data.txt', I get no errors...

            Comment

            • ucpete
              Member
              • Dec 2008
              • 35

              #7
              I should also add that the above output skipped the first line of your file, assuming it had a header line in it-- that's why 787 is missing.

              Comment

              • dacotahm
                Member
                • Oct 2011
                • 28

                #8
                I copied it right out of my terminal into Kate and saved it as plain text. I did change the input.close() to inputFile.close(). It still returns the same error.

                Comment

                • ucpete
                  Member
                  • Dec 2008
                  • 35

                  #9
                  Do you have any empty lines in the file? Otherwise, I don't know what to tell you-- it totally worked as is for me. Assuming you don't have some underlying problem with your input file, this will catch that error and proceed anyway:

                  Code:
                  inputFile = open("data.txt")
                  inputFile.next() # skip that first header line
                  frequencyDict = {}
                  for line in inputFile:
                      try:
                          curLength = int(line.strip().split(',')[2])
                      except:
                          continue
                      if curLength in frequencyDict:
                          frequencyDict[curLength] += 1
                      else:
                          frequencyDict[curLength] = 1
                  inputFile.close()
                  
                  outputFile = open("data_transformed.txt",'w')
                  outputFile.write("Length, Quantity\n")
                  minLength, maxLength = min(frequencyDict.keys()), max(frequencyDict.keys())
                  for i in range(minLength,maxLength+1):
                      if i in frequencyDict:
                          outputFile.write("%s, %s\n" % (str(i),str(frequencyDict[i])))
                  outputFile.close()

                  Comment

                  • dacotahm
                    Member
                    • Oct 2011
                    • 28

                    #10
                    Hm....it worked when I copied a fragment into a new text file using nano but not when I saved the same fragment into a plain text with gedit or kate...

                    But I think I may have it- thank you for your help and patience.

                    Comment

                    • ucpete
                      Member
                      • Dec 2008
                      • 35

                      #11
                      Ah cool, good luck with the rest of your analysis! I used emacs FWIW.

                      Comment

                      Latest Articles

                      Collapse

                      • SEQadmin2
                        Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                        by SEQadmin2


                        I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                        Here are nine questions we think about, in roughly the order they matter, before...
                        06-18-2026, 07:11 AM
                      • SEQadmin2
                        From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                        by SEQadmin2


                        Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                        The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                        ...
                        06-02-2026, 10:05 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by SEQadmin2, Today, 05:37 AM
                      0 responses
                      5 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-26-2026, 11:10 AM
                      0 responses
                      16 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-17-2026, 06:09 AM
                      0 responses
                      50 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-09-2026, 11:58 AM
                      0 responses
                      109 views
                      0 reactions
                      Last Post SEQadmin2  
                      Working...