  • Add 'missing' lines of data by using python code

    I am a beginner when it comes to programming and Python, but I think I have a very simple question.

    I have large tab-delimited files that for example contain lines like this:

    10000 7
    20000 1
    30000 2
    60000 3

    What I want is a file that also contains the 'missing' lines, like this:

    10000 7
    20000 1
    30000 2
    40000 0
    50000 0
    60000 3

    The files are rather large, as I am working with whole-genome sequence data. The first column is a position in the genome and the second column is the number of SNPs I find within that 10 kb window. I don't think this information is really relevant, though; I just want to write some simple Python code that adds these lines to the file using if/else statements.

    So if the position does not equal the position on the previous line + 10000, a 'missing' line is written; otherwise the existing line is written.

    I foresee just one problem with this: several lines in a row can be missing (as in my example). Does anyone have a smart solution for this simple problem?

    Many thanks!

  • #2
    An easy solution would be to loop over the file and have a variable 'previous':

    Untested sample code, written by a tired, coffee-deprived me:

    Code:
    previous = 0
    for line in file:
        now = line.split('\t')[0]
        if  now != previous + 10000:
            for n in range(previous + 10000, now, step=10000):
                print(n + "\t0")
        print(line)
        previous = now

    • #3
      I will definitely try this soon! It always looks so simple in the end, but writing it yourself is still a struggle when you've only just started figuring out coding. Thank you so much. I might come back to it!

      • #4
        If I do this, though, I get an error saying that the range function does not take keyword arguments. I'm not sure how to solve this yet.
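
A minimal sketch of one way around this error, based on the untested code in #2. Python's built-in `range` takes its step as the third positional argument, so `step=10000` has to become a plain `10000`. The sketch also converts the position to an `int` before doing arithmetic on it and formats the filler line as text. The function name `fill_missing_windows` and the `example` list are made up for illustration, and the code assumes that positions are sorted, are multiples of the window size, and start at 10000 as in the example data.

```python
def fill_missing_windows(lines, step=10000):
    """Yield every input line, plus a 'position<TAB>0' line for each skipped window."""
    previous = 0  # assumption: the first expected window starts at `step`
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        now = int(line.split("\t")[0])  # the position is read as text, so convert it
        # range() takes its step positionally: range(start, stop, step).
        # The range is empty when no window was skipped, so no extra if/else is needed,
        # and it covers several missing windows in a row.
        for n in range(previous + step, now, step):
            yield f"{n}\t0"
        yield line
        previous = now


# Hypothetical example data matching the lines in the first post.
example = ["10000\t7\n", "20000\t1\n", "30000\t2\n", "60000\t3\n"]
for out_line in fill_missing_windows(example):
    print(out_line)
```

Because the function only looks at one line at a time, it can be fed an open file object instead of the list, which keeps memory use low for large files.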

        • #5
          I won't write out the code since I have to go to a meeting, but you could also take advantage of the power of Pandas data frame objects. If you are new to Python, learn Pandas as soon as possible.

          But you could create a data frame with one column that contains the values:
          10000
          20000
          30000
          ...
          max_value

          Then create a data frame object of your actual values. Then you simply "join" the two tables, and the join adds a row for every position that is missing from your data.
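
The join described above could look roughly like the sketch below. It is one possible reading of the suggestion, not code from the poster: it uses `DataFrame.merge` with `how="left"` for the join, and the column names `position` and `snps`, the in-memory `raw` string, and the 10000 step are assumptions made for illustration. Positions that are missing from the real data come out of the join as NaN, so the sketch fills them with 0 afterwards.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tab-delimited file from the first post.
raw = "10000\t7\n20000\t1\n30000\t2\n60000\t3\n"
actual = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                     names=["position", "snps"])

# One-column frame with every expected position: 10000, 20000, ..., max_value.
step = 10000
max_value = int(actual["position"].max())
full = pd.DataFrame({"position": list(range(step, max_value + step, step))})

# Left join: keep every expected position; missing ones get NaN in 'snps'.
merged = full.merge(actual, on="position", how="left")

# Replace the NaN values with 0 and restore the integer type.
merged["snps"] = merged["snps"].fillna(0).astype(int)

# Write the result in the same tab-delimited layout as the input.
print(merged.to_csv(sep="\t", header=False, index=False), end="")
```

Note that this loads the whole table into memory, which is usually fine for one line per 10 kb window.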