  • How do I filter rownames based on column value

    I have a dataframe with two columns (target_id and fpkm). I want to keep only the entries in the first column that are not duplicated. If an entry is duplicated, I would like to keep only one copy, chosen by its value in column 2. An example is given below.

    target_id fpkm
    comp247393_c0_seq1 3.197885
    comp257058_c0_seq4 1.624577
    comp242590_c0_seq1 1.750319
    comp77911_c0_seq1 1.293059
    comp241426_c0_seq1 1.626589
    comp288413_c0_seq1 14.828853
    comp294436_c0_seq1 11.555596
    comp63603_c0_seq1 1.982386
    comp267138_c0_seq1 8.594494
    comp267138_c0_seq2 11.134958
    comp321623_c0_seq1 6.934149

    In the above dataframe there are two rownames with (almost) the same name, comp267138_c0_seq1 and comp267138_c0_seq2. I want to keep only comp267138_c0_seq2 because it has the higher value in column 2. Please help me with this.

  • #2
    Assuming you want to keep the seq number, it could be done with a moderately simple Python script:
    Code:
    best_lines = {}
    with open('file_name') as fh:
        print(fh.readline().rstrip())  # Echo the header, then move past it
        for line in fh:
            if not line.strip():
                continue  # Skip blank lines
            target_id, fpkm = line.split()
            fpkm = float(fpkm)  # Turn into a number
            id_base, id_seqnum = target_id.rsplit('_', 1)  # Assume everything before _seq is the same

            # Keep the entry with the highest fpkm for each id_base
            if id_base not in best_lines or fpkm > best_lines[id_base][0]:
                best_lines[id_base] = (fpkm, id_seqnum)

    for id_base, (fpkm, id_seqnum) in best_lines.items():
        print(id_base + "_" + id_seqnum, fpkm)

    This won't necessarily retain the original order of the file (plain dicts only preserve insertion order from Python 3.7 onward), but it will handle the case where, for instance, comp267138_c0_seq1 and comp267138_c0_seq2 aren't on adjacent lines.
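
    For anyone who wants to check the logic without a file on disk, here is a minimal sketch of the same idea as a function that takes lines directly. The function name best_per_base and the rows list are made up for illustration, and the header line is assumed to have been removed already. It relies on Python 3.7+ dicts keeping insertion order, so each ID base stays where it was first seen.

```python
def best_per_base(lines):
    """Return (target_id, fpkm) pairs, keeping the highest-fpkm row per ID base."""
    best = {}  # id_base -> (fpkm, target_id); insertion-ordered on Python 3.7+
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        target_id, fpkm = fields[0], float(fields[1])
        id_base = target_id.rsplit('_', 1)[0]  # everything before the final _seqN
        if id_base not in best or fpkm > best[id_base][0]:
            best[id_base] = (fpkm, target_id)
    return [(target_id, fpkm) for fpkm, target_id in best.values()]


# A few rows from the example in the question (header already dropped)
rows = [
    "comp63603_c0_seq1 1.982386",
    "comp267138_c0_seq1 8.594494",
    "comp267138_c0_seq2 11.134958",
    "comp321623_c0_seq1 6.934149",
]

result = best_per_base(rows)
for target_id, fpkm in result:
    print(target_id, fpkm)
```

    On these rows only comp267138_c0_seq2 survives for the comp267138_c0 base, and the other IDs pass through unchanged. Ties keep the first row seen, since the comparison is a strict greater-than.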

    • #3
      In reply to rflrob's post (#2):
      Hi rflrob, it worked perfectly. I had been struggling to write something like this in Perl for a while but couldn't get it to work, and your script worked like a charm. Don't worry about the order of the IDs; it doesn't matter to me as long as the duplicates are filtered out. Thanks a lot again.
