Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • question about random select certain number of reads from ChIP-seq bed file

    Hi everyone:
    Now I want to normalize ChIP-seq bed file reads to certain number by randomly choose certain number of reads from the original bed file.
    Dose anyone have such kind of scripts to do this? I tried to do the perl script by ran() operator. It will work for small bed file. But it will calculate forever if the reads if around 38 million. So is there any way I can unsort the @array element? So that I can use head to form a certain number of reads by randomly selection. Or Does anyone have better idea about that?

    Thank you

    Lijing

  • #2
    Hi Lijing,

    The (python) script below will downsample a file by writing out each line with probability p. E.g. if you want to sample a bed file from 38 million reads to 1 million p= 1/38 (~0.026):

    Code:
    linesampler.py full.bed sampled.bed 0.026
    You can optionally pass a 4th argument as seed to the random number generator to make the sampling repeatable.

    It's not super-fast being Python but it shouldn't take more than few minutes for sampling a bed file of tens of millions of rows.

    The code for linesampler.py:

    Code:
    #!/usr/local/bin/python
    
    import sys
    import random
    
    if len(sys.argv) < 3 or len(sys.argv) > 5:
        sys.exit("""
    Sample lines from file.
    
    USAGE:
        linesampler.py <file in> <file out> <p> <seed>
        
    file in:  File to sample
    file out: Output file
    p:        Probability of a line to be sampled (sent to output)
    seed :    (Optional) Seed to start the sequence of random numbers
                """)
        
    p= float(sys.argv[3])
    if p < 0 or p > 1:
        sys.exit('Invalid p (%s): p must be between 0 and 1' %(sys.argv[3]))
    
    if len(sys.argv) == 5:
        rseed= sys.argv[4]
    else:
        rseed= None
    random.seed(rseed)
    
    fin= open(sys.argv[1])
    fout= open(sys.argv[2], 'w')
    
    for line in fin:
        prand= random.uniform(0,1)
        if prand <= p:
            fout.write(line)
    
    fin.close()
    fout.close()
    sys.exit()
    Hope it helps
    Dario

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin


      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
      Yesterday, 07:01 AM
    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    55 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    52 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 09:21 AM
    0 responses
    45 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-04-2024, 09:00 AM
    0 responses
    55 views
    0 likes
    Last Post seqadmin  
    Working...
    X