Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • question about random select certain number of reads from ChIP-seq bed file

    Hi everyone:
    Now I want to normalize ChIP-seq bed file reads to certain number by randomly choose certain number of reads from the original bed file.
    Dose anyone have such kind of scripts to do this? I tried to do the perl script by ran() operator. It will work for small bed file. But it will calculate forever if the reads if around 38 million. So is there any way I can unsort the @array element? So that I can use head to form a certain number of reads by randomly selection. Or Does anyone have better idea about that?

    Thank you

    Lijing

  • #2
    Hi Lijing,

    The (python) script below will downsample a file by writing out each line with probability p. E.g. if you want to sample a bed file from 38 million reads to 1 million p= 1/38 (~0.026):

    Code:
    linesampler.py full.bed sampled.bed 0.026
    You can optionally pass a 4th argument as seed to the random number generator to make the sampling repeatable.

    It's not super-fast being Python but it shouldn't take more than few minutes for sampling a bed file of tens of millions of rows.

    The code for linesampler.py:

    Code:
    #!/usr/local/bin/python
    
    import sys
    import random
    
    if len(sys.argv) < 3 or len(sys.argv) > 5:
        sys.exit("""
    Sample lines from file.
    
    USAGE:
        linesampler.py <file in> <file out> <p> <seed>
        
    file in:  File to sample
    file out: Output file
    p:        Probability of a line to be sampled (sent to output)
    seed :    (Optional) Seed to start the sequence of random numbers
                """)
        
    p= float(sys.argv[3])
    if p < 0 or p > 1:
        sys.exit('Invalid p (%s): p must be between 0 and 1' %(sys.argv[3]))
    
    if len(sys.argv) == 5:
        rseed= sys.argv[4]
    else:
        rseed= None
    random.seed(rseed)
    
    fin= open(sys.argv[1])
    fout= open(sys.argv[2], 'w')
    
    for line in fin:
        prand= random.uniform(0,1)
        if prand <= p:
            fout.write(line)
    
    fin.close()
    fout.close()
    sys.exit()
    Hope it helps
    Dario

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM
    • seqadmin
      Strategies for Sequencing Challenging Samples
      by seqadmin


      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
      03-22-2024, 06:39 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    22 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    24 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 09:21 AM
    0 responses
    20 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-04-2024, 09:00 AM
    0 responses
    52 views
    0 likes
    Last Post seqadmin  
    Working...
    X