**dariober** · 09-15-2012, 02:13 PM

Hi Lijing,

The (Python) script below downsamples a file by writing out each line with probability p. E.g. if you want to sample a bed file from 38 million reads down to about 1 million, use p = 1/38 (~0.026):

Code:

linesampler.py full.bed sampled.bed 0.026
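
To pick p for a different file, divide the number of lines you want by the number of lines you have. Because each line is kept independently, the output size is binomial, so you get roughly the target rather than exactly the target. A small sketch of that arithmetic (the function name `sampling_probability` is only for illustration, it is not part of linesampler.py):

```python
import math

def sampling_probability(total_lines, target_lines):
    """Return (p, sd): the keep-probability and the standard deviation
    of the number of lines kept when each line is drawn independently."""
    if total_lines <= 0 or not 0 <= target_lines <= total_lines:
        raise ValueError("need total_lines > 0 and 0 <= target_lines <= total_lines")
    p = target_lines / float(total_lines)
    # The number of kept lines follows Binomial(total_lines, p)
    sd = math.sqrt(total_lines * p * (1 - p))
    return p, sd

p, sd = sampling_probability(38000000, 1000000)
print("p = %.4f, expected spread of about %.0f lines" % (p, sd))
```

For the 38 million example the spread is on the order of a thousand lines, which is usually negligible for this kind of downsampling.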

You can optionally pass a 4th argument as seed to the random number generator to make the sampling repeatable.
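
To see what the seed buys you, here is a minimal sketch in the same spirit as the sampling loop in linesampler.py. The helper `sample_lines` and the made-up read names are assumptions for the demo only; the seed is passed as a string because that is how it arrives from the command line.

```python
import random

def sample_lines(lines, p, seed=None):
    """Keep each line with probability p, using a private generator
    so the global random state is left untouched."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < p]

lines = ["read%d\n" % i for i in range(10000)]
first = sample_lines(lines, 0.026, seed="42")
second = sample_lines(lines, 0.026, seed="42")
print(first == second)  # same seed, same subset
```

Without a seed, two runs will generally pick different subsets. Note that the same seed is not guaranteed to give the same subset across different Python versions.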

It's not super-fast, being Python, but it shouldn't take more than a few minutes to sample a bed file of tens of millions of rows.

The code for linesampler.py:

Code:

#!/usr/bin/env python

import sys
import random

# Expect 3 or 4 arguments after the script name: <file in> <file out> <p> [seed]
if len(sys.argv) < 4 or len(sys.argv) > 5:
    sys.exit("""
Sample lines from file.

USAGE:
    linesampler.py <file in> <file out> <p> [seed]

file in:  File to sample
file out: Output file
p:        Probability of a line to be sampled (sent to output)
seed:     (Optional) Seed to start the sequence of random numbers
""")
    
p= float(sys.argv[3])
if p < 0 or p > 1:
    sys.exit('Invalid p (%s): p must be between 0 and 1' %(sys.argv[3]))

if len(sys.argv) == 5:
    rseed= sys.argv[4]
else:
    rseed= None
random.seed(rseed)

# Keep each line with probability p. random.random() is in [0, 1),
# so p=0 keeps nothing and p=1 keeps everything.
with open(sys.argv[1]) as fin, open(sys.argv[2], 'w') as fout:
    for line in fin:
        if random.random() < p:
            fout.write(line)
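
If you need exactly N reads rather than approximately N, one alternative (this is not what linesampler.py does) is reservoir sampling, which picks a fixed number of lines uniformly at random in a single pass. A minimal sketch, assuming the N selected lines fit in memory; the function name `reservoir_sample` and the made-up read names are for illustration only:

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Return exactly k items chosen uniformly at random in one pass
    (or all items if there are fewer than k)."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines
            reservoir.append(line)
        else:
            # Line i replaces a random slot with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

picked = reservoir_sample(("read%d\n" % i for i in range(100000)), 1000, seed=1)
print(len(picked))  # 1000
```

The selected lines do not come out in file order, so sort the output afterwards if the order of the bed file matters.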

Hope it helps
Dario

question about random select certain number of reads from ChIP-seq bed file
