SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to convert .txt file to .bed or .gff, How can we use chip seq data in R software forevermark4 Bioinformatics 57 06-30-2014 06:01 AM
CNV Simulator -- Random Number Generation gprakhar Bioinformatics 5 05-11-2012 10:34 PM
How to replace select reads in a bam file? Heisman Bioinformatics 8 01-02-2012 03:49 PM
ChIP- seq analysis from bed file mathew Bioinformatics 0 09-29-2011 08:57 PM
Simulating random ChIP-seq peaks nbr Bioinformatics 0 11-03-2009 02:49 PM

Reply
 
Thread Tools
Old 09-14-2012, 03:54 PM   #1
yaolij@gmail.com
Junior Member
 
Location: Los angeles

Join Date: Sep 2012
Posts: 1
Question question about random select certain number of reads from ChIP-seq bed file

Hi everyone:
Now I want to normalize ChIP-seq bed file reads to certain number by randomly choose certain number of reads from the original bed file.
Dose anyone have such kind of scripts to do this? I tried to do the perl script by ran() operator. It will work for small bed file. But it will calculate forever if the reads if around 38 million. So is there any way I can unsort the @array element? So that I can use head to form a certain number of reads by randomly selection. Or Does anyone have better idea about that?

Thank you

Lijing
yaolij@gmail.com is offline   Reply With Quote
Old 09-15-2012, 03:13 PM   #2
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Hi Lijing,

The (python) script below will downsample a file by writing out each line with probability p. E.g. if you want to sample a bed file from 38 million reads to 1 million p= 1/38 (~0.026):

Code:
linesampler.py full.bed sampled.bed 0.026
You can optionally pass a 4th argument as seed to the random number generator to make the sampling repeatable.

It's not super-fast being Python but it shouldn't take more than few minutes for sampling a bed file of tens of millions of rows.

The code for linesampler.py:

Code:
#!/usr/local/bin/python

import sys
import random

if len(sys.argv) < 3 or len(sys.argv) > 5:
    sys.exit("""
Sample lines from file.

USAGE:
    linesampler.py <file in> <file out> <p> <seed>
    
file in:  File to sample
file out: Output file
p:        Probability of a line to be sampled (sent to output)
seed :    (Optional) Seed to start the sequence of random numbers
            """)
    
p= float(sys.argv[3])
if p < 0 or p > 1:
    sys.exit('Invalid p (%s): p must be between 0 and 1' %(sys.argv[3]))

if len(sys.argv) == 5:
    rseed= sys.argv[4]
else:
    rseed= None
random.seed(rseed)

fin= open(sys.argv[1])
fout= open(sys.argv[2], 'w')

for line in fin:
    prand= random.uniform(0,1)
    if prand <= p:
        fout.write(line)

fin.close()
fout.close()
sys.exit()
Hope it helps
Dario
dariober is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:06 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO