**idonaldson** · 11-13-2009, 01:31 AM

nudge

nudge for second viewing, thanks!

**simonandrews** · 11-16-2009, 12:48 AM

There are several ways to do this. A quick way to randomise the order of data held in an array is to shuffle it and then read off the first 10 million entries:

perldoc -q shuffle

Since you probably don't want to read 40 million sequences into memory, you could instead shuffle an array of integers and then read through the list of sequences, printing out only those whose indices were selected.
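A minimal sketch of the index approach, assuming one record per line and that the total number of lines is already known (for example from a first counting pass). The helper names `pick_indices` and `print_selected` are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;
use List::Util 'shuffle';

# Pick exactly $want distinct 0-based line indices out of $total lines
# by shuffling the full index list and keeping the first $want entries.
# Returns a hash reference whose keys are the selected indices.
sub pick_indices {
    my ($total, $want) = @_;
    die "Cannot pick $want of $total lines\n" if $want > $total;
    my @picked = (shuffle(0 .. $total - 1))[0 .. $want - 1];
    my %keep = map { $_ => 1 } @picked;
    return \%keep;
}

# Stream through the input handle one line at a time,
# printing only the lines whose index was selected.
sub print_selected {
    my ($in, $out, $keep) = @_;
    my $index = 0;
    while (my $line = <$in>) {
        print {$out} $line if $keep->{$index};
        $index++;
    }
}

# Optional command-line use: script.pl <input> <total lines> <lines wanted>
if (@ARGV == 3) {
    my ($file, $total, $want) = @ARGV;
    open(my $in, '<', $file) or die "Cannot read $file: $!\n";
    print_selected($in, \*STDOUT, pick_indices($total, $want));
    close($in);
}
```

The sequences themselves are never held in memory, but the index list and the lookup hash still grow with the size of the file, so this is lighter than loading every line rather than free.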

Another way to go would be to use a random function to decide whether to print out each individual sequence. If you give your function a probability of 0.25 then you'll get approximately 1/4 of your data printed, but this won't be exactly 1/4 of the data and it will be different each time you run it.

eg:

print $sequence if (rand() < 0.25);
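As a fuller sketch of the same idea, the one-liner can be wrapped in a loop over a filehandle. This assumes one record per line; `sample_lines` is a hypothetical helper name. Calling `srand` with a fixed seed beforehand should make a run repeatable on the same Perl build, though the number of lines kept is still only approximately the requested fraction.

```perl
use strict;
use warnings;

# Print each line read from $in to $out with probability $p.
# Returns the number of lines kept, which is only approximately
# $p times the number of input lines.
sub sample_lines {
    my ($in, $out, $p) = @_;
    my $kept = 0;
    while (my $line = <$in>) {
        if (rand() < $p) {
            print {$out} $line;
            $kept++;
        }
    }
    return $kept;
}

# Optional command-line use: script.pl <input> <probability> [seed]
if (@ARGV >= 2) {
    my ($file, $p, $seed) = @ARGV;
    srand($seed) if defined $seed;    # fixed seed for a repeatable run
    open(my $in, '<', $file) or die "Cannot read $file: $!\n";
    sample_lines($in, \*STDOUT, $p);
    close($in);
}
```

Only one line is held in memory at a time, so the size of the input file does not matter here.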

**idonaldson** · 11-17-2009, 03:58 AM

Thanks for your suggestion Simon!

Here is my offering of Perl code for anyone who finds it useful:

#!/usr/bin/perl

use strict;
use warnings;
use List::Util 'shuffle';

# Randomise the order of lines in a file
# Ian Donaldson. Nov. 2009

# Usage
unless (@ARGV == 2) {
    die("Usage: $0 <input file> <output file>\n");
}

# Open files, stopping with an error if either cannot be opened
open(my $input, '<', $ARGV[0]) or die("Cannot read $ARGV[0]: $!\n");
open(my $output, '>', $ARGV[1]) or die("Cannot write $ARGV[1]: $!\n");

# Put whole file into memory (OK unless very big)
my @list = <$input>;

# Shuffle array
my @shuffled = shuffle(@list);

# Print shuffled array to output
print {$output} @shuffled;

# Close files
close($input);
close($output) or die("Error closing $ARGV[1]: $!\n");

exit;

**idonaldson** · 11-24-2009, 03:25 AM

Better random line extractor script

Here is another version of a script that will extract N random lines from a BED file and produce a ChIP alignment formatted file. This is all designed for use with GLITR, but could be adapted for other formats. An input file of ~40 million tags/lines still required in excess of 2 GB of memory!

Attached Files

getRandomTags_index.pl (1.5 KB, 26 views)
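If memory is the limiting factor, one alternative is selection sampling (Knuth's Algorithm S), which picks exactly N lines in a single streaming pass without storing any index list. This is a swapped-in technique, not what the attached script does, and the sketch below makes two assumptions: one tag per line, and a total line count obtained beforehand (for example with `wc -l`). It copies lines unchanged and does no format conversion. `select_exact` is a hypothetical name.

```perl
use strict;
use warnings;

# Selection sampling: copy exactly $want of $total lines from $in to $out.
# Each line is taken with probability
#   (lines still needed) / (lines still unread),
# which gives every line the same overall chance of being chosen.
# Selected lines keep their original order.
sub select_exact {
    my ($in, $out, $total, $want) = @_;
    die "Cannot pick $want of $total lines\n" if $want > $total;
    my ($seen, $chosen) = (0, 0);
    while (my $line = <$in>) {
        last if $chosen == $want;
        if (rand($total - $seen) < $want - $chosen) {
            print {$out} $line;
            $chosen++;
        }
        $seen++;
    }
    return $chosen;
}

# Optional command-line use: script.pl <input> <total lines> <lines wanted>
if (@ARGV == 3) {
    my ($file, $total, $want) = @ARGV;
    open(my $in, '<', $file) or die "Cannot read $file: $!\n";
    select_exact($in, \*STDOUT, $total, $want);
    close($in);
}
```

The count passed in must match the real number of lines; if it is too large, fewer than N lines may be written.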

Extracting random tags?
