
**maasha** · 01-13-2012, 12:31 PM

With Biopieces you can do:

Code:

read_fasta -i input.fna | random_records -n 1000 | write_fasta -o random.fna -x

**Richard Finney** · 01-13-2012, 02:11 PM

cat data | shuf | head -NUMBEROFLINES

If the input is fasta, you'll have to join every other line with the next line and undo it. Example:

cat test.fa| awk '{if ((NR%2)==0)print prev"XXXXXX"$0;prev=$0;}' | shuf | head -1000 | sed 's/XXXXXX/\n/'

shuf may not be installed on your system; if not, try

sort -R file.txt | head -NUMBEROFLINES

yep, "sort by random" !!!
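For readers who prefer a script to a pipeline, the join, shuffle, truncate, and split steps above can be sketched in Python. This is a minimal illustration, not part of the original post: the function name `sample_two_line_fasta` and its `seed` parameter are made up here, and it assumes strictly two-line FASTA (one header line followed by one sequence line), just like the awk example.

```python
import random


def sample_two_line_fasta(lines, n, seed=None):
    """Return the lines of n random records from strictly two-line FASTA input.

    Hypothetical helper: assumes every record is exactly one header line
    followed by one sequence line, as in the awk/shuf pipeline.
    """
    rng = random.Random(seed)
    stripped = [line.rstrip("\n") for line in lines]
    # Join: pair each header with the sequence line that follows it.
    records = list(zip(stripped[0::2], stripped[1::2]))
    # Shuffle and truncate in one step (the role of `shuf | head`).
    chosen = rng.sample(records, min(n, len(records)))
    # Undo the join: flatten the pairs back into individual lines.
    return [part for record in chosen for part in record]
```

Passing an open file handle as `lines` and joining the result with newlines should give output comparable to the pipeline. Like shuf, this holds every record in memory.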

**ETHANol** · 01-14-2012, 05:10 AM

Subsampling using 'head -n #"? - SEQanswers

http://seqanswers.com/forums/showthread.php?t=16505


**dawe** · 01-14-2012, 11:48 AM

Originally posted by Richard Finney:

cat data | shuf | head -NUMBEROFLINES

If the input is fasta, you'll have to join every other line with the next line and undo it. Example:

cat test.fa| awk '{if ((NR%2)==0)print prev"XXXXXX"$0;prev=$0;}' | shuf | head -1000 | sed 's/XXXXXX/\n/'

shuf may not be installed on your system; if not, try

sort -R file.txt | head -NUMBEROFLINES

yep, "sort by random" !!!

Beware that shuf and sort -R are GNU. If you have a BSD system (OS X, FreeBSD...) those won't work.

**Richard Finney** · 01-14-2012, 04:37 PM

I think this technique would work on BSD systems (and GNU systems) ...

cat file.txt | awk 'BEGIN{srand()}{print rand()" "$0}' | sort -n | head -1000 | cut -d" " -f2-
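The same idea (tag every line with a random number, sort by it, keep the first 1000, drop the tag) can be sketched in Python. This is an illustration under assumptions, not the poster's code: the function name `sample_by_random_key` and the `seed` parameter are invented for the example. Using `heapq.nsmallest` instead of a full sort means only about n decorated lines are held at a time.

```python
import heapq
import random


def sample_by_random_key(lines, n, seed=None):
    """Pick n lines by attaching a random key to each and keeping the n smallest keys.

    Hypothetical helper mirroring the `rand() | sort -n | head | cut` pipeline.
    """
    rng = random.Random(seed)
    # Decorate: (random key, position, line). The position breaks ties so
    # that the lines themselves never need to be compared.
    decorated = ((rng.random(), index, line) for index, line in enumerate(lines))
    # Select: the n smallest keys, which is what `sort -n | head -n` keeps.
    smallest = heapq.nsmallest(n, decorated)
    # Undecorate: drop the key and position (the role of `cut`).
    return [line for _key, _index, line in smallest]
```

Because each line gets an independent uniform key, any set of n lines is equally likely to end up with the smallest keys.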

**lh3** · 01-14-2012, 05:59 PM

Suppose we want to sample n elements from a pool of N. The space complexity of the Biopieces solution is O(N) as it loads all sequences into memory. I guess shuf is no better. The optimal algorithm is to use reservoir sampling. The space complexity is O(n) instead of O(N). Of course, if N is not so large, it does not matter.

The following is an awk snippet that randomly samples k=10 lines from a text file. Note that this program keeps at most k=10 lines in memory.

Code:

cat file.txt|awk -v k=10 'BEGIN{srand()}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'
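Since the one-liner is dense, here is the same logic spelled out in Python as a rough sketch. The function name `reservoir_sample` and the `seed` parameter are assumptions made for this example. The first k items fill the reservoir; after that, item number i replaces a random slot with probability k/i.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Sample up to k items from any iterable while holding at most k in memory.

    Hypothetical helper following the same steps as the awk snippet.
    """
    rng = random.Random(seed)
    reservoir = []
    for seen, item in enumerate(items, start=1):
        if seen <= k:
            # The first k items always go in.
            reservoir.append(item)
        else:
            # Draw a slot in [0, seen); only slots below k land in the reservoir,
            # so the new item is kept with probability k / seen.
            slot = rng.randrange(seen)
            if slot < k:
                reservoir[slot] = item
    return reservoir
```

Passing an open file handle as `items` samples lines without reading the whole file into memory, which is the O(n) versus O(N) point made above.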

With bioawk, you can process fasta files this way:

Code:

awk -c fastx -v k=10 'BEGIN{srand()}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz
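If bioawk is not available, a rough Python equivalent might look like the sketch below. It is an assumption-laden illustration, not bioawk's implementation: `read_fasta` and `sample_fasta` are hypothetical names, the record name is taken as the first word of the header line, and gzip input is detected from a `.gz` suffix. Unlike the line-based examples, it handles sequences wrapped over several lines.

```python
import gzip
import random


def read_fasta(handle):
    """Yield (name, sequence) pairs from a text handle; sequences may span lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Assumption: the record name is the first word of the header.
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sample_fasta(path, k, seed=None):
    """Reservoir-sample up to k records from a plain or gzipped FASTA file."""
    rng = random.Random(seed)
    opener = gzip.open if path.endswith(".gz") else open
    kept = []
    with opener(path, "rt") as handle:
        for seen, record in enumerate(read_fasta(handle), start=1):
            if seen <= k:
                kept.append(record)
            else:
                # Keep the new record with probability k / seen.
                slot = rng.randrange(seen)
                if slot < k:
                    kept[slot] = record
    return kept
```

Each returned pair can be written back out as a ">" plus name line followed by the sequence line, matching the output of the bioawk command.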

**kga1978** · 01-14-2012, 06:41 PM

Re: ETHANol's post: Here's a link to a script we made that will subsample any fastq or fasta file:

http://cl.ly/3Q2Y1Z222M0J220w3c1I

Type subsampler.py -h for instructions - can do SE and PE reads.

**silver_steve** · 01-16-2012, 08:11 AM

Originally posted by maasha:

With Biopieces you can do:

Code:

read_fasta -i input.fna | random_records -n 1000 | write_fasta -o random.fna -x

Thanks Maasha. I've had some difficulty installing Biopieces. Particularly, I haven't been able to install the required perl modules. I get errors like this:
ERROR: Can't create '/Library/Perl/Updates/5.12.3/Module'
mkdir /Library/Perl/Updates/5.12.3/Module: Permission denied at /System/Library/Perl/5.12/ExtUtils/Install.pm line 494

Do you know how to get around this?

**maasha** · 01-16-2012, 11:07 PM

You need the correct permissions. Try using "sudo". Also, you should probably not contaminate this thread with Perl support requests. Try Stack Exchange or the Biopieces Google group.

Cheers

Martin


Script for extracting random sub-set of sequences
