**maasha** · 01-13-2012, 12:31 PM

With Biopieces you can do:

Code:

read_fasta -i input.fna | random_records -n 1000 | write_fasta -o random.fna -x

**Richard Finney** · 01-13-2012, 02:11 PM

cat data | shuf | head -NUMBEROFLINES

If the input is FASTA with each sequence on a single line, you'll have to join every header line with the sequence line that follows it, shuffle, and then split them apart again. Example:

cat test.fa | awk '{if ((NR%2)==0)print prev"XXXXXX"$0;prev=$0;}' | shuf | head -1000 | sed 's/XXXXXX/\n/'

shuf may not be installed on your system. If it is not, try

sort -R file.txt | head -NUMBEROFLINES

Yep, "sort by random"!
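The pairing trick above assumes every record is exactly two lines. For FASTA files whose sequences wrap over several lines, a minimal sketch is to join each record onto one tab-separated line first, shuffle, and then split the header off again. The function name `sample_fasta` is made up for illustration, and the sketch assumes GNU `shuf` is available and that headers contain no tab characters.

```shell
# sample_fasta N FILE : print N randomly chosen FASTA records.
# Hypothetical helper; assumes GNU shuf and tab-free header lines.
sample_fasta() {
    n=$1
    file=$2
    awk '
        /^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }  # header opens a new joined line
             { printf "%s", $0 }                                   # append each sequence chunk
        END  { if (NR > 0) printf "\n" }                           # terminate the last record
    ' "$file" | shuf -n "$n" | tr '\t' '\n'
}
```

Usage would be something like `sample_fasta 1000 test.fa > random.fa`. Each sampled sequence comes out on a single line, so the original line wrapping is not preserved.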

**ETHANol** · 01-14-2012, 05:10 AM

Subsampling using 'head -n #"? - SEQanswers

http://seqanswers.com/forums/showthread.php?t=16505


**dawe** · 01-14-2012, 11:48 AM

Beware that shuf and sort -R are GNU. If you have a BSD system (OS X, FreeBSD...) those won't work.

**Richard Finney** · 01-14-2012, 04:37 PM

I think this technique would work on BSD systems (and GNU systems) ...

cat file.txt | awk 'BEGIN{srand()} {print rand()" "$0}' | sort -n | head -1000 | cut -d" " -f2-
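The same random-key idea can be applied to whole FASTA records rather than single lines. The following is only a sketch under some assumptions: `portable_sample_fasta` is a hypothetical name, headers must not contain tabs, and `srand()` with no argument seeds from the clock, so two runs within the same second may return the same sample. It uses only awk, sort, head, cut and tr, so it does not depend on `shuf` or `sort -R`.

```shell
# portable_sample_fasta N FILE : print N random FASTA records without GNU-only tools.
# Hypothetical helper; assumes tab-free header lines.
portable_sample_fasta() {
    awk '
        BEGIN { srand() }                                            # seed from the clock
        /^>/ { if (rec != "") printf "%.12f\t%s\n", rand(), rec      # emit previous record with a random key
               rec = $0 "\t"; next }                                 # start a new record: header, then a tab
             { rec = rec $0 }                                        # append sequence chunks to the record
        END  { if (rec != "") printf "%.12f\t%s\n", rand(), rec }    # emit the final record
    ' "$2" | sort -k1,1 | head -n "$1" | cut -f2- | tr '\t' '\n'
}
```

The key is printed with a fixed number of decimals so that a plain lexical sort orders it correctly. Like the line-based version, this still sorts all N records, so it is not memory- or time-optimal for very large inputs.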

**lh3** · 01-14-2012, 05:59 PM

Suppose we want to sample n elements from a pool of N. The space complexity of the Biopieces solution is O(N) as it loads all sequences into memory. I guess shuf is no better. The optimal algorithm is to use reservoir sampling. The space complexity is O(n) instead of O(N). Of course, if N is not so large, it does not matter.

The following is an awk snippet that randomly samples k=10 lines from a text file. Note that this program keeps at most k=10 lines in memory.

Code:

cat file.txt | awk -v k=10 'BEGIN{srand()}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'

With bioawk, you can process fasta files this way:

Code:

bioawk -c fastx -v k=10 'BEGIN{srand()}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz
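For readers who find the one-liner dense, here is a spelled-out sketch of the same reservoir sampling idea in plain awk. The function name `reservoir_sample` and the optional seed argument are additions for illustration, not part of the snippet above. The first k lines fill the reservoir; every later line number NR replaces a random slot with probability k/NR, so only k lines are ever held in memory.

```shell
# reservoir_sample K [SEED] < input : print K uniformly sampled lines.
# Hypothetical helper; SEED is optional and makes a run repeatable.
reservoir_sample() {
    awk -v k="$1" -v seed="${2:-}" '
        BEGIN { if (seed != "") srand(seed); else srand() }
        NR <= k { pool[NR] = $0; next }      # fill the reservoir with the first k lines
        { j = int(rand() * NR) + 1           # uniform integer in 1..NR
          if (j <= k) pool[j] = $0 }         # keep this line with probability k/NR
        END { m = (NR < k) ? NR : k          # input may be shorter than k
              for (i = 1; i <= m; i++) print pool[i] }
    '
}
```

For example, `reservoir_sample 10 < file.txt` prints 10 lines, and `reservoir_sample 10 42 < file.txt` should print the same 10 lines on every run with the same awk implementation. The output is in slot order, not in the original file order.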

**kga1978** · 01-14-2012, 06:41 PM

Re: ETHANol's post: Here's a link to a script we made that will subsample any fastq or fasta file:


http://cl.ly/3Q2Y1Z222M0J220w3c1I

Run subsampler.py -h for instructions. It can handle both single-end (SE) and paired-end (PE) reads.

**silver_steve** · 01-16-2012, 08:11 AM

Thanks Maasha. I've had some difficulty installing Biopieces. Particularly, I haven't been able to install the required perl modules. I get errors like this:
ERROR: Can't create '/Library/Perl/Updates/5.12.3/Module'
mkdir /Library/Perl/Updates/5.12.3/Module: Permission denied at /System/Library/Perl/5.12/ExtUtils/Install.pm line 494

Do you know how to get around this?

**maasha** · 01-16-2012, 11:07 PM

You need the correct permissions. Try using "sudo". Also, you should probably not clutter this thread with Perl support requests. Try Stack Exchange or the Biopieces Google group.

Cheers

Martin

Script for extracting random sub-set of sequences
