Seqanswers Leaderboard Ad

**swbarnes2** · 01-16-2012, 11:07 AM

I think a simpler way to do it...just get all the reads from a particular set of tiles (I don't know how that translates on HiSeq machines) Since the first few and last few tiles are bad, pick some tiles in the middle, by grepping the read names, which have teh cluster cooridnates embedded in them. Or zgrep, if the fastqs are already zipped. That should be random with respect to where the read align to the genome, with better quality than picking the top and bottom tiles with head or foot.

But yes, this won't work on multi-line fastqs.

**jmun7616** · 06-15-2012, 04:40 PM

Hey, just notices this thread. This Python solution may be a bit late but it will work on both multi-line fasta and fastq files. Set up is such that all files in same dir will be subsampled, but this can be changed. Also three subsamplings taken, but you can change this easily enough or just use one of them.

Code:

import random
import glob
inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))

for i in range(len(inFiles)):

        for k in range(3):
                fNames = []
                fSeqs = []
                same = 0
                outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fa', 'wt'))
                for line in open(inFiles[i]):
                        if ((line[0] == '>') or (line[0] == '@')):
                                fNames.append(line)
                                same = 0
                        else:
                                if (same != 1):
                                        fSeqs.append(line)
                                        same = 1
                                else:
                                        fSeqs[(len(fSeqs)-1)] += line
                curr = (len(outFiles)-1)
                for j in range(num):
                        a = random.randint(0, (len(fNames)-1))
                        outFiles[curr].write(fNames.pop(a))
                        outFiles[curr].write(fSeqs.pop(a))
raw_input("Done.")

**gringer** · 06-17-2012, 06:30 AM

This Python solution may be a bit late but it will work on both multi-line fasta and fastq files.

A few warnings about this Python code:

ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).

**Richard Finney** · 06-17-2012, 07:46 AM

A simple script extracting random fastq records using reservoir sampling ...

# usage: program inputfastq numsubsamples outputsubampfqfile
# example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

cat $1 | \
awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
tr '\t' '\n' > $3

See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

This program joins the 4 fastq lines together into one line , spearating them with a tab character.
These combined lines are subsampled.
At the end, the "tr" command seperates the selected combined lines back into 4 lines.
This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".

**gringer** · 06-17-2012, 02:59 PM

This program joins the 4 fastq lines together into one line , spearating them with a tab character.

That particular code won't work with multi-line fasta/fastq files (the linked subsampler.py script has the same issue, as well as pre-reading all the lines to determine the total number of lines, and some modulus code that may influence probabilities in a bad way), but the reservoir sampling method used would provide a way to process files from standard input in a single pass, and use multi-line files from standard input. However, you need to be a bit careful with the selection procedure to ensure a uniform chance of selection for every record. After a bit of hunting, I see some reasonable pseudo-code here (linked from a stackoverflow thread):

http://blogs.msdn.com/b/spt/archive/...-sampling.aspx

The modified pseudo-code for a fasta/fastq file using reservoir sampling would be as follows:

Look for magic record start sequences. If found, increment the record counter and clear 'current record', otherwise print out the output array and finish
If a fasta file, read through and store up to the next record start as the 'current record', then go to step 6
If a fastq file, read up to the next '+' to work out sequence length (store in 'current record')
read the next line and store in 'current record'
read as many quality values as were in the sequence, store in 'current record'
if record counter is less than desired number of records, add current record to end of output array, return to step 1
generate a random number between 0 and the total number of records read
if this number is less than the number of desired records, replace the array element indexed by that number by the current record, return to step 1

The selection procedure here is (I think) the same as in the awk script that Richard Finney wrote / copied, which is (I think) the same as in the msdn pseudo-code.

**golharam** · 06-19-2012, 12:45 PM

I had this problem too. I actually built a sub sampler that works on fragment and paired-end fastq (gzip'ed or not) data. Its written in Java and based on the Picard framework. https://github.com/golharam/FastqSubsample.

Benefits of using this:
1) No external Python packages to install. All you need is a JRE and the JAR file.
2) Doesn't load everything into memory.

**jmun7616** · 06-20-2012, 03:24 AM

Originally posted by gringer View Post

A few warnings about this Python code:

ASCII quality scores in fastq files can contain arbitrary characters like '>' and '@'. The only way to protect against this for multi-line fastq files is to keep a record of the sequence length.
This code reads the entire set of reads into memory, then outputs random sequences -- this will not work on fasta/fastq files that are larger than the memory limit. While this is the only way to do it for streamed input, the code here uses file names, so can do a two-pass operation and reopen a file (see my previous post).

This revised version addresses both these issues:

Code:

import random
import glob
inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))


for i in range(len(inFiles)):
        bookMarks = []
        fNames = []
        fSeqs = []
        start = 0
        BkMk = 0                                             
        outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'.fa', 'wt'))

        for line in open(inFiles[i]):
                if ((line[0] == '>') or (line[0] == '@')):
                        start = 1        
                        bookMarks.append(BkMk)
                else:
                        if (start == 1):
                                if (((line.count('A') + line.count('a')+ line.count('C') + line.count('c') +
                                line.count('G') + line.count('g') + line.count('T') + line.count('t')+
                                line.count('N')+ line.count('n')+ line.count('\n')+ line.count('\r')+
                                line.count('\t')) == len(line)) and ((len(line))>1)):
                                        start = 0
                                else:
                                        bookMarks.pop()
                                        start = 0         
                BkMk+=1
        size = (len(bookMarks))
        print str(size) +" sequences identified in " + inFiles[i] + "\n"

        randomMarks = range(size)
        for j in range((len(bookMarks)-num-1)):
                a = random.randint(0, (len(randomMarks)-1))
                randomMarks.pop(a)

        j = 0
        k = 0
        write = 0
        for line in open(inFiles[i]):
                if (write == 1):
                        if ((randomMarks[k]+1) >= size):
                                outFiles[i].write(line)
                                break
                        elif(j == (bookMarks[randomMarks[k]+1])):
                               write = 0
                               k+=1
                               if (k >= num):
                                        break
                        else:
                                outFiles[i].write(line)
                if (j == bookMarks[randomMarks[k]]):
                       write = 1
                       outFiles[i].write(line)
                j+=1
raw_input("Done.")

**maubp** · 06-20-2012, 05:08 AM

It would have been less effort to reuse Biopython's FASTQ parser which already deals with the nasty line wrapping cases. e.g.

OBF » Working with FASTQ files in Biopython when speed matters » Working with FASTQ files in Biopython when speed matters

http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Open Bioinformatics Foundation Homepage

**JahnDavik** · 04-18-2013, 12:25 AM

Thanks for posting this code. I really need it and I really hope it works?

**JahnDavik** · 04-18-2013, 12:27 AM

Ad:random sampling in forward and reverse

Thanks for posting this code. I really need it and I really hope it works.

**JahnDavik** · 04-18-2013, 12:30 AM

Thanks. This is what I'd need. Hope it works :-)

**syfo** · 04-18-2013, 10:02 AM

Originally posted by Richard Finney View Post

A simple script extracting random fastq records using reservoir sampling ...

# usage: program inputfastq numsubsamples outputsubampfqfile
# example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq

cat $1 | \
awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
tr '\t' '\n' > $3

See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

Note the acutal line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}
Remove the seed code to get same results every time you run it. Otherwise leave it in to get different results.

This program joins the 4 fastq lines together into one line , spearating them with a tab character.
These combined lines are subsampled.
At the end, the "tr" command seperates the selected combined lines back into 4 lines.
This script used the UNIX/LINUX/BSD standard utilities "tr", "cat", and "awk".

Reservoir sampling! Thank you so much, this is exactly what I was looking for!

**chayan** · 05-04-2013, 05:31 AM

hii all
how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

Regards
Chayan

**maubp** · 05-04-2013, 06:46 AM

Originally posted by chayan View Post

hii all
how can i use the above mentioned scripts to randomly extract a fixed % of output from .sff file, cause am working of a iontorrent denovo assembly with MIRA and i need to extract both the fastq as well as traceinfo.xml... (as mentioned by the sequence provider as well as Bastein).. thanx for any help in advance

Regards
Chayan

SFF is a binary format with a non-trivial header. Unix text tools like grep are not going to be any help. I would suggest using the Roche 'off instrument' applications which include an SFF tool which can select random reads (Linux only, search for how to get Newbler on these forums for more information).

Alternatively, if you program in Python, this would be pretty trivial using the Biopython SFF support. http://biopython.org Likewise BioHaskell has SFF support and command line tools for working with these files. http://biohaskell.org

**chayan** · 05-04-2013, 11:08 AM

Thanx maubp, i am not sure but probably those Roche applications are not free to access..still i ll search those and surely ll give a try to your other suggestions..again thnx for your kind help

Regards

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 55 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 52 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 45 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News