**Simon Anders** · 06-15-2011, 08:13 AM

This sounds like a job for HTSeq:

Code:

import sys, random
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

while True:
   read1 = next( in1 )
   read2 = next( in2 )
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()

Save this as subsample.py and call it as

Code:

python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>

(where <fraction> is a number between 0 and 1, giving the sampling fraction)

Simon

**dnusol** · 06-16-2011, 06:11 AM

Hi Simon, thanks for your help. I ran the script and it seemed to work: the new files were generated, but I got this on stderr:

Traceback (most recent call last):
  File "/home/david/pyscripts/fastqPairedSubsample.py", line 15, in <module>
    read1 = next( in1 )
  File "/usr/local/lib/python2.6/dist-packages/HTSeq/__init__.py", line 381, in __iter__
    id1 = fin.next()
StopIteration

Is that something I should worry about? What does it mean?

Cheers,

Dave

**Simon Anders** · 06-16-2011, 07:04 AM

Hi Dave

It should be fine. The error just indicates that the script finished reading the input, and I forgot to catch it. Correction (untested, sorry):

Code:

 
import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

for read1, read2 in itertools.izip( in1, in2 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()
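
For readers on Python 3, where `itertools.izip` no longer exists, the same idea can be sketched with the built-in `zip` and without HTSeq. This is only a minimal sketch, not Simon's script: the helper names `read_fastq` and `subsample_pairs` are made up for illustration, and it assumes uncompressed FASTQ files in which every record is exactly four lines (no wrapped sequence lines).

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield one FASTQ record at a time as a list of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return  # end of input: the generator simply stops, no exception escapes
        yield record


def subsample_pairs(in1, in2, out1, out2, fraction, rng=random):
    """Keep each read pair with probability `fraction`; return the number of pairs kept."""
    kept = 0
    # zip() stops at the end of the shorter input, like itertools.izip in Python 2
    for rec1, rec2 in zip(read_fastq(in1), read_fastq(in2)):
        # one random draw per pair, so both mates are kept or dropped together
        if rng.random() < fraction:
            out1.writelines(rec1)
            out2.writelines(rec2)
            kept += 1
    return kept
```

The four file arguments are open text-mode file objects, for example the results of `open(path)` for the inputs and `open(path, "w")` for the outputs.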

**sheng** · 06-19-2011, 07:54 PM

Single-end version of random sampling?

Hi Simon,

Thanks a lot for presenting this Python code for random sampling of paired-end sequencing data. I am new to Python and the code is very helpful.
I was wondering whether I can use a similar script for single-end data. I tried to change your code into a single-end version, but I got an error:
Traceback (most recent call last):
  File "subsample_se.py", line 10, in <module>
    read1.write_to_fastq_file( out1 )
AttributeError: 'tuple' object has no attribute 'write_to_fastq_file'

Any idea about that? Thanks in advance!

The code is:

Code:

import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
out1 = open( sys.argv[3], "w" )

for read1 in itertools.izip( in1 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )

out1.close()

**Simon Anders** · 06-19-2011, 11:29 PM

Try replacing this line

Code:

for read1 in itertools.izip( in1 ):

with

Code:

for read1 in in1:

and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

S
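
The `AttributeError` arises because `izip` always yields tuples, even when it is given a single iterable, so `read1` was a one-element tuple rather than a read object. A quick illustration with Python 3's built-in `zip`, which behaves the same way in this respect (the list `reads` is just placeholder data):

```python
reads = ["read_a", "read_b"]

# zip() over a single iterable wraps every item in a 1-tuple
wrapped = list(zip(reads))

# iterating directly yields the items themselves
plain = [r for r in reads]

print(wrapped)  # [('read_a',), ('read_b',)]
print(plain)    # ['read_a', 'read_b']
```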

**sheng** · 06-20-2011, 04:25 AM

Cool! Thanks a lot for your reply and suggestions! lol
Will do!

**brentp** · 06-20-2011, 06:14 AM

Originally posted by dnusol View Post

Hi,
I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.

http://biostar.stackexchange.com/questions/6567/selecting-random-pairs-from-fastq

I wanted to try Brent's Python script, but I cannot work out whether I should use only the bit shown in the link or insert it into the original script that he mentions he wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)

Dave

Dave, you can use just the bit posted in that answer. You don't need the linked original script.

**jbar** · 09-14-2011, 04:02 AM

Hi Simon,

Thank you very much for a nice tool. Would it be possible to modify the script? I would appreciate being able to set the number of reads in the output rather than a fraction of the original file.

Best regards.
Jan
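
A fixed number of reads, rather than a fraction, can be drawn in a single pass with reservoir sampling (Algorithm R). The following is only a sketch of that technique and is not part of Simon's script: `reservoir_sample` is a hypothetical helper, and it holds the `k` selected items in memory, which should be fine for modest sample sizes.

```python
import random


def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace a slot with probability k / (i + 1)
    return reservoir
```

To keep mates together, the iterable should yield pairs, for example `zip(in1, in2)` over two read iterators, and both members of each selected pair are then written out. The selected items do not come back in input order, and if the input holds fewer than `k` items, all of them are returned.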

**JahnDavik** · 04-17-2013, 07:22 AM

sampling fw and rev

Another one from a biologist: Will this routine sample the pairs belonging to each other? As opposed to a random selection from each of the fw and rev files, I mean.

**Simon Anders** · 04-17-2013, 08:46 AM

Of course. Otherwise, it would be a bit pointless.

**haripriya** · 11-06-2013, 09:56 AM

Hi Simon,

When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

I am trying to extract a random sample of SE fastq files and am going to try your script.

thanks!

**Simon Anders** · 11-06-2013, 10:05 AM

Originally posted by haripriya View Post

When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

Sorry, I don't understand your question. Why should a fraction be between 0 and 100? Or do you mean a percentage?

If you want to sub-sample, e.g., a quarter of the reads, you use 0.25 as fraction.
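
Because the script keeps each read independently with probability `fraction`, the output size is only approximately `fraction` times the input size. A small simulation illustrates this; the read count of 100,000 and the seed are arbitrary choices for the example:

```python
import random

rng = random.Random(42)   # fixed seed only to make the run repeatable
n_reads = 100_000
fraction = 0.25

# same keep/drop rule as in the sub-sampling scripts: one draw per read
kept = sum(1 for _ in range(n_reads) if rng.random() < fraction)

print(kept, kept / n_reads)  # close to 25000 and 0.25, but not exact
```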

**haripriya** · 11-06-2013, 10:12 AM

Yes, sorry, my bad.

I was thinking about percentages.

**everestial** · 03-06-2016, 07:35 AM

This is still helpful!


random subset paired-end fastq
