SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Fastq format and Paired-end reads sandhya Bioinformatics 10 07-03-2013 04:24 AM
Extract subset of Fastq sequences based on a list of IDs pepperoni Bioinformatics 36 05-06-2013 01:38 AM
Illumina Paired End FASTQ kjsalimian Bioinformatics 2 01-05-2012 12:19 PM
Fastq: Paired end reads and mapping cedance Bioinformatics 7 06-18-2011 12:33 PM
paired end fastq format in bwa Protaeus Bioinformatics 4 12-09-2010 02:28 PM

Reply
 
Thread Tools
Old 06-15-2011, 07:33 AM   #1
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default random subset paired-end fastq

Hi,
I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.

http://biostar.stackexchange.com/que...irs-from-fastq

I have tried Aaron's and and Pierre's bash options but I run out of memory. I also tried Martin's R commands but I also run out of memory.

I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)

Can anyone help me with that?

Cheers,

Dave
dnusol is offline   Reply With Quote
Old 06-15-2011, 08:13 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

This sounds like a job for HTSeq:

Code:
import sys, random
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

while True:
   read1 = next( in1 )
   read2 = next( in2 )
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()
Save this as subsample.py and call it as
Code:
python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
(where <fraction> is a number between 0 and 1, giving the sampling faction)

Simon
Simon Anders is offline   Reply With Quote
Old 06-16-2011, 06:11 AM   #3
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Hi Simon, thanks for your help. I run the script and it seemed to work, the new files are generated, but I get this stderr:

Traceback (most recent call last):
File "/home/david/pyscripts/fastqPairedSubsample.py", line 15, in <module>
read1 = next( in1 )
File "/usr/local/lib/python2.6/dist-packages/HTSeq/__init__.py", line 381, in __iter__
id1 = fin.next()
StopIteration

Is that something I should worry about? What does it mean?

Cheers,

Dave
dnusol is offline   Reply With Quote
Old 06-16-2011, 07:04 AM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Dave

should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

Code:
 
import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

for read1, read2 in itertools.izip( in1, in2 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()

Last edited by Simon Anders; 06-16-2011 at 07:09 AM. Reason: made code nicer
Simon Anders is offline   Reply With Quote
Old 06-19-2011, 07:54 PM   #5
sheng
Junior Member
 
Location: NYC

Join Date: Apr 2011
Posts: 5
Talking Single-end version of random sampling?

Hi Simon,

Thanks a lot for presenting this python code for pair-end seq data random sampling. I am new to python and the code is very helpful.
I was wondering for single-end data, can I also use similar function and script? I tried to change your code into a single-end version, but it got an error:
Traceback (most recent call last):
File "subsample_se.py", line 10, in <module>
read1.write_to_fastq_file( out1 )
AttributeError: 'tuple' object has no attribute 'write_to_fastq_file'

Any idea about that? Thanks in advance!

The code is:
Code:
import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
out1 = open( sys.argv[3], "w" )

for read1 in itertools.izip( in1 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )

out1.close()

Quote:
Originally Posted by Simon Anders View Post
Hi Dave

should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

Code:
 
import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

for read1, read2 in itertools.izip( in1, in2 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()
sheng is offline   Reply With Quote
Old 06-19-2011, 11:29 PM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Try replacing this line

Code:
for read1 in itertools.izip( in1 ):
with

Code:
for read1 in in1:
and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

S
Simon Anders is offline   Reply With Quote
Old 06-20-2011, 04:25 AM   #7
sheng
Junior Member
 
Location: NYC

Join Date: Apr 2011
Posts: 5
Thumbs up

Cool! Thanks a lot for your reply and suggestions! lol
Will do!

Quote:
Originally Posted by Simon Anders View Post
Try replacing this line

Code:
for read1 in itertools.izip( in1 ):
with

Code:
for read1 in in1:
and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

S
sheng is offline   Reply With Quote
Old 06-20-2011, 06:14 AM   #8
brentp
Member
 
Location: salt lake city, UT

Join Date: Apr 2010
Posts: 72
Default

Quote:
Originally Posted by dnusol View Post
Hi,
I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.

http://biostar.stackexchange.com/que...irs-from-fastq
I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)


Dave
Dave, you can use just the bit posted in that answer. You dont need the linked original script.
brentp is offline   Reply With Quote
Old 09-14-2011, 04:02 AM   #9
jbar
Junior Member
 
Location: Czech Republic

Join Date: Jun 2011
Posts: 1
Default

Hi Simon,

Thank you very much for a nice tool. Is there any possibility to modify the script? I will apreciate, if I can set number of reads in output rather than fraction of original file.

Best regards.
Jan
jbar is offline   Reply With Quote
Old 04-17-2013, 07:22 AM   #10
JahnDavik
Junior Member
 
Location: Norway

Join Date: Aug 2012
Posts: 8
Default sampling fw and rev

Quote:
Originally Posted by Simon Anders View Post
Hi Dave

should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

Code:
 
import sys, random, itertools
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

for read1, read2 in itertools.izip( in1, in2 ):
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()

Another one from a biologist: Will this routing sample the pairs belonging to each other? As opposed to a random selection from each of the fw and rev files, I mean.
JahnDavik is offline   Reply With Quote
Old 04-17-2013, 08:46 AM   #11
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Of course. Otherwise, it would be a bit pointless.
Simon Anders is offline   Reply With Quote
Old 11-06-2013, 08:56 AM   #12
haripriya
Junior Member
 
Location: US

Join Date: Jun 2013
Posts: 2
Default

Quote:
Originally Posted by Simon Anders View Post
This sounds like a job for HTSeq:

Code:
import sys, random
import HTSeq

fraction = float( sys.argv[1] )
in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
out1 = open( sys.argv[4], "w" )
out2 = open( sys.argv[5], "w" )

while True:
   read1 = next( in1 )
   read2 = next( in2 )
   if random.random() < fraction:
      read1.write_to_fastq_file( out1 )
      read2.write_to_fastq_file( out2 )
      
out1.close()
out2.close()
Save this as subsample.py and call it as
Code:
python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
(where <fraction> is a number between 0 and 1, giving the sampling faction)

Simon
Hi Simon,

When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

I am trying to extract a random sample of SE fastq files and am going to try your script.

thanks!
haripriya is offline   Reply With Quote
Old 11-06-2013, 09:05 AM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by haripriya View Post
]
When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?
Sorry, I don't understand your question. Why should a fraction be between 0 and 100? Or do you mean a percentage?

If you want to sub-sample, e.g., a quarter of the reads, you use 0.25 as fraction.
Simon Anders is offline   Reply With Quote
Old 11-06-2013, 09:12 AM   #14
haripriya
Junior Member
 
Location: US

Join Date: Jun 2013
Posts: 2
Default

Yes sorry my bad I was thinking about percentages.
haripriya is offline   Reply With Quote
Old 03-06-2016, 06:35 AM   #15
everestial
Member
 
Location: North Carolina

Join Date: Feb 2015
Posts: 31
Default

This is still helpful !
everestial is offline   Reply With Quote
Old 04-17-2016, 02:36 AM   #16
boro2014
Junior Member
 
Location: Belgium

Join Date: Jan 2014
Posts: 3
Default

Hi all,

The script is straightforward and useful, but after applying it to a 40M PE file to randomly extract a fraction of 0.025 and fastqc my obtaining file containing ~1M reads to check of error or corruption then I obtain an error from fastqc
boro2014 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:37 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO