Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dnusol
    Senior Member
    • Jul 2009
    • 136

    random subset paired-end fastq

    Hi,
    I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.



    I have tried Aaron's and and Pierre's bash options but I run out of memory. I also tried Martin's R commands but I also run out of memory.

    I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)

    Can anyone help me with that?

    Cheers,

    Dave
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    This sounds like a job for HTSeq:

    Code:
    import sys, random
    import HTSeq
    
    fraction = float( sys.argv[1] )
    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
    out1 = open( sys.argv[4], "w" )
    out2 = open( sys.argv[5], "w" )
    
    while True:
       read1 = next( in1 )
       read2 = next( in2 )
       if random.random() < fraction:
          read1.write_to_fastq_file( out1 )
          read2.write_to_fastq_file( out2 )
          
    out1.close()
    out2.close()
    Save this as subsample.py and call it as
    Code:
    python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
    (where <fraction> is a number between 0 and 1, giving the sampling faction)

    Simon

    Comment

    • dnusol
      Senior Member
      • Jul 2009
      • 136

      #3
      Hi Simon, thanks for your help. I run the script and it seemed to work, the new files are generated, but I get this stderr:

      Traceback (most recent call last):
      File "/home/david/pyscripts/fastqPairedSubsample.py", line 15, in <module>
      read1 = next( in1 )
      File "/usr/local/lib/python2.6/dist-packages/HTSeq/__init__.py", line 381, in __iter__
      id1 = fin.next()
      StopIteration

      Is that something I should worry about? What does it mean?

      Cheers,

      Dave

      Comment

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Hi Dave

        should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

        Code:
         
        import sys, random, itertools
        import HTSeq
        
        fraction = float( sys.argv[1] )
        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
        out1 = open( sys.argv[4], "w" )
        out2 = open( sys.argv[5], "w" )
        
        for read1, read2 in itertools.izip( in1, in2 ):
           if random.random() < fraction:
              read1.write_to_fastq_file( out1 )
              read2.write_to_fastq_file( out2 )
              
        out1.close()
        out2.close()
        Last edited by Simon Anders; 06-16-2011, 07:09 AM. Reason: made code nicer

        Comment

        • sheng
          Junior Member
          • Apr 2011
          • 5

          #5
          Single-end version of random sampling?

          Hi Simon,

          Thanks a lot for presenting this python code for pair-end seq data random sampling. I am new to python and the code is very helpful.
          I was wondering for single-end data, can I also use similar function and script? I tried to change your code into a single-end version, but it got an error:
          Traceback (most recent call last):
          File "subsample_se.py", line 10, in <module>
          read1.write_to_fastq_file( out1 )
          AttributeError: 'tuple' object has no attribute 'write_to_fastq_file'

          Any idea about that? Thanks in advance!

          The code is:
          Code:
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          out1 = open( sys.argv[3], "w" )
          
          for read1 in itertools.izip( in1 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
          
          out1.close()

          Originally posted by Simon Anders View Post
          Hi Dave

          should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

          Code:
           
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
          out1 = open( sys.argv[4], "w" )
          out2 = open( sys.argv[5], "w" )
          
          for read1, read2 in itertools.izip( in1, in2 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
                read2.write_to_fastq_file( out2 )
                
          out1.close()
          out2.close()

          Comment

          • Simon Anders
            Senior Member
            • Feb 2010
            • 995

            #6
            Try replacing this line

            Code:
            for read1 in itertools.izip( in1 ):
            with

            Code:
            for read1 in in1:
            and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

            S

            Comment

            • sheng
              Junior Member
              • Apr 2011
              • 5

              #7
              Cool! Thanks a lot for your reply and suggestions! lol
              Will do!

              Originally posted by Simon Anders View Post
              Try replacing this line

              Code:
              for read1 in itertools.izip( in1 ):
              with

              Code:
              for read1 in in1:
              and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

              S

              Comment

              • brentp
                Member
                • Apr 2010
                • 72

                #8
                Originally posted by dnusol View Post
                Hi,
                I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.


                I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)


                Dave
                Dave, you can use just the bit posted in that answer. You dont need the linked original script.

                Comment

                • jbar
                  Junior Member
                  • Jun 2011
                  • 1

                  #9
                  Hi Simon,

                  Thank you very much for a nice tool. Is there any possibility to modify the script? I will apreciate, if I can set number of reads in output rather than fraction of original file.

                  Best regards.
                  Jan

                  Comment

                  • JahnDavik
                    Junior Member
                    • Aug 2012
                    • 8

                    #10
                    sampling fw and rev

                    Originally posted by Simon Anders View Post
                    Hi Dave

                    should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

                    Code:
                     
                    import sys, random, itertools
                    import HTSeq
                    
                    fraction = float( sys.argv[1] )
                    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                    out1 = open( sys.argv[4], "w" )
                    out2 = open( sys.argv[5], "w" )
                    
                    for read1, read2 in itertools.izip( in1, in2 ):
                       if random.random() < fraction:
                          read1.write_to_fastq_file( out1 )
                          read2.write_to_fastq_file( out2 )
                          
                    out1.close()
                    out2.close()

                    Another one from a biologist: Will this routing sample the pairs belonging to each other? As opposed to a random selection from each of the fw and rev files, I mean.

                    Comment

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      Of course. Otherwise, it would be a bit pointless.

                      Comment

                      • haripriya
                        Junior Member
                        • Jun 2013
                        • 2

                        #12
                        Originally posted by Simon Anders View Post
                        This sounds like a job for HTSeq:

                        Code:
                        import sys, random
                        import HTSeq
                        
                        fraction = float( sys.argv[1] )
                        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                        out1 = open( sys.argv[4], "w" )
                        out2 = open( sys.argv[5], "w" )
                        
                        while True:
                           read1 = next( in1 )
                           read2 = next( in2 )
                           if random.random() < fraction:
                              read1.write_to_fastq_file( out1 )
                              read2.write_to_fastq_file( out2 )
                              
                        out1.close()
                        out2.close()
                        Save this as subsample.py and call it as
                        Code:
                        python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
                        (where <fraction> is a number between 0 and 1, giving the sampling faction)

                        Simon
                        Hi Simon,

                        When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

                        I am trying to extract a random sample of SE fastq files and am going to try your script.

                        thanks!

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          Originally posted by haripriya View Post
                          ]
                          When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?
                          Sorry, I don't understand your question. Why should a fraction be between 0 and 100? Or do you mean a percentage?

                          If you want to sub-sample, e.g., a quarter of the reads, you use 0.25 as fraction.

                          Comment

                          • haripriya
                            Junior Member
                            • Jun 2013
                            • 2

                            #14
                            Yes sorry my bad I was thinking about percentages.

                            Comment

                            • everestial
                              Member
                              • Feb 2015
                              • 31

                              #15
                              This is still helpful !

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 10:09 AM
                              0 responses
                              9 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              15 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              24 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              20 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...