Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • random subset paired-end fastq

    Hi,
    I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.



    I have tried Aaron's and and Pierre's bash options but I run out of memory. I also tried Martin's R commands but I also run out of memory.

    I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)

    Can anyone help me with that?

    Cheers,

    Dave

  • #2
    This sounds like a job for HTSeq:

    Code:
    import sys, random
    import HTSeq
    
    fraction = float( sys.argv[1] )
    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
    out1 = open( sys.argv[4], "w" )
    out2 = open( sys.argv[5], "w" )
    
    while True:
       read1 = next( in1 )
       read2 = next( in2 )
       if random.random() < fraction:
          read1.write_to_fastq_file( out1 )
          read2.write_to_fastq_file( out2 )
          
    out1.close()
    out2.close()
    Save this as subsample.py and call it as
    Code:
    python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
    (where <fraction> is a number between 0 and 1, giving the sampling faction)

    Simon

    Comment


    • #3
      Hi Simon, thanks for your help. I run the script and it seemed to work, the new files are generated, but I get this stderr:

      Traceback (most recent call last):
      File "/home/david/pyscripts/fastqPairedSubsample.py", line 15, in <module>
      read1 = next( in1 )
      File "/usr/local/lib/python2.6/dist-packages/HTSeq/__init__.py", line 381, in __iter__
      id1 = fin.next()
      StopIteration

      Is that something I should worry about? What does it mean?

      Cheers,

      Dave

      Comment


      • #4
        Hi Dave

        should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

        Code:
         
        import sys, random, itertools
        import HTSeq
        
        fraction = float( sys.argv[1] )
        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
        out1 = open( sys.argv[4], "w" )
        out2 = open( sys.argv[5], "w" )
        
        for read1, read2 in itertools.izip( in1, in2 ):
           if random.random() < fraction:
              read1.write_to_fastq_file( out1 )
              read2.write_to_fastq_file( out2 )
              
        out1.close()
        out2.close()
        Last edited by Simon Anders; 06-16-2011, 07:09 AM. Reason: made code nicer

        Comment


        • #5
          Single-end version of random sampling?

          Hi Simon,

          Thanks a lot for presenting this python code for pair-end seq data random sampling. I am new to python and the code is very helpful.
          I was wondering for single-end data, can I also use similar function and script? I tried to change your code into a single-end version, but it got an error:
          Traceback (most recent call last):
          File "subsample_se.py", line 10, in <module>
          read1.write_to_fastq_file( out1 )
          AttributeError: 'tuple' object has no attribute 'write_to_fastq_file'

          Any idea about that? Thanks in advance!

          The code is:
          Code:
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          out1 = open( sys.argv[3], "w" )
          
          for read1 in itertools.izip( in1 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
          
          out1.close()

          Originally posted by Simon Anders View Post
          Hi Dave

          should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

          Code:
           
          import sys, random, itertools
          import HTSeq
          
          fraction = float( sys.argv[1] )
          in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
          in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
          out1 = open( sys.argv[4], "w" )
          out2 = open( sys.argv[5], "w" )
          
          for read1, read2 in itertools.izip( in1, in2 ):
             if random.random() < fraction:
                read1.write_to_fastq_file( out1 )
                read2.write_to_fastq_file( out2 )
                
          out1.close()
          out2.close()

          Comment


          • #6
            Try replacing this line

            Code:
            for read1 in itertools.izip( in1 ):
            with

            Code:
            for read1 in in1:
            and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

            S

            Comment


            • #7
              Cool! Thanks a lot for your reply and suggestions! lol
              Will do!

              Originally posted by Simon Anders View Post
              Try replacing this line

              Code:
              for read1 in itertools.izip( in1 ):
              with

              Code:
              for read1 in in1:
              and maybe have a look at the Python Tutorial; you'll see that it doesn't take that much time to learn Python. ;-)

              S

              Comment


              • #8
                Originally posted by dnusol View Post
                Hi,
                I found this page where they discussed different methods for selecting pairs of reads randomly from a set of fastq files.


                I wanted to try Brent's python script but I cannot understand whether I have to use only the bit shown in the link or I have to insert this bit in the original script he mentions wrote for single-end reads (link in the same answer: https://github.com/brentp/bio-playgr...mples/bench.py)


                Dave
                Dave, you can use just the bit posted in that answer. You dont need the linked original script.

                Comment


                • #9
                  Hi Simon,

                  Thank you very much for a nice tool. Is there any possibility to modify the script? I will apreciate, if I can set number of reads in output rather than fraction of original file.

                  Best regards.
                  Jan

                  Comment


                  • #10
                    sampling fw and rev

                    Originally posted by Simon Anders View Post
                    Hi Dave

                    should be fine, the error just indicates that it finished reading the input, and I forgot to catch it. Correction (untested, sorry):

                    Code:
                     
                    import sys, random, itertools
                    import HTSeq
                    
                    fraction = float( sys.argv[1] )
                    in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                    in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                    out1 = open( sys.argv[4], "w" )
                    out2 = open( sys.argv[5], "w" )
                    
                    for read1, read2 in itertools.izip( in1, in2 ):
                       if random.random() < fraction:
                          read1.write_to_fastq_file( out1 )
                          read2.write_to_fastq_file( out2 )
                          
                    out1.close()
                    out2.close()

                    Another one from a biologist: Will this routing sample the pairs belonging to each other? As opposed to a random selection from each of the fw and rev files, I mean.

                    Comment


                    • #11
                      Of course. Otherwise, it would be a bit pointless.

                      Comment


                      • #12
                        Originally posted by Simon Anders View Post
                        This sounds like a job for HTSeq:

                        Code:
                        import sys, random
                        import HTSeq
                        
                        fraction = float( sys.argv[1] )
                        in1 = iter( HTSeq.FastqReader( sys.argv[2] ) )
                        in2 = iter( HTSeq.FastqReader( sys.argv[3] ) )
                        out1 = open( sys.argv[4], "w" )
                        out2 = open( sys.argv[5], "w" )
                        
                        while True:
                           read1 = next( in1 )
                           read2 = next( in2 )
                           if random.random() < fraction:
                              read1.write_to_fastq_file( out1 )
                              read2.write_to_fastq_file( out2 )
                              
                        out1.close()
                        out2.close()
                        Save this as subsample.py and call it as
                        Code:
                        python subsample.py <fraction> <input file 1> <input file 2> <output file 1> <output file 2>
                        (where <fraction> is a number between 0 and 1, giving the sampling faction)

                        Simon
                        Hi Simon,

                        When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?

                        I am trying to extract a random sample of SE fastq files and am going to try your script.

                        thanks!

                        Comment


                        • #13
                          Originally posted by haripriya View Post
                          ]
                          When you say a number between 0 and 1, do you mean 0 and 100 instead, for fraction?
                          Sorry, I don't understand your question. Why should a fraction be between 0 and 100? Or do you mean a percentage?

                          If you want to sub-sample, e.g., a quarter of the reads, you use 0.25 as fraction.

                          Comment


                          • #14
                            Yes sorry my bad I was thinking about percentages.

                            Comment


                            • #15
                              This is still helpful !

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              33 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              34 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X