
  • Extract a subset of FASTQ reads based on sequence length?

    Hi,

    Does anyone have a Perl script to extract a subset of FASTQ sequences based on sequence length?

    thanks very much!!

  • #2
    Hi,

    I've written a Python script which should do this job for you.
    Unzip the attached gz files:
    Input.fastq.gz
    Filter_fastq_by_Sequence_length.py.gz

    The Input.fastq file contains 50 reads of varying lengths (22 bp, 33 bp, 36 bp and 41 bp). It is just a sample dataset.

    Run the following on the command line.
    For help:
    python Filter_fastq_by_Sequence_length.py -h

    Code:
    python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 -o Output.fastq

    Once the code has executed successfully,
    the Output.fastq file will contain 2 reads of 22 bp each.

    Try running it with lengths 33, 36, 41 and 0 to understand how the program works.

    Then you can run the script on your own input file and change the length.
    It should hopefully work.

    Let me know how it goes and in case you need any help.
    --
    Thanks
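
The attached script itself is not reproduced in this thread. As a rough illustration of the idea only, here is a minimal sketch of exact-length filtering. It assumes plain 4-line FASTQ records (no wrapped sequences), and the names `read_fastq` and `filter_by_length` are made up for this example rather than taken from the attachment.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples.

    Assumes every record occupies exactly four lines.
    """
    handle = iter(handle)  # works for open files and plain lists of lines
    while True:
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if len(record) < 4:  # end of input (or a truncated final record)
            return
        yield tuple(record)


def filter_by_length(records, length):
    """Keep only the records whose sequence is exactly `length` bases long."""
    return (rec for rec in records if len(rec[1]) == length)


# Possible usage (file names taken from the post above):
# with open("Input.fastq") as fin, open("Output.fastq", "w") as fout:
#     for rec in filter_by_length(read_fastq(fin), 22):
#         fout.write("\n".join(rec) + "\n")
```

Reading four lines at a time keeps memory use flat, so the same approach should cope with large files.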


    • #3
      subset fastq according to sequence lengths

      Hi,

      Thanks for your Python script.

      However, when I was trying to run it in my Mac (OSX) I got the following error message:

      d-128-54-196:PythonApps yb8d$ python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 Output.fastq
      Using Following inputs
      Input file is Input.fastq
      Seq_length is 22
      Output file is
      Filtering in Progress......
      Traceback (most recent call last):
      File "Filter_fastq_by_Sequence_length.py", line 58, in <module>
      filter_by_len(param[0],param[1],param[2])
      File "Filter_fastq_by_Sequence_length.py", line 6, in filter_by_len
      f=open(ofile,'w')
      IOError: [Errno 2] No such file or directory: ''

      Can you shed some light on what caused this error?

      Best

      Wing



      • #4
        You need an "-o" in front of "Output.fastq":

        Code:
        python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 -o Output.fastq



        • #5
          Hi Wing,

          Devon's solution is right, thanks.
          The script errored out because, without the "-o" switch, it could not
          identify the output file.

          Originally posted by dpryan
          You need an "-o" in front of "Output.fastq":

          Code:
          python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 -o Output.fastq

          Thanks
          --
          Muthu
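
How the attached script parses its options is not shown in the thread; the traceback only tells us the output file name ended up as an empty string. As a hedged sketch, this is one way a script could declare `-i`, `-l` and `-o` as required with the standard `argparse` module, so that a missing `-o` is reported up front instead of surfacing later as an `IOError`. The function name `parse_args` and the `dest` names are assumptions for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the three switches used in this thread; all are mandatory."""
    parser = argparse.ArgumentParser(
        description="Filter FASTQ reads by sequence length"
    )
    parser.add_argument("-i", dest="infile", required=True,
                        help="input FASTQ file")
    parser.add_argument("-l", dest="length", type=int, required=True,
                        help="read length to keep")
    parser.add_argument("-o", dest="outfile", required=True,
                        help="output FASTQ file")
    # With argv=None, argparse reads sys.argv[1:].
    return parser.parse_args(argv)
```

With this setup, running the command without `-o` exits with a usage message before any file is opened.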



          • #6
            Hi Muthu et al.,

            Thanks very much for the quick reply, and for picking up my stupid omission of the "-o" switch. After the fix, I am happy to report that everything works beautifully.

            Wing



            • #7
              Hi everyone,

              I have two FASTQ files of raw reads from an Ion PGM. Is it possible to get statistics on how many Q20 reads they contain, and to extract those reads in FASTQ format? Also, can I extract the 100-base reads using the script above?

              Thanks in advance for any help.

              Regards

              Chayan



              • #8
                Chayan,
                The script only lets you extract FASTQ sequences by length, not by quality.
                Hopefully you have figured that out by now. Sorry for the late reply.

                Thanks
                --
                Muthu



                • #9
                  BBTools has a script called reformat.sh which will allow extraction of reads with a minimum average quality of at least X (maq=X) or minimum read length of at least Y (minlength=Y). It can also write a histogram of the read qualities (aqhist=) using linear and logarithmic averages. Requires Java.

                  reformat.sh in=reads.fq out=filtered.fq maq=20 minlength=100 aqhist=hist.txt
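
For anyone who wants to see what such a filter amounts to, here is a small Python sketch. It is not how reformat.sh is implemented. It assumes Phred+33 encoded qualities, takes "Q20 read" to mean a read whose arithmetic mean Phred score is at least 20 (BBTools may average differently), and uses hypothetical names `mean_phred`, `passes` and `count_passing`.

```python
def mean_phred(quality, offset=33):
    """Arithmetic mean of the Phred scores in a FASTQ quality string."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes(seq, quality, min_mean_q=20, min_length=100):
    """True if the read is long enough and its mean quality is high enough."""
    return len(seq) >= min_length and mean_phred(quality) >= min_mean_q


def count_passing(records, min_mean_q=20, min_length=100):
    """Count (header, seq, plus, quality) records that pass both thresholds."""
    return sum(
        1 for _, seq, _, qual in records
        if passes(seq, qual, min_mean_q, min_length)
    )
```

The same `passes` test can be used to decide which records to write to a filtered output file.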



                  • #10
                    OK, thanks to both of you. Additionally, is there a tool or utility which allows k-mer based read extraction?



                    • #11
                      Depends on exactly what you have in mind, but I wrote a tool (BBDuk) that will filter reads based on the presence of specific kmers. For example:

                      bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq ref=kmers.fa k=31

                      That will split the file reads.fq into two output files, one containing reads with kmers matching the reference, and one with the rest of the reads, using a kmer length of 31.
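
To make the matched/unmatched split concrete, here is a toy Python sketch of the general idea: index every k-mer of the reference (and, as an assumption of this sketch, of its reverse complement), then send a read to "matched" if it shares at least one k-mer with the index. This is only an illustration and says nothing about how BBDuk works internally; all function names are hypothetical, and a plain Python set will not scale to large references.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all length-k substrings of seq, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference_seqs, k):
    """Collect k-mers from each reference sequence and its reverse complement."""
    index = set()
    for ref in reference_seqs:
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index


def split_reads(records, index, k):
    """Split (header, seq, plus, quality) records by k-mer presence."""
    matched, unmatched = [], []
    for rec in records:
        target = matched if kmers(rec[1], k) & index else unmatched
        target.append(rec)
    return matched, unmatched
```

Reads shorter than k have no k-mers at all, so in this sketch they always land in the unmatched list.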



                      • #12
                        OK, I understand, but I want a different kind of utility. I have metagenomic read files. Reads coming from a particular organism are likely to have similar k-mer frequencies (tetramers, say), and based on this criterion I want to extract the read subsets and then perform the assembly. Unfortunately I can't use any direct reference here, as I am looking for novel lineages. Is that clearer now?



                        • #13
                          Ahh, you want a binning tool. If you make a reference containing organisms that are somewhat closely related - say, at least 70% identity - you can use BBSplit. If not, well... there are various binning tools that use kmer frequency, or coverage, or both. But they don't tend to work well on short reads. I don't know of a single tool that will do a good job of solving this problem; I think it's generally addressed through a complicated pipeline involving a lot of labor.
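
As a hedged illustration of the k-mer frequency signal mentioned here, the sketch below computes a normalized tetramer profile for one sequence. It is not taken from any particular binning tool; `TETRAMERS` and `tetramer_profile` are names invented for this example, and the clustering step that a binner would run on such vectors is left out.

```python
from collections import Counter
from itertools import product

# All 256 possible tetramers over A, C, G, T, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_profile(seq):
    """Return a 256-element list of tetramer frequencies for one sequence.

    Windows containing anything other than A, C, G or T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]
```

A 100 bp read contributes only 97 tetramers spread over 256 possible bins, which gives some intuition for why such profiles are noisy on short reads.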



                          • #14
                            Filtering sequences in a range

                            Hi,

                            Thanks for the script above, it works really well for single lengths.
                            How could it be modified to keep a range of sequence lengths,
                            for example 22-33, and exclude the others?

                            Thanks
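
Since the attached script is not shown in the thread, it is hard to say exactly which line would need to change. As a standalone sketch under the assumption of 4-line FASTQ records, the range version of the check is just a two-sided comparison on the sequence length. The function name `filter_length_range` is made up for this example.

```python
from itertools import islice


def filter_length_range(lines, min_len, max_len):
    """Yield FASTQ records (lists of 4 raw lines) whose read length
    falls within [min_len, max_len], both ends inclusive."""
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:  # end of input
            return
        seq_len = len(record[1].rstrip("\r\n"))
        if min_len <= seq_len <= max_len:
            yield record


# Possible usage, keeping reads of 22 to 33 bases:
# with open("Input.fastq") as fin, open("Output.fastq", "w") as fout:
#     for record in filter_length_range(fin, 22, 33):
#         fout.writelines(record)
```

Setting `min_len` and `max_len` to the same value reproduces the single-length behaviour.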


                            (In reply to the script originally posted by muthu545 in #2 above.)
