  • Unified plain/gzip'ed fasta/fastq parser in C/C++ (for developers only)

    One of the little tricks in bwa is that it seamlessly reads plain and gzip-compressed input files in either fasta or fastq format. This is not a big advantage, but it really makes life easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.

    Developers interested in this feature can have a look at this page, which gives more details about this single-file (<200 lines), efficient, unified parser (it even handles multi-line fastq). Sorry, this sounds like another advertisement.
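For readers who want a feel for what a unified reader involves, here is a minimal sketch in plain C. It is not the parser from the linked page: the names `seq_rec`, `append_line`, and `read_record` are made up for illustration, the buffers are fixed-size, and it reads from a stdio `FILE *` only. The record type is taken from the header marker (`>` for fasta, `@` for fastq), and multi-line fastq is handled by reading quality lines until the quality string is as long as the sequence.

```c
#include <stdio.h>
#include <string.h>

#define REC_MAX 4096            /* fixed buffers keep the sketch short */

struct seq_rec {                /* hypothetical record type for this sketch */
    char name[256];
    char seq[REC_MAX];
    char qual[REC_MAX];         /* stays empty for fasta records */
};

/* Append the next line of fp to dst (capacity cap), dropping the newline.
 * Returns 1 on success, 0 at end of input, -1 if the line does not fit. */
static int append_line(FILE *fp, char *dst, size_t cap)
{
    size_t len = strlen(dst), got;
    if (cap - len < 2) return -1;
    if (fgets(dst + len, (int)(cap - len), fp) == NULL) {
        dst[len] = '\0';
        return 0;
    }
    got = strcspn(dst + len, "\r\n");
    if (dst[len + got] == '\0' && !feof(fp)) return -1;
    dst[len + got] = '\0';
    return 1;
}

/* Read one fasta or fastq record.
 * Returns 1 on a record, 0 at end of input, -1 on malformed input. */
static int read_record(FILE *fp, struct seq_rec *r)
{
    int c, is_fastq;

    r->name[0] = r->seq[0] = r->qual[0] = '\0';
    do { c = fgetc(fp); } while (c == '\n' || c == '\r');
    if (c == EOF) return 0;
    if (c != '>' && c != '@') return -1;
    is_fastq = (c == '@');
    if (append_line(fp, r->name, sizeof r->name) != 1) return -1;

    /* Sequence lines run until the next header, the '+' line, or EOF. */
    for (;;) {
        c = fgetc(fp);
        if (c == EOF) break;
        if (c == '>' || c == '@' || (is_fastq && c == '+')) {
            if (c != '+') ungetc(c, fp);    /* leave the next header unread */
            break;
        }
        ungetc(c, fp);
        if (append_line(fp, r->seq, sizeof r->seq) < 0) return -1;
    }
    if (!is_fastq) return 1;
    if (c != '+') return -1;                /* fastq record without qualities */

    /* Skip the rest of the '+' line, using qual as scratch space. */
    if (append_line(fp, r->qual, sizeof r->qual) < 0) return -1;
    r->qual[0] = '\0';

    /* Quality may start with '@', so count characters instead of
     * looking for the next header marker. */
    while (strlen(r->qual) < strlen(r->seq))
        if (append_line(fp, r->qual, sizeof r->qual) != 1) return -1;
    return strlen(r->qual) == strlen(r->seq) ? 1 : -1;
}
```

A caller would loop with `while (read_record(fp, &r) == 1)`. To get the transparent gzip handling described above, one option is to swap the stdio calls for their zlib counterparts (`gzopen`, `gzgets`, `gzgetc`, `gzungetc`), since zlib's read functions pass a file through unchanged when it is not in gzip format.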

  • #2
    Could you add to BWA the ability to read bz2 files using libbzip2 too? We use bzip2 because we can use pbzip2, the multi-threaded version of bzip2, which makes compression/decompression much faster on 8- and 16-core machines. I also think it would be more appropriate to read in a BWT-compressed file (bz2) and then use a BWT alignment algorithm (BWA).

    Now if we could only convince the vendors to output gzipped (or bzipped) FASTQ. It would also be nice if they could split the reads into chunks for parallel computation (say, 10M reads each from the 1B produced by a SOLiD). This would avoid another pre-processing step.



    • #3
      Originally posted by lh3
      This is not a big advantage, but it really makes life easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.
      While it's not a huge advantage, it may be a bigger advantage than some folks realize. Disk I/O is roughly 100x slower than memory, so reading a compressed file means you're probably reading much less off the disk. The extra CPU work for decompression still costs much less time than the additional disk I/O, so working directly with compressed files saves both runtime and disk space. The gains can often be much larger than might be naively expected.



      • #4
        @Niles

        Bzip2 is too slow. On decompression, it is several times slower than gzip, which is a lot. That is why BAM uses zlib rather than bzlib. I know their APIs are similar. Nonetheless, you can easily use my library to open bzip'ed input, although you cannot "seamlessly" open gzip'ed and bzip'ed files at the same time.
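On handling both formats: one way a caller might choose between a zlib reader and a bzlib reader is to look at the first bytes of the file. The sketch below is an assumption about how that could be done, not something taken from bwa or the parser; `sniff_kind`, `sniff_bytes`, and `sniff_file` are hypothetical names. It relies only on the well-known signatures: gzip streams start with the bytes 0x1f 0x8b, and bzip2 streams start with the ASCII characters `BZh`.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not part of bwa, zlib, or bzlib. */
enum sniff_kind { SNIFF_PLAIN = 0, SNIFF_GZIP, SNIFF_BZIP2 };

/* Classify a buffer holding the first n bytes of a file. */
static enum sniff_kind sniff_bytes(const unsigned char *buf, size_t n)
{
    if (n >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
        return SNIFF_GZIP;                  /* gzip magic number */
    if (n >= 3 && buf[0] == 'B' && buf[1] == 'Z' && buf[2] == 'h')
        return SNIFF_BZIP2;                 /* bzip2 stream header */
    return SNIFF_PLAIN;                     /* anything else: treat as text */
}

/* Read up to three bytes from path and classify them.
 * Returns a sniff_kind value, or -1 if the file cannot be opened. */
static int sniff_file(const char *path)
{
    unsigned char buf[3];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL) return -1;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return (int)sniff_bytes(buf, n);
}
```

The caller would then open the file with whichever library matches the result. This only works for regular files that can be opened twice; for input arriving on a pipe, the sniffed bytes would have to be handed on to the decompressor instead of being discarded.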

        @cariaso

        In most cases, we can wrap the aligner so that it takes gzip'ed files as input. If the aligner reads from STDIN, we use an anonymous pipe; if not, we use a named pipe, although this seems a bit complicated to most users.



        • #5
          What's the license if we go ahead and use this?
          Nevermind, it is included in the distribution!
          Last edited by nilshomer; 11-21-2009, 04:23 PM.



          • #6
            Originally posted by nilshomer
            Nevermind, it is included in the distribution!
            Just to make it clearer: it is distributed under the MIT/X11 license. Most of my source code is, and will be, released under this license.
