  • Unified plain/gzip'ed fasta/fastq parser in C/C++ (for developers only)

    One of the little tricks in bwa is that it seamlessly reads plain and gzip-compressed input files in either fasta or fastq format. This is not a big advantage, but it really makes things easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.

    If any developers are interested in this feature, they may have a look at this page, which gives more details about this efficient, unified, single-file (<200 lines) parser (it even handles multi-line fastq). Sorry, this sounds like another advertisement.
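A minimal sketch of how such a unified reader might work, assuming nothing about the actual parser's API: the names `str_t`, `reader_t`, `read_record`, and `reader_free` are hypothetical, and a plain `FILE *` stands in for a gzip-aware stream. A '>' or '@' header starts a record, sequence lines are concatenated until a line begins with '>', '@', or '+', and for fastq the quality is read until it is as long as the sequence. That length rule is what lets a multi-line fastq record be parsed even when a quality line begins with '@'.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Growable string. All names in this sketch are illustrative only. */
typedef struct { char *s; size_t len, cap; } str_t;

typedef struct {
    str_t name, seq, qual; /* qual.len == 0 for fasta records */
    int last;              /* header character already consumed, or 0 */
} reader_t;

static void str_push(str_t *b, int c) {
    if (b->len + 2 > b->cap) {           /* room for c and the NUL */
        b->cap = b->cap ? b->cap * 2 : 64;
        b->s = realloc(b->s, b->cap);
        if (!b->s) abort();
    }
    b->s[b->len++] = (char)c;
    b->s[b->len] = '\0';
}

static void str_clear(str_t *b) {
    b->len = 0;
    if (b->s) b->s[0] = '\0';
}

/* Returns 1 on a record, 0 at end of input, -1 if the quality
 * string does not match the sequence length. */
static int read_record(FILE *fp, reader_t *r) {
    int c, bol = 1;
    assert(fp && r);
    /* 1. Find the next header unless the previous call already consumed it. */
    if (r->last == 0) {
        while ((c = fgetc(fp)) != EOF && c != '>' && c != '@') {}
        if (c == EOF) return 0;
    }
    r->last = 0;
    str_clear(&r->name); str_clear(&r->seq); str_clear(&r->qual);
    /* 2. The name is the rest of the header line. */
    while ((c = fgetc(fp)) != EOF && c != '\n')
        if (c != '\r') str_push(&r->name, c);
    /* 3. Concatenate sequence lines until a line starts with '>', '@' or '+'. */
    while ((c = fgetc(fp)) != EOF) {
        if (bol && (c == '>' || c == '@' || c == '+')) break;
        bol = (c == '\n');
        if (c != '\n' && c != '\r') str_push(&r->seq, c);
    }
    if (c == '>' || c == '@') { r->last = c; return 1; } /* fasta; next header seen */
    if (c == EOF) return 1;                              /* fasta record at end of input */
    /* 4. fastq: skip the rest of the '+' line. */
    while ((c = fgetc(fp)) != EOF && c != '\n') {}
    /* 5. Read quality until it is as long as the sequence. */
    while (r->qual.len < r->seq.len && (c = fgetc(fp)) != EOF)
        if (c != '\n' && c != '\r') str_push(&r->qual, c);
    return r->qual.len == r->seq.len ? 1 : -1;
}

static void reader_free(reader_t *r) {
    free(r->name.s); free(r->seq.s); free(r->qual.s);
    memset(r, 0, sizeof *r);
}
```

A caller would zero-initialize a `reader_t`, call `read_record` in a loop until it stops returning 1, and finish with `reader_free`. To get the plain-or-gzip behaviour described in the post, the `fgetc` calls could be swapped for zlib's `gzgetc` on a `gzFile` opened with `gzopen`, since zlib's gz functions also read uncompressed files transparently.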

  • #2
    Could you add to BWA the ability to read bz2 files using libbzip2 too? We use bzip2 because pbzip2, the multi-threaded version of bzip2, makes compression/decompression much faster on 8- and 16-core machines. I also think it would be more appropriate to read in a BWT-compressed file (bz2) and then use a BWT alignment algorithm (BWA).

    Now if we could only convince the vendors to output to gzipped (or bzipped) FASTQ format. It would also be nice if they could split the reads into chunks for parallel computation (say 10M from the 1B produced by a SOLiD). This would avoid another pre-processing step.


    • #3
      Originally posted by lh3
      This is not a big advantage, but it really makes things easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.
      While it's not a huge advantage, it may be a bigger advantage than some folks realize. Disk I/O is roughly 100x slower than memory, so reading a compressed file means you're probably reading much less off the disk. The extra CPU work for decompression still costs much less than the additional disk I/O, so working directly with compressed files saves both runtime and disk space. The gains can often be much larger than might be naively expected.


      • #4
        @Niles

        Bzip2 is too slow. On decompression, it is several times slower than gzip, which is a lot. That is why in BAM we use zlib rather than bzlib. I know the two libraries have similar APIs. Nonetheless, you can easily use my library to open bzip'ed input, although you cannot "seamlessly" open gzip'ed and bzip'ed files at the same time.
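For readers who do want to accept both formats, one common workaround (not something the library above is stated to do) is to sniff the first bytes of the input and pick the decompressor accordingly: gzip streams begin with the bytes 0x1f 0x8b, and bzip2 streams begin with "BZh". A minimal sketch, where `input_format_t` and `sniff_format` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FMT_PLAIN, FMT_GZIP, FMT_BZIP2 } input_format_t;

/* Classify an input by its leading magic bytes.
 * buf holds the first n bytes of the file. */
static input_format_t sniff_format(const unsigned char *buf, size_t n) {
    assert(buf != NULL || n == 0);
    if (n >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
        return FMT_GZIP;                      /* gzip magic number */
    if (n >= 3 && buf[0] == 'B' && buf[1] == 'Z' && buf[2] == 'h')
        return FMT_BZIP2;                     /* bzip2 stream header */
    return FMT_PLAIN;                         /* anything else: treat as text */
}
```

A caller would read the first three bytes of a regular file, call `sniff_format`, seek back to the start, and hand the file to a zlib-based or libbzip2-based reader. This only works on seekable input, so it does not apply directly to data arriving on a pipe.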

        @cariaso

        In most cases, we can wrap the aligner to make it take gzip'ed files as input. If the aligner reads from STDIN, we use an anonymous pipe; if not, we use a named pipe, although this seems a bit complicated to most users.
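The anonymous-pipe idea can also be applied from inside a C program. The sketch below assumes a POSIX system (for `popen`) and a decompressor such as `gzip` on the PATH. `open_through_filter` is a hypothetical helper, not part of any library mentioned in this thread, and it simply refuses paths it cannot quote safely.

```c
#define _POSIX_C_SOURCE 200809L /* for popen/pclose */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: run `filter < 'path'` in a shell and return a
 * stream connected to its standard output. Close it with pclose(). */
static FILE *open_through_filter(const char *filter, const char *path) {
    assert(filter && path);
    if (strchr(path, '\''))          /* sketch only: no quote escaping */
        return NULL;
    size_t n = strlen(filter) + strlen(path) + 8;
    char *cmd = malloc(n);
    if (!cmd) return NULL;
    snprintf(cmd, n, "%s < '%s'", filter, path);
    FILE *fp = popen(cmd, "r");      /* anonymous pipe to the child process */
    free(cmd);
    return fp;
}
```

Calling `open_through_filter("gzip -dc", path)` would yield a stream of decompressed text that any `FILE *`-based parser can read, at the cost of one extra process. The named-pipe variant mentioned above is set up outside the program instead, so the aligner itself needs no changes.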


        • #5
          What's the license if we go ahead and use this?
          Never mind, it is included in the distribution!
          Last edited by nilshomer; 11-21-2009, 04:23 PM.


          • #6
            Originally posted by nilshomer
            Never mind, it is included in the distribution!
            Just to make it clearer: it is distributed under the MIT/X11 license. Most of my source code is, and will be, released under this license.
