  • Unified plain/gzip'ed fasta/fastq parser in C/C++ (for developers only)

    One of the little tricks in bwa is that it seamlessly reads plain and gzip-compressed input files in either fasta or fastq format. This is not a big advantage, but it really makes things easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.

    If any developers are interested in this feature, they may have a look at this page, which gives more details about this single-file (<200 lines), efficient, unified parser (it even handles multi-line fastq). Sorry, this sounds like another advertisement.
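For readers who want to see the idea without opening the C source, here is a minimal sketch in Python. It is not the parser from the linked page; the function names `open_maybe_gzip` and `read_fastx` are made up for this illustration. It assumes gzip input can be recognized by its two magic bytes (0x1f 0x8b), and it handles multi-line fastq by reading quality lines until their total length matches the sequence length. In C, zlib's `gzread` gives the plain/gzip part for free, since it passes non-gzip input through unchanged.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open path as text, decompressing on the fly if it looks like gzip."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastx(handle):
    """Yield (name, sequence, quality); quality is None for fasta records."""
    line = handle.readline()
    while line:
        line = line.rstrip("\r\n")
        if not line or line[0] not in ">@":
            line = handle.readline()  # skip blank or unexpected lines
            continue
        fields = line[1:].split(None, 1)
        name = fields[0] if fields else ""
        is_fastq = line[0] == "@"
        # The sequence may span several lines; it ends at '+', a new header, or EOF.
        seq_parts = []
        line = handle.readline()
        while line and line[0] not in ">@+":
            seq_parts.append(line.strip())
            line = handle.readline()
        seq = "".join(seq_parts)
        if not (is_fastq and line and line[0] == "+"):
            yield name, seq, None
            continue
        # Quality lines may start with '@' or '+', so stop on length, not on content.
        qual_parts, qual_len = [], 0
        line = handle.readline()
        while line and qual_len < len(seq):
            chunk = line.strip()
            qual_parts.append(chunk)
            qual_len += len(chunk)
            line = handle.readline()
        yield name, seq, "".join(qual_parts)


# Usage: the same records come back from a plain file and its gzip'ed copy.
data = "@r1 first read\nACGT\nAC\n+\nIIII\n@I\n>c1\nAC\nGT\n"
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "reads.fq")
    gz_path = plain_path + ".gz"
    with open(plain_path, "w") as fh:
        fh.write(data)
    with gzip.open(gz_path, "wt") as fh:
        fh.write(data)
    with open_maybe_gzip(plain_path) as fh:
        plain_records = list(read_fastx(fh))
    with open_maybe_gzip(gz_path) as fh:
        gz_records = list(read_fastx(fh))
print(plain_records == gz_records)
```

Note that the second quality line in the test data starts with `@`, which is why the sketch counts quality characters instead of looking for the next header.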

  • #2
    Could you add to BWA the ability to read bz2 files using libbzip2 too? We use bzip2 because we can use pbzip2, the multi-threaded version of bzip2, which makes compression/decompression much faster on 8- and 16-core machines. I also think it would be more appropriate to read in a BWT-compressed file (bz2) and then use a BWT alignment algorithm (BWA).

    Now if we could only convince the vendors to output gzipped (or bzipped) FASTQ. It would also be nice if they could split the reads into chunks for parallel computation (say, 10M each from the 1B produced by a SOLiD). This would avoid another pre-processing step.

    • #3
      Originally posted by lh3
      This is not a big advantage, but it really makes things easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.
      While it's not a huge advantage, it may be a bigger advantage than some folks realize. Disk I/O is roughly 100x slower than memory, so reading a compressed file means you're probably reading much less off the disk. The extra CPU work for decompression usually costs much less time than the additional disk I/O would, so working directly with compressed files saves both runtime and disk space. The gains can often be much larger than might be naively expected.

      • #4
        @Niles

        Bzip2 is too slow. On decompression, it is several times slower than gzip, which is a lot. That is why in BAM we use zlib rather than bzlib, even though I know their APIs are similar. Nonetheless, you can easily use my library to open bzip'ed input, although you cannot "seamlessly" open gzip'ed and bzip'ed files at the same time.
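Anyone who wants to check the decompression gap on their own machine could start from a rough sketch like the one below. It uses Python's standard `gzip` and `bz2` modules on a synthetic fastq-like payload, so the numbers it prints are only indicative; real reads and real quality strings will compress and decompress differently. The helper name `time_decompress` is made up for this example.

```python
import bz2
import gzip
import random
import time


def time_decompress(decompress, blob, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` decompressions."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic payload (an assumption of this sketch, not real sequencing data).
random.seed(0)
records = []
for i in range(5000):
    seq = "".join(random.choice("ACGT") for _ in range(50))
    qual = "".join(random.choice("FGHI") for _ in range(50))
    records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
raw = "".join(records).encode("ascii")

gz_blob = gzip.compress(raw)
bz_blob = bz2.compress(raw)
gz_time = time_decompress(gzip.decompress, gz_blob)
bz_time = time_decompress(bz2.decompress, bz_blob)

print(f"raw  : {len(raw)} bytes")
print(f"gzip : {len(gz_blob)} bytes, decompress {gz_time:.4f} s")
print(f"bzip2: {len(bz_blob)} bytes, decompress {bz_time:.4f} s")
```

For a meaningful comparison, replace the synthetic payload with the bytes of a real fastq file and increase `repeats`.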

        @cariaso

        In most cases, we can wrap the aligner so that it takes gzip'ed files as input. If the aligner reads from STDIN, we use an anonymous pipe; if not, we use a named pipe, although this seems a bit complicated to most users.
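The anonymous-pipe case can be sketched as follows. This is only an illustration under stated assumptions: `run_with_gunzipped_stdin` is a hypothetical helper, and `fake_aligner` is a stand-in child process that just counts the lines on its STDIN, not a real aligner command line. The named-pipe case is not shown.

```python
import gzip
import os
import shutil
import subprocess
import sys
import tempfile
import threading


def run_with_gunzipped_stdin(cmd, gz_path):
    """Run cmd, stream the decompressed contents of gz_path to its STDIN,
    and return whatever the command wrote to STDOUT."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        # Decompress in a separate thread so that reading the child's output
        # below cannot deadlock against a full pipe.
        try:
            with gzip.open(gz_path, "rb") as src:
                shutil.copyfileobj(src, proc.stdin)
            proc.stdin.close()
        except BrokenPipeError:
            pass  # the child stopped reading early

    feeder = threading.Thread(target=feed)
    feeder.start()
    with proc.stdout:
        output = proc.stdout.read()
    feeder.join()
    proc.wait()
    return output


# Stand-in for an aligner that reads from STDIN: it only counts input lines.
fake_aligner = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "reads.fq.gz")
    with gzip.open(gz_path, "wt") as fh:
        fh.write("@r1\nACGT\n+\nIIII\n")
    line_count = run_with_gunzipped_stdin(fake_aligner, gz_path)
print(line_count.decode().strip())
```

The decompressed data never touches the disk; it only flows through the pipe between the two processes.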

        • #5
          What's the license if we go ahead and use this?
          Never mind, it is included in the distribution!
          Last edited by nilshomer; 11-21-2009, 04:23 PM.

          • #6
            Originally posted by nilshomer
            Never mind, it is included in the distribution!
            Just to make it clearer: it is distributed under the MIT/X11 license. Most of my source code is and will be released under this license.
