SEQanswers

Old 11-19-2009, 12:59 PM   #1
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Unified plain/gzip'ed fasta/fastq parser in C/C++ (for developers only)

One of the little tricks in bwa is that it seamlessly reads plain and gzip-compressed input files in either fasta or fastq format. This is not a big advantage, but it really makes life easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.

If any developers are interested in this feature, they may have a look at this page, which gives more details about this single-file (<200 lines), efficient and unified parser (it even handles multi-line fastq). Sorry, this sounds like another advertisement.
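To illustrate the idea for readers who have not seen it, here is a minimal sketch, written in Python rather than C for brevity. It is not the parser from the linked page; the names `open_maybe_gzip` and `read_seqs` are made up for this example. It assumes the input is a regular file on disk so the first two bytes can be checked for the gzip magic number, and it handles multi-line fastq by reading quality lines until they are as long as the sequence.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file as a text stream (hypothetical helper)."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_seqs(stream):
    """Yield (name, sequence, quality); quality is None for fasta records."""
    line = stream.readline()
    while line:
        line = line.rstrip("\r\n")
        if not line or line[0] not in ">@":
            line = stream.readline()  # skip anything that is not a header
            continue
        fields = line[1:].split(None, 1)
        name = fields[0] if fields else ""
        # Sequence lines run until the next '>', '@' or '+' line.
        seq_parts = []
        line = stream.readline()
        while line and line[0] not in ">@+":
            seq_parts.append(line.strip())
            line = stream.readline()
        seq = "".join(seq_parts)
        qual = None
        if line and line[0] == "+":
            # Quality may start with '@', so rely on its length, not its first byte.
            qual_parts, qual_len = [], 0
            line = stream.readline()
            while line and qual_len < len(seq):
                chunk = line.strip()
                qual_parts.append(chunk)
                qual_len += len(chunk)
                line = stream.readline()
            qual = "".join(qual_parts)
        yield name, seq, qual
```

Usage would look like `with open_maybe_gzip("reads.fq.gz") as fh:` followed by `for name, seq, qual in read_seqs(fh): ...`, where the file name is only a placeholder.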
Old 11-19-2009, 02:17 PM   #2
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by lh3
One of the little tricks in bwa is that it seamlessly reads plain and gzip-compressed input files in either fasta or fastq format. This is not a big advantage, but it really makes life easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.

If any developers are interested in this feature, they may have a look at this page, which gives more details about this single-file (<200 lines), efficient and unified parser (it even handles multi-line fastq). Sorry, this sounds like another advertisement.
Could you add to BWA the ability to read bz2 files using libbzip2 too? We use bzip2 because we can use pbzip2, the multi-threaded version of bzip2, which makes compression/decompression much faster on 8- and 16-core machines. I also think it would be more appropriate to read in a BWT-compressed file (bz2) and then use a BWT alignment algorithm (BWA).

Now if we could only convince the vendors to output gzipped (or bzipped) FASTQ. It would also be nice if they could split the reads into chunks for parallel computation (say 10M from the 1B produced by a SOLiD). This would avoid another pre-processing step.
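Until vendors do that, the chunking step itself is small. A rough sketch in Python, assuming a gzip'ed input with strictly four lines per read (no multi-line fastq); the function name `split_fastq` and the output naming scheme are invented for this example:

```python
import gzip
from itertools import islice


def split_fastq(in_path, out_prefix, reads_per_chunk):
    """Split a gzip'ed 4-line FASTQ into gzip'ed chunks and return their paths."""
    paths = []
    with gzip.open(in_path, "rt") as src:
        while True:
            # Pull the next block of reads without loading the whole file.
            lines = list(islice(src, 4 * reads_per_chunk))
            if not lines:
                break
            path = f"{out_prefix}.{len(paths):04d}.fastq.gz"
            with gzip.open(path, "wt") as dst:
                dst.writelines(lines)
            paths.append(path)
    return paths
```

Each chunk can then be handed to a separate alignment job. The last chunk simply holds whatever reads are left over.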
Old 11-19-2009, 02:39 PM   #3
cariaso
Member
 
Location: Wageningen, the Netherlands

Join Date: Jan 2008
Posts: 31

Quote:
Originally Posted by lh3
This is not a big advantage, but it really makes life easier for users, at least for me, because nearly all fastq files are gzip-compressed and we never decompress them to hard disk.
While it's not a huge advantage, it may be a bigger advantage than some folks realize. Disk IO is roughly 100x slower than memory, so reading a compressed file means you're probably reading much less off the disk. The extra CPU work for the decompression is still much faster than the additional disk IO, so working directly with compressed files saves runtime and disk space. The gains can often be much larger than might be naively expected.
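The argument can be put as a back-of-the-envelope model. The sketch below is in Python and assumes disk reads and decompression overlap, so the slower of the two sets the wall time. The function name and every number in the example are hypothetical and chosen only to make the arithmetic easy; they are not measurements.

```python
def read_seconds(raw_bytes, disk_bps, ratio=1.0, decompress_bps=None):
    """Estimate the time to stream a file.

    ratio is compressed size / raw size; decompress_bps is the rate at which
    the decompressor produces raw bytes (None means the file is not compressed).
    """
    disk_time = raw_bytes * ratio / disk_bps
    if decompress_bps is None:
        return disk_time
    cpu_time = raw_bytes / decompress_bps
    return max(disk_time, cpu_time)


# Hypothetical figures: 10 GB of raw fastq, a 100 MB/s disk,
# 4:1 compression, decompressor emitting 200 MB/s.
plain = read_seconds(10_000_000_000, 100_000_000)
gzipped = read_seconds(10_000_000_000, 100_000_000, 0.25, 200_000_000)
```

With these made-up figures the plain read takes 100 seconds and the gzip'ed read 50 seconds. The compressed read wins whenever the decompressor emits data faster than the disk can deliver the uncompressed file.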
Old 11-19-2009, 05:31 PM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

@Nils

Bzip2 is too slow. On decompression, it is several times slower than gzip, which is a lot. That is why in BAM we use zlib rather than bzlib. I know their APIs are similar. Nonetheless, you can easily use my library to open bzip'ed input, although you cannot "seamlessly" open gzip'ed and bzip'ed files at the same time.
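For anyone who does want one entry point for both formats, a common approach is to look at the leading magic bytes and pick the decompressor from them. This is a sketch in Python, not something the library above does; `open_compressed` is an invented name, and it assumes the input is a regular file that can be opened twice (it will not work on a pipe).

```python
import bz2
import gzip


def open_compressed(path):
    """Open plain, gzip'ed or bzip2'ed text based on the file's magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(3)
    if magic[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    if magic == b"BZh":  # bzip2 magic number
        return bz2.open(path, "rt")
    return open(path, "rt")
```

The decompression speed difference described above still applies; this only removes the need to tell the program which format it is getting.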

@cariaso

In most cases, we can wrap the aligner to make it take gzip'ed files as input. If the aligner reads from STDIN, we use an anonymous pipe; if not, we use a named pipe, although this seems a bit complicated to most users.
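Both wrappers can be sketched in a few lines. The example below is Python on a POSIX system; `run_with_stdin` and `run_with_fifo` are invented names, and `cmd` stands for whatever aligner command line is being wrapped, given as a list of arguments. The named-pipe version assumes the tool actually opens the path it is given; if it exits without doing so, the feeding thread will block.

```python
import gzip
import os
import shutil
import subprocess
import tempfile
import threading


def run_with_stdin(cmd, gz_path, out_path):
    """Anonymous pipe: stream decompressed data into the tool's standard input."""
    with open(out_path, "wb") as out, gzip.open(gz_path, "rb") as src:
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=out)
        shutil.copyfileobj(src, proc.stdin)
        proc.stdin.close()  # signal end of input
        return proc.wait()


def run_with_fifo(cmd, gz_path, out_path):
    """Named pipe: for tools that only accept a file path as their input."""
    tmpdir = tempfile.mkdtemp()
    fifo = os.path.join(tmpdir, "reads.fastq")
    os.mkfifo(fifo)

    def feed():
        # Opening the FIFO for writing blocks until the tool opens it for reading.
        with gzip.open(gz_path, "rb") as src, open(fifo, "wb") as dst:
            shutil.copyfileobj(src, dst)

    writer = threading.Thread(target=feed)
    writer.start()
    try:
        with open(out_path, "wb") as out:
            status = subprocess.call(cmd + [fifo], stdout=out)
        writer.join()
    finally:
        shutil.rmtree(tmpdir)
    return status
```

In neither case is the decompressed data written to disk, which is the point of both approaches.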
Old 11-21-2009, 04:19 PM   #5
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by lh3
@Nils

Bzip2 is too slow. On decompression, it is several times slower than gzip, which is a lot. That is why in BAM we use zlib rather than bzlib. I know their APIs are similar. Nonetheless, you can easily use my library to open bzip'ed input, although you cannot "seamlessly" open gzip'ed and bzip'ed files at the same time.

@cariaso

In most cases, we can wrap the aligner to make it take gzip'ed files as input. If the aligner reads from STDIN, we use an anonymous pipe; if not, we use a named pipe, although this seems a bit complicated to most users.
Quote:
What's the license if we go ahead and use this?
Never mind, it is included in the distribution!

Last edited by nilshomer; 11-21-2009 at 04:23 PM.
Old 11-21-2009, 04:30 PM   #6
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Quote:
Originally Posted by nilshomer
Never mind, it is included in the distribution!
Just to make it clearer: it is distributed under the MIT/X11 license. Most of my source code is and will be released under this license.
All times are GMT -8.

