SEQanswers

Old 10-21-2011, 04:45 AM   #1
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default FASTQ must die! Long live SAM/BAM!

One of the ideas mentioned on the SEQanswers letter thread was about linking blog content and discussion back to SEQanswers, so...

I've just blogged about why I think we as a community should try to move away from FASTQ as a file format for unaligned reads and use SAM/BAM instead. The post is titled "FASTQ must die! Long live SAM/BAM!", and I suggest people comment on this thread rather than on the blog.

This is partly because I don't seem to have got my blog comment settings right anyway.
Old 10-21-2011, 10:52 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

I'm not sure if there is much to say. Fewer formats in bioinformatics would be good. Programs that read and write to all common formats would be good. BAM/SAM is, as far as I can tell, a good enough format. We will have to see if incompatibilities pop up during the next couple of years.
Old 10-21-2011, 01:17 PM   #3
camelbbs
Member
 
Location: United States

Join Date: Jun 2011
Posts: 49
Default

I want to ask a question about bam files.

I have two sequencing libraries from the same sample, which gave me two fastq files with read lengths of 50 bp and 36 bp respectively.
When I run tophat I need to specify -r for each library, so I cannot combine the two fastq files. But once I have the accepted.bam files, can I combine them (the bam files) with samtools merge?

Thanks, everyone.
Old 10-21-2011, 01:58 PM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by camelbbs View Post
I want to ask a question about bam files.
I was going to recommend asking in a new thread, but you've already done that:
http://seqanswers.com/forums/showthread.php?t=14952
Old 10-22-2011, 12:12 AM   #5
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869
Default

Whilst I appreciate the sentiments of your argument for getting rid of fastq format, I tend to disagree.

I guess my main objections would be:

1) I like having a separation of primary data and derived data. FastQ is primary data which is never going to change. BAM/SAM is derived data which might change if you use a different read mapper, genome assembly etc.

2) I like simple plain text formats. FastQ, for all of its failings (and it certainly has those!), is a simple format which is easy to parse and deal with. SAM/BAM is much harder to get your head around. Realistically you need to use an existing library to do anything with a BAM/SAM file due to the complexities of the format.

3) FastQ is more future-proof. Because FastQ format makes no assumptions about the structure of your experiments (precisely because it contains no metadata) it makes very few assumptions about what your data is going to look like in the future. If you look at the recent changes to BAM format to get around the previous assumption of only ever having a maximum of two reads per sequence then you can see how this might go wrong in future.

We use BAM format all the time, but it's not a format I particularly like working with. You mentioned the flag field in your blog which must single-handedly have caused more trouble than any other format design decision ever made in bioinformatics! I can see the appeal of the format, but the field is still undergoing such rapid change I can see that it's probably not finished yet.
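
To make point 2 concrete, here is a minimal sketch of how little code a FASTQ reader needs. It assumes the common strict layout of exactly four lines per record with no line wrapping, which not every historical FASTQ file follows; the function name read_fastq is just an illustrative choice.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream.

    Assumption: sequences and qualities are not wrapped over several lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: " + header.strip())
        yield header[1:].rstrip("\n"), seq, qual


# The second record has a quality string starting with "@", which is legal
# and is handled here because records are counted by line, not by marker.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGG\n+\n@!\n")
records = list(read_fastq(example))
```

Counting lines rather than searching for "@" matters because "@" is also a valid quality character.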
Old 10-22-2011, 04:16 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Hi Simon,

Thanks for your comments. You raise some good points, but I don't agree with them all.

(1) Editing of FASTQ files happens already though (quality trimming, filtering, etc) so there is no clear separation between primary data and derived data.

(2) Given how big sequence data files are getting, it is increasingly impractical to work with them as plain text (not so bad for viruses though). You can do plenty with SAM at the Unix command line, and the fact that it is one line per read actually helps. For anything non-trivial, yes, a SAM/BAM library helps.

(3) From a long term data archive point of view, going through all the SAM/BAM format revisions to try and understand what an old file means might be hard. But try extracting the meta data from a FASTQ file, where there are 101 different filename, header or read naming conventions, many of them undocumented.

(unnumbered 4) I agree the representation of the FLAG in SAM as a single (decimal) integer was probably the worst design choice in the format. Even an eight character string of 0s and 1s would have been easier to understand. However, it is done, and changing it will only break things - and only benefit people working on the files directly with scripts and Unix one-line magic. If you're using a SAM/BAM library this should map the FLAG bits for you.
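
As a rough illustration of what "mapping the FLAG bits" means, the sketch below decodes the decimal FLAG into named bits. The bit values are those listed in the SAM specification (the 0x800 supplementary bit was added after this thread started); the short names are my own labels, not anything a particular library uses.

```python
from typing import List

# Bit values from the SAM specification; the names are illustrative labels.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "last_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def flag_as_bit_string(flag: int) -> str:
    """Render the FLAG as a fixed-width string of 0s and 1s (12 bits)."""
    return format(flag, "012b")


# 77 and 141 are the usual FLAG values for an unaligned read pair.
print(decode_flag(77))
print(decode_flag(141))
print(flag_as_bit_string(99))
```

With the bits named like this, the two reads of an unaligned pair differ only in first_in_pair versus last_in_pair.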

And I agree things will change (e.g. maybe one day we will see SAM/BAM move to HDF5 rather than the homegrown BGZF used now).

Peter

Last edited by maubp; 10-22-2011 at 04:18 AM. Reason: Typo
Old 10-22-2011, 07:38 AM   #7
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.
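
A small sketch of "keeping the primary data only" in SAM: one FASTQ record becomes one unaligned SAM line, and the meta data goes into an @RG header line. This assumes single-end reads with Phred+33 qualities; the read group ID, sample name and function names are made up for the example.

```python
from typing import List


def unaligned_sam_header(read_group_id: str, sample: str,
                         platform: str = "ILLUMINA") -> List[str]:
    """Build minimal header lines; the @RG line is where meta data lives."""
    return [
        "@HD\tVN:1.6\tSO:unsorted",
        "@RG\tID:%s\tSM:%s\tPL:%s" % (read_group_id, sample, platform),
    ]


def fastq_to_unaligned_sam(name: str, seq: str, qual: str,
                           read_group_id: str) -> str:
    """Format one single-end FASTQ record as an unmapped SAM line."""
    fields = [
        name.split()[0],  # QNAME: drop any description after whitespace
        "4",              # FLAG: 0x4, read unmapped
        "*",              # RNAME: no reference
        "0",              # POS
        "0",              # MAPQ
        "*",              # CIGAR
        "*",              # RNEXT
        "0",              # PNEXT
        "0",              # TLEN
        seq,
        qual,             # assumed to be Phred+33 already, as SAM expects
        "RG:Z:" + read_group_id,
    ]
    return "\t".join(fields)


# Hypothetical read group and sample names, for illustration only.
header = unaligned_sam_header("lane1", "sampleA")
line = fastq_to_unaligned_sam("read1 extra text", "ACGT", "IIII", "lane1")
print("\n".join(header + [line]))
```

The sequence and quality strings pass through unchanged; everything else in the line is a placeholder saying "not aligned".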

On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain especially in C. I know many will argue that a SAM/BAM library is available to each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one. In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which makes BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.

Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will live long, if not outlive SAM/BAM.

Last edited by lh3; 10-23-2011 at 08:29 PM. Reason: fixed grammatical errors
Old 10-22-2011, 11:09 AM   #8
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default sequence storage interface

One thing that I would like to see is a clear separation between the interface and the implementation of these sequence storage formats - similar to the relationship between graphics and OpenGL, for example. An interface that allows the user to extract certain information from the data with guaranteed time/space complexity bounds would help in hiding some of the details of the low level implementation. For example, as long as one could extract intervals that overlap a certain range, it wouldn't matter if it was done using UCSC binning scheme, augmented intervals, nested-containment lists, or something else with similar complexity behaviors.

BAM/SAM could act as a model implementation of the interface and serve as a proof-of-concept that such an interface can be satisfied. This way, the tools that people write won't break when the implementation changes or if there is a switch to a new storage format.
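
A toy sketch of that separation, assuming half-open (start, end, name) intervals on a single reference. The class and function names are invented for illustration and are not part of any existing SAM/BAM API; neither implementation is the UCSC binning scheme, they only show that callers need not care which one is behind the interface.

```python
from abc import ABC, abstractmethod
from bisect import bisect_left
from typing import Iterable, Iterator, List, Tuple

Interval = Tuple[int, int, str]  # (start, end, name), half-open coordinates


class IntervalStore(ABC):
    """The interface: callers only ever see this method."""

    @abstractmethod
    def overlapping(self, start: int, end: int) -> Iterator[Interval]:
        """Yield every stored interval that overlaps [start, end)."""


class LinearScanStore(IntervalStore):
    """Simplest possible implementation: check every interval."""

    def __init__(self, intervals: Iterable[Interval]) -> None:
        self._intervals: List[Interval] = sorted(intervals)

    def overlapping(self, start: int, end: int) -> Iterator[Interval]:
        for s, e, name in self._intervals:
            if s < end and e > start:
                yield (s, e, name)


class SortedStartStore(IntervalStore):
    """Binary search on start positions, bounded by the longest interval."""

    def __init__(self, intervals: Iterable[Interval]) -> None:
        self._intervals: List[Interval] = sorted(intervals)
        self._starts = [iv[0] for iv in self._intervals]
        self._max_len = max((e - s for s, e, _ in self._intervals), default=0)

    def overlapping(self, start: int, end: int) -> Iterator[Interval]:
        # No interval starting before start - max_len can reach start.
        i = bisect_left(self._starts, start - self._max_len)
        while i < len(self._intervals) and self._starts[i] < end:
            s, e, name = self._intervals[i]
            if e > start:
                yield (s, e, name)
            i += 1


def count_overlaps(store: IntervalStore, start: int, end: int) -> int:
    """A 'tool' written against the interface, not an implementation."""
    return sum(1 for _ in store.overlapping(start, end))


data = [(0, 10, "a"), (5, 15, "b"), (20, 30, "c")]
print(count_overlaps(LinearScanStore(data), 12, 21))
print(count_overlaps(SortedStartStore(data), 12, 21))
```

Swapping one store for the other changes the running time but not the tool, which is the point of the proposal.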
Old 10-22-2011, 11:46 AM   #9
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

That is like the sequence alignment APIs we were discussing. It is definitely a good thing, but I have never got time to do that for SAM/BAM.
Old 10-24-2011, 09:47 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by lh3 View Post
The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.
Here we agree. Maybe I should mention the Broad on the blog post too...

Quote:
Originally Posted by lh3 View Post
On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain especially in C. I know many will argue that a SAM/BAM library is available to each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one.
Here I do disagree with you - there is a time and a place for writing your own library functions, but in this example I think using a library for parsing SAM/BAM is very sensible - especially if it lets you spend more time on the core algorithm and less on the file IO.

Quote:
Originally Posted by lh3 View Post
In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which makes BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.
I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.

Quote:
Originally Posted by lh3 View Post
Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will live long, if not outlive SAM/BAM.
I suspect you're right - but I would still like to see FASTQ replaced sooner rather than later

Last edited by maubp; 11-23-2011 at 06:59 AM. Reason: Fixed autocorrection of Broad to Board.
Old 11-08-2011, 01:03 PM   #11
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by maubp View Post
I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.
I've looked at this in more detail now, and think BGZF could be much more widely used, see this blog post and forum thread:
http://blastedbio.blogspot.com/2011/...tter-gzip.html
http://seqanswers.com/forums/showthread.php?t=15347
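
The core idea behind the random access can be sketched with nothing but the standard gzip module: compress the data as a series of independent gzip members and remember where each one starts. This is not real BGZF, which also stores the block size in a gzip extra field and limits each block to 64 KiB, but the virtual offset packing at the end follows the scheme described in the SAM specification. The function names are mine.

```python
import gzip
from typing import List, Tuple


def write_blocked(chunks: List[bytes]) -> Tuple[bytes, List[int]]:
    """Compress each chunk as its own gzip member; return data and block starts."""
    data = bytearray()
    offsets = []
    for chunk in chunks:
        offsets.append(len(data))
        data += gzip.compress(chunk)
    return bytes(data), offsets


def read_block(data: bytes, offsets: List[int], index: int) -> bytes:
    """Decompress a single block without touching the others."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(data)
    return gzip.decompress(data[start:end])


def virtual_offset(block_start: int, within_block: int) -> int:
    """Pack a compressed block start and an uncompressed offset, BGZF style."""
    if not 0 <= within_block < 65536:
        raise ValueError("within-block offset must fit in 16 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(voffset: int) -> Tuple[int, int]:
    """Undo virtual_offset()."""
    return voffset >> 16, voffset & 0xFFFF


chunks = [b"@r1\nACGT\n+\nIIII\n", b"@r2\nGGCC\n+\nIIII\n", b"@r3\nTTAA\n+\nIIII\n"]
blob, offsets = write_blocked(chunks)

# The whole thing is still a valid multi-member gzip stream...
assert gzip.decompress(blob) == b"".join(chunks)
# ...but one block can be read on its own.
print(read_block(blob, offsets, 1))
```

The price is a slightly larger file than one continuous gzip stream, because each block restarts the compressor.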
Old 11-29-2016, 11:28 AM   #12
RamakrishnanRS
Junior Member
 
Location: New York

Join Date: Oct 2012
Posts: 9
Default Where are we today?

Where do we stand on this today? If someone were to build a pipeline, what are the data points they should look at to decide between FASTQ and uBAM?

Most of all, file size concerns me. I no longer work on FASTQ, but when I did (1.5 years ago), they were 4-5 gigs, gzipped (WGS, 30X). I've never encountered uBAMs, but BAMs are 60+ gigs. Am I wrong to compare BAMs to uBAMs? Are they exponentially different in size? How would a WGS 30X uBAM compare in size to a FASTQ from the same experiment?
__________________
Ram
Old 11-29-2016, 11:43 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,494
Default

I think we are right where we were when this thread started. Gzipped fastq files are still the most common deliverable for sequencing, AFAIK. I believe PacBio has started moving to a variant of BAM with the new SMRTportal v.3.0, but there is no change in that direction from Illumina.

You are free to choose any format that suits your internal needs.
Old 11-29-2016, 01:04 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I find gzipped fastq to be the most convenient. The sam/bam specification has a lot of limitations, like read 1 and read 2 having the same name. uBam is just what some random person decided to call "unmapped bam". They're still bam files.

Gzipped fastq is smaller and faster to process than unmapped bam. I just ran a test on 100k reads with these commands:

reformat.sh in=reads.fq.gz out=100k.fq.gz zl=6 ow reads=100k
reformat.sh in=reads.fq.gz out=100k_u.sam.gz zl=6 ow reads=100k
reformat.sh in=reads.fq.gz out=100k_u.bam zl=6 ow reads=100k

These are the sizes:

Code:
-rw-rw-r-- 1 bushnell genome 8784821 Nov 29 13:57 100k.fq.gz
-rw-rw-r-- 1 bushnell genome 9011991 Nov 29 13:58 100k_u.bam
-rw-rw-r-- 1 bushnell genome 8815867 Nov 29 13:57 100k_u.sam.gz

Timings, in seconds:

Format   Write   Read    CPU-time (reading)
fq.gz    0.382   0.304   1.438
sam.gz   0.400   0.375   1.431
bam      1.958   0.470   1.814

So in addition to being inconvenient, unmapped bam is universally worse from a performance and space perspective.
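
For anyone who wants to repeat the size comparison without reformat.sh, here is a rough sketch using only the Python standard library. It generates synthetic reads, so the absolute numbers will not match the figures above, and it only covers gzipped FASTQ versus gzipped unaligned SAM, not BAM. All function names are made up for the example.

```python
import gzip
import random
from typing import Iterator, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality)


def synthetic_reads(n: int, length: int = 50, seed: int = 0) -> Iterator[Read]:
    """Generate reproducible random reads; real data will compress differently."""
    rng = random.Random(seed)
    for i in range(n):
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        qual = "".join(chr(33 + rng.randint(2, 40)) for _ in range(length))
        yield "read%d" % i, seq, qual


def as_fastq(reads: List[Read]) -> str:
    """Four lines per record."""
    return "".join("@%s\n%s\n+\n%s\n" % r for r in reads)


def as_unaligned_sam(reads: List[Read]) -> str:
    """One line per record, with placeholder values in the alignment columns."""
    return "".join("%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n" % r for r in reads)


reads = list(synthetic_reads(1000))
fastq_gz = gzip.compress(as_fastq(reads).encode("ascii"), compresslevel=6)
sam_gz = gzip.compress(as_unaligned_sam(reads).encode("ascii"), compresslevel=6)

print("fq.gz  bytes:", len(fastq_gz))
print("sam.gz bytes:", len(sam_gz))
```

Swapping synthetic_reads for a parser over your own files gives numbers that are meaningful for your data.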
Old 12-14-2016, 10:40 AM   #15
StackerEd
Junior Member
 
Location: New Jersey

Join Date: May 2016
Posts: 1
Default

Sometimes you don't need alignments, you need the raw reads, so long live FASTQ.
Old 12-14-2016, 11:09 AM   #16
RamakrishnanRS
Junior Member
 
Location: New York

Join Date: Oct 2012
Posts: 9
Default Ummm

Quote:
Originally Posted by StackerEd View Post
Sometimes you don't need alignments, you need the raw reads, so long live FASTQ.
Ummm uBAM doesn't have alignments. It's called "unaligned BAM" for a reason.
__________________
Ram

All times are GMT -8.