  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    FASTQ must die! Long live SAM/BAM!

    One of the ideas mentioned on the SEQanswers letter thread was about linking blog content and discussion back to SEQanswers, so...

    I've just blogged about why I think we as a community should try to move away from FASTQ as a file format for unaligned reads and use SAM/BAM instead. The post is titled "FASTQ must die! Long live SAM/BAM!", and I will suggest that people comment on this thread rather than on the blog.

    This is partly because I don't seem to have got my blog comments settings right anyway
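As a rough illustration of what storing unaligned reads in SAM could look like, the sketch below formats a read as a single SAM line with the unmapped flag set. The function name `unaligned_sam_line` and the read names are made up for this example, and it assumes the qualities are already in Sanger-style Phred+33 encoding, which is what SAM expects.

```python
# SAM FLAG bits used for unaligned reads (values from the SAM specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80


def unaligned_sam_line(name, seq, qual, mate=None):
    """Format one unaligned read as a SAM line (hypothetical helper).

    mate is None for a single-end read, or 1 / 2 for the reads of a pair.
    Both reads of a pair share the same name; the FLAG tells them apart.
    """
    flag = UNMAPPED
    if mate is not None:
        flag |= PAIRED | MATE_UNMAPPED
        flag |= FIRST_IN_PAIR if mate == 1 else SECOND_IN_PAIR
    # Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
    fields = [name, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual]
    return "\t".join(fields)


print(unaligned_sam_line("read1", "ACGT", "IIII"))
print(unaligned_sam_line("pair7", "ACGT", "IIII", mate=1))
print(unaligned_sam_line("pair7", "TTGA", "IIII", mate=2))
```

A real file would also need a header (at least an `@HD` line), and FASTQ files in older Illumina encodings would need their qualities converted first.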
  • westerman
    Rick Westerman
    • Jun 2008
    • 1104

    #2
    I'm not sure if there is much to say. Fewer formats in bioinformatics would be good. Programs that read and write to all common formats would be good. BAM/SAM is, as far as I can tell, a good enough format. We will have to see if incompatibilities pop up during the next couple of years.


    • camelbbs
      Member
      • Jun 2011
      • 49

      #3
      I want to ask a question about bam files.

      I have two sequencing libraries from the same sample, which gives me two FASTQ files with read lengths of 50 bp and 36 bp respectively.
      When I run TopHat, I cannot combine the two FASTQ files because I need to specify -r. But once I have the accepted.bam files, can I combine them (the BAM files) with samtools merge?

      thanks everyone.


      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Originally posted by camelbbs
        I want to ask a question about bam files.
        I was going to recommend asking in a new thread, but you've done that


        • simonandrews
          Simon Andrews
          • May 2009
          • 870

          #5
          Whilst I appreciate the sentiments of your argument for getting rid of fastq format, I tend to disagree.

          I guess my main objections would be:

          1) I like having a separation of primary data and derived data. FastQ is primary data which is never going to change. BAM/SAM is derived data which might change if you use a different read mapper, genome assembly etc.

          2) I like simple plain text formats. FastQ, for all of its failings (and it certainly has those!), is a simple format which is easy to parse and deal with. SAM/BAM is much harder to get your head around. Realistically you need to use an existing library to do anything with a BAM/SAM file due to the complexities of the format.

          3) FastQ is more future-proof. Because FastQ format makes no assumptions about the structure of your experiments (precisely because it contains no metadata) it makes very few assumptions about what your data is going to look like in the future. If you look at the recent changes to BAM format to get around the previous assumption of only ever having a maximum of two reads per sequence then you can see how this might go wrong in future.

          We use BAM format all the time, but it's not a format I particularly like working with. You mentioned the flag field in your blog which must single-handedly have caused more trouble than any other format design decision ever made in bioinformatics! I can see the appeal of the format, but the field is still undergoing such rapid change I can see that it's probably not finished yet.
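To make the "easy to parse" point concrete, here is a minimal sketch of a FASTQ reader in plain Python. It assumes the common four-lines-per-record layout with no line wrapping and Phred+33 qualities; the function name `read_fastq` and the example read are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        title = handle.readline()
        if not title:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        # Basic sanity checks; wrapped (multi-line) FASTQ is not handled here.
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: %r" % title)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in %r" % title)
        yield title[1:].rstrip("\n"), seq, qual


example = io.StringIO("@read1 sample\nACGT\n+\nII5!\n")
records = list(read_fastq(example))

# Decode the quality string, assuming Phred+33 (Sanger) encoding.
phred = [ord(char) - 33 for char in records[0][2]]
print(records, phred)
```

Note that everything beyond the name, bases and qualities (sample, library, pairing) has to be guessed from the title line or the filename.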


          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Hi Simon,

            Thanks for your comments. You raise some good points, but I don't agree with them all.

            (1) Editing of FASTQ files happens already though (quality trimming, filtering, etc) so there is no clear separation between primary data and derived data.

            (2) Given how big sequence data files are getting, it is increasingly impractical to work with them as plain text (not so bad for viruses though). You can do plenty with SAM at the Unix command line, and the fact that it is one line per read actually helps. For anything non-trivial, yes, a SAM/BAM library helps.

            (3) From a long-term data archiving point of view, going through all the SAM/BAM format revisions to try to understand what an old file means might be hard, but try extracting the metadata from a FASTQ file when there are 101 different filename, header or read naming conventions, many of them undocumented.

            (unnumbered 4) I agree the representation of the FLAG in SAM as a single (decimal) integer was probably the worst design choice in the format. Even an eight character string of 0s and 1s would have been easier to understand. However, it is done, and changing it will only break things - and only benefit people working on the files directly with scripts and Unix one-line magic. If you're using a SAM/BAM library this should map the FLAG bits for you.

            And I agree things will change (e.g. maybe one day we will see SAM/BAM move to HDF5 rather than the homegrown BGZF used now).
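For anyone working on SAM text directly, the FLAG integer can be unpacked with a few lines of code. The sketch below uses the bit values from the current SAM specification (0x800 was added after this thread was started); the lower-case labels are descriptive names chosen for this example, not official identifiers.

```python
# (bit value, label) pairs; bit values follow the SAM specification,
# labels are informal names picked for this sketch.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


def flag_as_bits(flag):
    """Show the FLAG as a string of 0s and 1s, most significant bit first."""
    return format(flag, "012b")


print(decode_flag(99), flag_as_bits(99))
```

The second helper shows the kind of 0/1 string representation mentioned above, which is arguably easier to read than the decimal form.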

            Peter
            Last edited by maubp; 10-22-2011, 04:18 AM. Reason: Typo


            • lh3
              Senior Member
              • Feb 2008
              • 686

              #7
              The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.

              On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain, especially in C. I know many will argue that a SAM/BAM library is available for each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one. In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which make BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.

              Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will long live, if not outlive SAM/BAM.
              Last edited by lh3; 10-23-2011, 08:29 PM. Reason: fixed grammatical errors
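As a small illustration of the metadata point, SAM can record sample and library information in `@RG` header lines and tie each read to one of them with an `RG:Z:` tag. The sketch below builds such lines as plain strings; the helper names and the sample/library values are hypothetical, and only a few of the available `@RG` fields are shown.

```python
def read_group_header(rg_id, sample, library, platform):
    """Build one @RG header line (ID, SM, LB and PL are standard @RG tags)."""
    return "\t".join(
        ["@RG", "ID:" + rg_id, "SM:" + sample, "LB:" + library, "PL:" + platform]
    )


def tag_read(sam_line, rg_id):
    """Append an RG:Z: optional field linking a read to its read group."""
    return sam_line + "\tRG:Z:" + rg_id


# Two libraries prepared from the same (made-up) sample.
header = [
    "@HD\tVN:1.4\tSO:unsorted",
    read_group_header("lib50", "sampleA", "library_50bp", "ILLUMINA"),
    read_group_header("lib36", "sampleA", "library_36bp", "ILLUMINA"),
]
read = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
print("\n".join(header + [tag_read(read, "lib50")]))
```

With FASTQ, the same information would have to live in the filename or the read title, with no agreed convention.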


              • BAMseek
                Senior Member
                • Apr 2011
                • 124

                #8
                sequence storage interface

                One thing that I would like to see is a clear separation between the interface and the implementation of these sequence storage formats - similar to the relationship between graphics and OpenGL, for example. An interface that allows the user to extract certain information from the data with guaranteed time/space complexity bounds would help in hiding some of the details of the low level implementation. For example, as long as one could extract intervals that overlap a certain range, it wouldn't matter if it was done using UCSC binning scheme, augmented intervals, nested-containment lists, or something else with similar complexity behaviors.

                BAM/SAM could act as a model implementation of the interface and serve as a proof-of-concept that such an interface can be satisfied. This way, the tools that people write won't break when the implementation changes or if there is a switch to a new storage format.
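One way to picture this separation is an abstract interface with interchangeable backends. The sketch below is only an assumption about what such an interface might look like: `ReadStore`, `LinearScanStore` and `count_overlaps` are invented names, and the linear scan stands in for a real index (binning, interval tree, nested containment list) that would give better complexity bounds.

```python
from abc import ABC, abstractmethod


class ReadStore(ABC):
    """Hypothetical storage interface: tools depend on this, not on a file format."""

    @abstractmethod
    def fetch(self, reference, start, end):
        """Yield (name, start, end) for reads overlapping half-open [start, end)."""


class LinearScanStore(ReadStore):
    """Toy backend: O(n) scan over an in-memory list of reads."""

    def __init__(self, reads):
        # Each read is a (reference, start, end, name) tuple.
        self._reads = list(reads)

    def fetch(self, reference, start, end):
        for ref, r_start, r_end, name in self._reads:
            if ref == reference and r_start < end and r_end > start:
                yield name, r_start, r_end


def count_overlaps(store, reference, start, end):
    """Client code written against the interface only."""
    return sum(1 for _ in store.fetch(reference, start, end))


store = LinearScanStore(
    [("chr1", 0, 50, "a"), ("chr1", 40, 90, "b"), ("chr2", 0, 50, "c")]
)
print(count_overlaps(store, "chr1", 45, 60))
```

Swapping `LinearScanStore` for an indexed backend would not require any change to `count_overlaps`, which is the property being asked for.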


                • lh3
                  Senior Member
                  • Feb 2008
                  • 686

                  #9
                  That is like the sequence alignment APIs we were discussing. It is definitely a good thing, but I have never got time to do that for SAM/BAM.


                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Originally posted by lh3
                    The major problem with fastq is we are unable to keep meta data. This is a disadvantage, not an advantage in almost all aspects. From this angle, SAM is at least not worse than fastq -- we can always keep the primary data only -- and SAM is arguably the only universal way to keep meta data. It is true that we may need to change SAM when a new technology comes with new read structures or new types of information, but other solutions are no better. We need to design something new anyway. Then why not just add to SAM? I do not know the decision process at Sanger and Broad about the use of BAM to store the primary data. I would guess the ability to keep meta data in BAM is a key.
                    Here we agree. Maybe I should mention the Broad on the blog post too...

                    Originally posted by lh3
                    On the other hand, I do not see fastq dying. SAM/BAM is too heavy. Parsing SAM/BAM by yourself is really a pain, especially in C. I know many will argue that a SAM/BAM library is available for each mainstream programming language. But there are developers like me who resist using a non-standard external library for something that is supposed to be simple and has little to do with the core algorithm. This is my philosophy of implementing algorithms, even if a bad one.
                    Here I do disagree with you - there is a time and a place for writing your own library functions, but in this example I think using a library for parsing SAM/BAM is very sensible - especially if it lets you spend more time on the core algorithm and less on the file IO.

                    Originally posted by lh3
                    In this line, it is easy to imagine my resistance to HDF5. And this resistance is not all about my personal opinion: BGZF indeed has several technical advantages over HDF5 which make BGZF more suitable for SAM/BAM. Actually the simplicity of BGZF alone is strong enough to win me over.
                    I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.

                    Originally posted by lh3
                    Back to the topic. SAM/BAM is good, but it is not for everything and for everyone. Fastq has its niche and will long live, if not outlive SAM/BAM.
                    I suspect you're right - but I would still like to see FASTQ replaced sooner rather than later
                    Last edited by maubp; 11-23-2011, 07:59 AM. Reason: Fixed autocorrection of Broad to Board.


                    • maubp
                      Peter (Biopython etc)
                      • Jul 2009
                      • 1544

                      #11
                      Originally posted by maubp
                      I'm coming to like BGZF, and thinking about how to use it for other sequential (in the sense of one record after another) file formats like FASTA, FASTQ, GenBank etc. BGZF gives you almost as good compression as gzip, but makes random access much more efficient.
                      I've looked at this in more detail now, and think BGZF could be much more widely used, see this blog post and forum thread:
                      BAM files are compressed using a variant of GZIP (GNU ZIP) , called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specifica...

                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
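The core idea behind BGZF, compressing the data as a series of small independent gzip blocks so that a reader can jump straight to one block, can be sketched with the Python standard library. This is a simplification and not the real BGZF format: actual BGZF stores each block's size in a gzip extra field and addresses data with virtual offsets, and the function names and block size here are assumptions for the example.

```python
import gzip
import io
import zlib


def write_blocked_gzip(data, block_size=65536):
    """Compress data as independent gzip members; return (blob, block offsets)."""
    out = io.BytesIO()
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(out.tell())  # where this block starts in the output
        out.write(gzip.compress(data[start:start + block_size]))
    return out.getvalue(), offsets


def read_block(blob, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    # Decompression stops at the end of the first member; the bytes of
    # later blocks end up in decompressor.unused_data and are ignored.
    return decompressor.decompress(blob[offset:])


data = b"ACGT" * 50000
blob, offsets = write_blocked_gzip(data)
third_block = read_block(blob, offsets[2])
print(len(offsets), len(third_block))
```

Because concatenated gzip members are still a valid gzip stream, `gzip.decompress(blob)` recovers the whole input, which is why blocked files stay readable by ordinary gzip tools.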


                      • RamakrishnanRS
                        Junior Member
                        • Oct 2012
                        • 9

                        #12
                        Where are we today?

                        Where do we stand on this today? If someone were to build a pipeline, what are the data points they should look at to decide between FASTQ and uBAM?

                        Most of all, file size concerns me. I no longer work on FASTQ, but when I did (1.5 years ago), they were 4-5 gigs, gzipped (WGS, 30X). I've never encountered uBAMs, but BAMs are 60+ gigs. Am I wrong to compare BAMs to uBAMs? Are they exponentially different in size? How would a WGS 30X uBAM compare in size to a FASTQ from the same experiment?
                        Ram


                        • GenoMax
                          Senior Member
                          • Feb 2008
                          • 7142

                          #13
                          I think we are right where we were when this thread started. Gzipped fastq files are still the most common deliverable for sequencing AFAIK. I believe PacBio has started moving to a variant of BAM with the new SMRTportal v.3.0, but there is no change in that direction from Illumina.

                          You are free to choose any format that suits your internal needs.


                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            #14
                            I find gzipped fastq to be the most convenient. The sam/bam specification has a lot of limitations, like read 1 and read 2 having the same name. uBam is just what some random person decided to call "unmapped bam". They're still bam files.

                            Gzipped fastq is smaller and faster to process than unmapped bam. I just ran a test on 100k reads with these commands:

                            reformat.sh in=reads.fq.gz out=100k.fq.gz zl=6 ow reads=100k
                            reformat.sh in=reads.fq.gz out=100k_u.sam.gz zl=6 ow reads=100k
                            reformat.sh in=reads.fq.gz out=100k_u.bam zl=6 ow reads=100k

                            These are the sizes:

                            Code:
                            -rw-rw-r-- 1 bushnell genome 8784821 Nov 29 13:57 100k.fq.gz
                            -rw-rw-r-- 1 bushnell genome 9011991 Nov 29 13:58 100k_u.bam
                            -rw-rw-r-- 1 bushnell genome 8815867 Nov 29 13:57 100k_u.sam.gz

                            Write times:
                            fq.gz: 0.382 seconds
                            sam.gz: 0.400 seconds
                            bam: 1.958 seconds

                            Read times:
                            fq.gz: 0.304 seconds
                            sam.gz: 0.375 seconds
                            bam: 0.470 seconds

                            CPU-time (reading):
                            fq.gz: 1.438s
                            sam.gz: 1.431s
                            bam: 1.814s

                            So in addition to being inconvenient, unmapped bam was worse than gzipped fastq on every measure in this test: file size, write time, read time and CPU time.
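To put the numbers above on a common scale, the short sketch below expresses each measurement relative to gzipped fastq. It uses only the figures reported in this post; the dictionary and function names are made up for the example. On these figures, unmapped bam comes out roughly 2.6% larger, about 5 times slower to write and about 1.5 times slower to read.

```python
# Measurements copied from the 100k-read test above.
size_bytes = {"fq.gz": 8784821, "sam.gz": 8815867, "bam": 9011991}
write_seconds = {"fq.gz": 0.382, "sam.gz": 0.400, "bam": 1.958}
read_seconds = {"fq.gz": 0.304, "sam.gz": 0.375, "bam": 0.470}


def relative_to(values, baseline="fq.gz"):
    """Express every value as a ratio to the baseline format."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


for label, values in [
    ("size", size_bytes),
    ("write time", write_seconds),
    ("read time", read_seconds),
]:
    ratios = relative_to(values)
    summary = ", ".join("%s %.3fx" % (name, ratio) for name, ratio in ratios.items())
    print("%-10s %s" % (label, summary))
```

A single run on 100k reads is a small sample, so these ratios are best read as indicative rather than as a general benchmark.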


                            • StackerEd
                              Junior Member
                              • May 2016
                              • 1

                              #15
                              Sometimes you don't need alignments; you need the raw reads. So long live FASTQ.
