storing raw sequences as BAMs versus FASTQs


id0
12-09-2014, 06:51 AM
Some sequencing facilities are storing the raw data as BAMs instead of FASTQs. Since it's faster and easier to generate FASTQs, there must be some benefit of converting to BAMs. I remember seeing some discussion about this before, but I can't find it anymore. Does anyone have a good list of pros and cons?

Brian Bushnell
12-09-2014, 09:50 AM
Pro: Bam has somewhat higher compression than gzipped fastq.

Con: It's much more irritating to use; most tools will require you to convert it back to fastq first. It's also harder to stream-process because, with paired reads, bam does not ensure that read 1 and read 2 are near each other or in any particular order, as opposed to paired fastq files. Also, the read names get changed, since in sam format read 1 and read 2 must have identical names, which typically means truncating things like the /1 /2 and bar code. And bam generally forces you to rely on 3rd-party software - whether samtools, picard, or something else - that sometimes has bugs, as the sam/bam specifications keep changing and are sometimes unclear. Speed- and memory-wise, the use of these programs can also be limiting, and out of your control. Processing gzipped fastq, by contrast, relies only on gzip, which has been stable for a long time, is available on every platform, and uses a trivial amount of memory.
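
To illustrate the stream-processing point, here is a minimal Python sketch of re-pairing mates from SAM text records that arrive in arbitrary order. The function name pair_mates is hypothetical and it is not taken from any of the tools mentioned; the flag bits (0x40 first in pair, 0x80 second in pair, 0x100 secondary, 0x800 supplementary) are from the SAM specification.

```python
def pair_mates(sam_lines):
    """Yield (read1, read2) SAM lines from records given in any order.

    Unmatched reads wait in a dict keyed by read name, so memory use
    grows with the distance between the two mates in the stream.
    """
    pending = {}
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        line = line.rstrip("\n")
        fields = line.split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x900:  # skip secondary (0x100) and supplementary (0x800)
            continue
        mate = pending.pop(name, None)
        if mate is None:
            pending[name] = line  # first mate seen: park it until its partner arrives
            continue
        # put the record flagged 0x40 (first in pair) first
        yield (line, mate) if flag & 0x40 else (mate, line)
```

In the worst case the pending dict holds half the file, which is the memory cost that two synchronized fastq files avoid.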

biocomputer
12-09-2014, 11:43 AM
>Bam has somewhat higher compression than gzipped fastq.

I just checked some of my files and 9 .bam files I received from the sequencing facility are 21.2GB while the corresponding 18 .fq.gz files (R1 + R2 for each) that I made after converting with picard tools SamToFastq then gzip are 17.7GB. Is it unusual that the .gz is smaller than the .bam? The files all seemed fine when I aligned and processed them.

Brian Bushnell
12-09-2014, 11:52 AM
Both gzip and bam support variable compression; it's possible that the bam files were generated with a low compression setting.

Gzipped sam files with the same compression are almost always bigger than bam files. Gzipped fastq may not show the same difference, but as an example, I just converted a 64kb fastq file to gzip and to sam->bam: the gzipped file was 32039 bytes and the bam was 31034 bytes.

However, sometimes sequencing centers will cram all kinds of random annotation into the bam file which bloats it. You might look at a few reads from the bam file in sam format and see if they have optional fields. I guess that could be considered another advantage, though it could also be considered as just an opportunity for bloat.
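
One way to look for optional fields is to tally the tags that follow the 11 mandatory SAM columns. This is only a rough Python sketch; the function name count_optional_tags is made up for illustration, and it assumes the input is SAM text (for example, the output of samtools view on the bam file).

```python
from collections import Counter


def count_optional_tags(sam_lines):
    """Count optional TAG:TYPE:VALUE fields across SAM alignment lines."""
    tags = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # columns 1-11 are mandatory; everything after is optional
        for field in fields[11:]:
            tags[field.split(":", 1)[0]] += 1
    return tags
```

Running it over the first few thousand reads should be enough to see which tags the facility added and how often.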

blancha
12-09-2014, 11:59 AM
Besides all the good points raised by Brian Bushnell, you might want to check if the BAM files do not already contain the aligned sequences.

My sequencing facility has switched recently from providing us the FASTQ files to providing us the aligned BAM files, which seems to be an (unfortunate?) trend.

You can just check the @PG line, on the last line of the header, which has the information about the program used to generate the BAM file.


samtools view -H name_file.bam
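
If there are many files to check, the header text printed by that command can be parsed instead of read by eye. The following is a small Python sketch under that assumption; programs_in_header is a hypothetical helper, not part of samtools.

```python
def programs_in_header(header_text):
    """Return one dict of tag -> value for each @PG line in a SAM header."""
    programs = []
    for line in header_text.splitlines():
        if not line.startswith("@PG\t"):
            continue
        entry = {}
        for field in line.split("\t")[1:]:
            # split on the first colon only; CL values may contain colons
            tag, _, value = field.partition(":")
            entry[tag] = value
        programs.append(entry)
    return programs
```

A PN or CL value naming an aligner such as bwa suggests the file has been aligned. An empty list does not prove the reverse, since @PG lines are optional.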

biocomputer
12-09-2014, 01:20 PM
>Besides all the good points raised by Brian Bushnell, you might want to check if the BAM files do not already contain the aligned sequences.


The .bam files I received were already aligned. I didn't realize an unaligned .bam file is something people use.

blancha
12-09-2014, 01:31 PM
The argument given by my sequencing center for providing aligned BAM files is that it saves them some space, and allows them to compute metrics on the alignment.

It's a bit of a mess for us, because they just use BWA directly, and only for the reference genomes for which they have a pipeline set up.
They don't take into account that RNA-Seq data should be aligned with a splice-aware aligner, small RNA-Seq should be trimmed first, methyl-Seq data should be aligned to a bisulfite-converted reference genome, …

It's then my job to explain to our clients why sometimes they get FASTQ files whereas other times they get BAM files, and that even if they get BAM files they may have to repeat the alignment.

That being said, Picard tools SamToFastq.jar does work well, and, in some cases, I do agree that having the BAM files can save some time. More often than not though, it's an extra step in the pipeline to reconvert the BAM files to FASTQ files and repeat the alignment.

GenoMax
12-09-2014, 03:06 PM
@blancha: It is surprising that your facility is forcing you to accept data in a format that is not plain sequence files. Is this something spelled out in their deliverables list or did they just start doing this unilaterally?

biocomputer
12-09-2014, 03:36 PM
@blancha

I see you're in Montreal, and we've been using Genome Quebec and they are providing BWA-aligned .bam files similar to what you've described, same facility?

@GenoMax

If blancha and I are using the same facility, what they emailed is:

In order to provide additional quality control metrics and ready-to-use data sets, raw Illumina sequencing data delivered via Nanuq will be offered as BAM files instead of fastq files. BAM files have the same content as fastq files (reads + quality scores) but they also contain alignment information.

Starting April 21st 2014, release of sequencing data when the reference genome is “human” and “dna type” will only be available as BAM files. Other data sets using different reference genomes and “not dna type” will follow soon after that. The mapping software used to generate the BAM files is BWA. When no reference genome is available, files will remain in the fastq format.

In all cases, if fastq files are preferred to BAM files, they can always be regenerated by following the steps below since no information is removed: https://biowiki.atlassian.net/wiki/display/MUG/Conversion+tools#Conversiontools-bamtofastq

I think it just creates more work since I assume most people have their own alignment and processing pipeline and will want to realign from fastq anyways. For me that is the case.

lh3
12-09-2014, 05:19 PM
A sorted BAM is sometimes smaller than the gzip'd fastq (depending on coverage), but the unsorted BAM is larger most of the time. Some bioinfo cores prefer BAM because BAM keeps meta information, such as sample, lane, platform, run time, estimated insert size, read groups, barcode, etc. There are no standard ways to keep these in FASTQ. In addition, some groups do like to have aligned BAMs. A unified mapping procedure also helps data integration.

If I were a service provider, I would give two options: sorted BAM and unaligned BAM. When dealing with many data sets, keeping full meta info and raw data in one archive is a huge win. It is trivial to convert unaligned BAM to fastq. Many modern mappers also support reading interleaved paired-end data from a stream, so no temporary files need to be created.

blancha
12-09-2014, 07:07 PM
@GenoMax.
It was very much a unilateral decision. There were quite a few protests, but since they are the biggest sequencing center in Canada they can pretty much dictate their terms.

I was actually mistakenly using bedtools bamtofastq at first, which keeps both the primary and secondary alignments from BWA. It turns out it doesn't have much impact on the RNA-Seq results, but I was quite nervous when I realized this. To be fair, they had told me to use Picard tools' SamToFastq, which only keeps the primary alignments.
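
The primary/secondary distinction comes down to the SAM flag bits. As a rough Python sketch (this is not how bedtools or Picard are implemented, and sam_to_fastq is a made-up name), a primary-only conversion from SAM text to FASTQ could look like the following. It also undoes the reverse-complementing that the SAM format applies to reads aligned to the reverse strand (flag 0x10).

```python
# complement table for undoing reverse-strand storage of SEQ
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_lines):
    """Yield one 4-line FASTQ record per primary alignment in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, seq, qual = f[0], int(f[1]), f[9], f[10]
        if flag & 0x100 or flag & 0x800:
            continue  # drop secondary and supplementary alignments
        if flag & 0x10:
            # stored reverse-complemented: restore the original read
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]
        if flag & 0x40:
            name += "/1"  # first in pair
        elif flag & 0x80:
            name += "/2"  # second in pair
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

Without the 0x100/0x800 check, a read with secondary alignments would be written out more than once, which is the duplication described above.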

@biocomputer
Yes, it is Génome Québec. Now that I have lost some of my anonymity, I can't say anything bad about them. :)
To be honest, it isn't that big of a deal. The BAM files can easily be converted back to Fastq files, and in some cases it is quicker to use their alignments.

We were receiving so many complaints about the BAM files though, that the decision was made (not by me) not to specify the reference genome when sending them the samples. It's very devious, but they then are not able to do the alignments, and are forced to provide us the Fastq files. It's definitely not something I would recommend though, and I don't want to initiate a movement against sequencing centers giving us aligned BAM files by advising users not to give them the samples' species.

I do wish they had given the clients a choice though, if only to avoid me having to spend so much time on such a trivial issue.

maubp
12-11-2014, 03:59 AM
Interesting to see this happening now - it seemed like a good idea three years ago:
http://blastedbio.blogspot.co.uk/2011/10/fastq-must-die-long-live-sambam.html
http://seqanswers.com/forums/showthread.php?t=14941