Some sequencing facilities are storing the raw data as BAMs instead of FASTQs. Since it's faster and easier to generate FASTQs, there must be some benefit of converting to BAMs. I remember seeing some discussion about this before, but I can't find it anymore. Does anyone have a good list of pros and cons?
-
Pro: Bam has somewhat higher compression than gzipped fastq.
Con: It's much more irritating to use; most tools will require you to convert it back to fastq first. It's also harder to stream-process because with paired reads, bam does not ensure that read 1 and read 2 are near each other or in any particular order, as opposed to paired fastq files. Also, the read names get changed, since in sam format read 1 and read 2 must have identical names, which typically truncates things like the /1 and /2 suffixes and the bar code. And bam generally forces you to rely on 3rd-party software (whether samtools, picard, or something else) that sometimes has bugs, because the sam/bam specifications keep changing and are sometimes unclear. Speed- and memory-wise, these programs can also be limiting in ways that are out of your control, whereas processing gzipped fastq relies only on gzip, which has been stable for a long time, is available on every platform, and uses a trivial amount of memory.
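To illustrate the point about streaming paired fastq with nothing but gzip, here is a minimal sketch using only Python's standard library. The function names (`read_fastq`, `base_name`, `paired_reads`) are made up for this example, and it assumes plain four-line fastq records with `/1` and `/2` style name suffixes.

```python
import gzip
from itertools import zip_longest


def read_fastq(path):
    """Yield (name, sequence, quality) from a gzipped four-line fastq file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return  # end of file
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual


def base_name(name):
    """Drop any comment after whitespace and a trailing /1 or /2."""
    name = name.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def paired_reads(r1_path, r2_path):
    """Yield (read1, read2) tuples, checking that the mates stay in sync."""
    for r1, r2 in zip_longest(read_fastq(r1_path), read_fastq(r2_path)):
        if r1 is None or r2 is None:
            raise ValueError("R1 and R2 files have different numbers of reads")
        if base_name(r1[0]) != base_name(r2[0]):
            raise ValueError(f"mate names differ: {r1[0]} vs {r2[0]}")
        yield r1, r2
```

Because both files are read in lockstep, only one pair is held in memory at a time, which is the property that a coordinate-sorted bam does not give you.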
-
>Bam has somewhat higher compression than gzipped fastq.
I just checked some of my files and 9 .bam files I received from the sequencing facility are 21.2GB while the corresponding 18 .fq.gz files (R1 + R2 for each) that I made after converting with picard tools SamToFastq then gzip are 17.7GB. Is it unusual that the .gz is smaller than the .bam? The files all seemed fine when I aligned and processed them.
-
Both gzip and bam support variable compression; it's possible that the bam files were generated with a low compression setting.
Gzipped sam files with the same compression are almost always bigger than bam files; gzipped fastq may not have the same reduction, but for example I just converted a 64kb fastq file to gzip and to sam->bam. The gzipped file was 32039 bytes and the bam was 31034 bytes.
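As a rough way to see how much the compression setting alone matters, the sketch below gzips the same synthetic fastq text at several levels. The data is randomly generated (not real reads), the helper names are hypothetical, and the numbers it prints say nothing about bam itself; it only shows that one input can give different file sizes depending on the level chosen.

```python
import gzip
import random


def synthetic_fastq(n_reads=2000, read_len=100, seed=0):
    """Build fake fastq text with random bases and a skewed quality alphabet."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def gzip_sizes(data, levels=(1, 6, 9)):
    """Return {compression level: compressed size in bytes}."""
    return {level: len(gzip.compress(data, compresslevel=level)) for level in levels}


if __name__ == "__main__":
    data = synthetic_fastq()
    print("raw bytes:", len(data))
    for level, size in gzip_sizes(data).items():
        print(f"gzip level {level}: {size} bytes")
```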
However, sometimes sequencing centers will cram all kinds of random annotation into the bam file which bloats it. You might look at a few reads from the bam file in sam format and see if they have optional fields. I guess that could be considered another advantage, though it could also be considered as just an opportunity for bloat.
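One way to do that check, sketched in Python under the assumption that you have piped a few records out as sam text (for example with `samtools view`): everything after the 11 mandatory sam columns is an optional field, so counting the tags there shows what extra annotation is present. The function names are made up for this example.

```python
from collections import Counter

MANDATORY_COLUMNS = 11  # QNAME through QUAL in the sam format


def optional_tags(sam_line):
    """Return the two-letter tags of the optional fields on one sam record."""
    fields = sam_line.rstrip("\n").split("\t")
    return [field.split(":", 2)[0] for field in fields[MANDATORY_COLUMNS:]]


def tag_counts(sam_lines):
    """Count how often each optional tag appears, skipping header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not a read
        counts.update(optional_tags(line))
    return counts
```

Calling `tag_counts(sys.stdin)` on the first few thousand records is usually enough to see whether every read carries a long list of tags.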
-
Besides all the good points raised by Brian Bushnell, you might want to check whether the BAM files already contain aligned reads.
My sequencing facility has switched recently from providing us the FASTQ files to providing us the aligned BAM files, which seems to be an (unfortunate?) trend.
You can just check the @PG lines, which record the programs used to generate the BAM file and usually sit at the end of the header.
Code: samtools view -H name_file.bam
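If you want to script that check, here is a minimal Python sketch that works on the text printed by the command above. It assumes tab-separated sam header lines and records; `program_records` and `has_mapped_reads` are names invented for this example, not part of samtools.

```python
def program_records(header_text):
    """Return one dict per @PG header line, keyed by tag (ID, PN, VN, CL, ...)."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        record = {}
        for field in line.split("\t")[1:]:
            # Split on the first colon only; CL values may contain colons.
            tag, _, value = field.partition(":")
            record[tag] = value
        records.append(record)
    return records


def has_mapped_reads(sam_lines):
    """True if any record lacks the 0x4 (segment unmapped) flag bit."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.split("\t")[1])
        if not flag & 0x4:
            return True
    return False
```

An aligner name in the `PN` or `CL` field, together with reads that are not flagged as unmapped, is a fairly strong hint that the facility has already aligned the data.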
-
Originally posted by blancha:
>Besides all the good points raised by Brian Bushnell, you might want to check whether the BAM files already contain aligned reads.
-
The argument given by my sequencing center for providing aligned BAM files is that it saves them some space, and allows them to compute metrics on the alignment.
It's a bit of a mess for us, because they just use BWA directly, and only for the reference genomes for which they have a pipeline set up.
They don't take into account that RNA-Seq data should be aligned with a splice-aware aligner, small RNA-Seq should be trimmed first, methyl-Seq data should be aligned to a bisulfite-converted reference genome, …
It's then my job to explain to our clients why sometimes they get FASTQ files whereas other times they get BAM files, and that even if they get BAM files they may have to repeat the alignment.
That being said, Picard tools SamToFastq.jar does work well, and, in some cases, I do agree that having the BAM files can save some time. More often than not though, it's an extra step in the pipeline to reconvert the BAM files to FASTQ files and repeat the alignment.
-
@blancha
I see you're in Montreal. We've been using Genome Quebec, and they are providing BWA-aligned .bam files similar to what you've described. Is that the same facility?
@GenoMax
If blancha and I are using the same facility, what they emailed is:
In order to provide additional quality control metrics and ready-to-use data sets, raw Illumina sequencing data delivered via Nanuq will be offered as BAM files instead of fastq files. BAM files have the same content as fastq files (reads + quality scores) but they also contain alignment information.
Starting April 21st 2014, release of sequencing data when the reference genome is “human” and “dna type” will only be available as BAM files. Other data sets using different reference genomes and “not dna type” will follow soon after that. The mapping software used to generate the BAM files is BWA. When no reference genome is available, files will remain in the fastq format.
In all cases, if fastq files are preferred to BAM files, they can always be regenerated by following the steps below since no information is removed: https://biowiki.atlassian.net/wiki/display/MUG/Conversion+tools#Conversiontools-bamtofastq
-
A sorted BAM is sometimes smaller than the gzip'd fastq (depending on coverage), but an unsorted BAM is larger most of the time. Some bioinfo cores prefer BAM because BAM keeps meta information, such as sample, lane, platform, run time, estimated insert size, read groups, barcode, etc. There is no standard way to keep these in FASTQ. In addition, some groups do like to have aligned BAMs. A unified mapping procedure also helps data integration.
If I were a service provider, I would give two options: sorted BAM and unaligned BAM. When dealing with many data sets, keeping full meta info and raw data in one archive is a huge win. It is trivial to convert unaligned BAM to fastq. Many modern mappers also support reading interleaved paired-end data from a stream, so no temporary files need to be created.
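To give an idea of how simple the unaligned case is, here is a Python sketch that turns sam text into interleaved fastq on a stream. It is not a replacement for the real converters: it assumes unaligned records (so nothing is reverse-complemented), it assumes mates are already adjacent in the input, and the function names are invented for this example.

```python
import sys

READ1 = 0x40  # first segment in the template
READ2 = 0x80  # last segment in the template


def sam_to_fastq_record(sam_line):
    """Format one unaligned sam record as a four-line fastq record."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & READ1:
        name += "/1"
    elif flag & READ2:
        name += "/2"
    return f"@{name}\n{seq}\n+\n{qual}\n"


def interleave(sam_lines, out=sys.stdout):
    """Write records in input order; mates must already be adjacent."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        out.write(sam_to_fastq_record(line))
```

Run as a filter between a bam-to-sam step and a mapper that accepts interleaved input, this avoids writing any intermediate fastq files.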
-
@GenoMax.
It was very much a unilateral decision. There were quite a few protests, but since they are the biggest sequencing center in Canada they can pretty much dictate their terms.
I was actually mistakenly using bedtools bamtofastq at first, which keeps both the primary and secondary alignments from BWA. It turns out it doesn't have much impact on the RNA-Seq results, but I was quite nervous when I realized this. To be fair, they had told me to use Picard tools' SamToFastq, which only keeps the primary alignments.
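For anyone wanting to check their own files for the same problem, the sketch below counts primary, secondary, and supplementary records from sam text using the FLAG bits 0x100 and 0x800. It only illustrates the flag logic; it does not reproduce what bedtools or Picard do internally, and the function names are made up for this example.

```python
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def is_primary(flag):
    """True if the record is neither secondary nor supplementary."""
    return not flag & (SECONDARY | SUPPLEMENTARY)


def count_alignment_types(sam_lines):
    """Tally records by type, skipping header lines."""
    counts = {"primary": 0, "secondary": 0, "supplementary": 0}
    for line in sam_lines:
        if line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & SECONDARY:
            counts["secondary"] += 1
        elif flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
    return counts
```

If the secondary or supplementary counts are non-zero, a converter that does not filter on these bits will emit some reads more than once.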
@biocomputer
Yes, it is Génome Québec. Now that I have lost some of my anonymity, I can't say anything bad about them.
To be honest, it isn't that big of a deal. The BAM files can easily be converted back to Fastq files, and in some cases it is quicker to use their alignments.
We were receiving so many complaints about the BAM files though, that the decision was made (not by me) not to specify the reference genome when sending them the samples. It's very devious, but they then are not able to do the alignments, and are forced to provide us the Fastq files. It's definitely not something I would recommend though, and I don't want to initiate a movement against sequencing centers giving us aligned BAM files by advising users not to give them the samples' species.
I do wish they had given the clients a choice though, if only to avoid me having to spend so much time on such a trivial issue.
-
Interesting to see this happening now - it seemed like a good idea three years ago:
I think it is time to retire the FASTQ file format in favour of storing unaligned reads in SAM/BAM format. I will try to explain, as thi...