SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
problem indexing a bam file kjaja Bioinformatics 1 05-03-2012 11:51 AM
BAM to Bed Algorithm jarwulf Bioinformatics 1 04-30-2012 03:34 AM
Bam indexing problem? SeqVicious Bioinformatics 6 01-29-2012 09:12 AM
help me on reheader ".bam for indexing and mpileup jianfeng.mao Bioinformatics 3 12-17-2010 01:48 AM
Error indexing BAM file using samtools veena Bioinformatics 9 03-04-2010 03:52 AM

Reply
 
Thread Tools
Old 08-23-2013, 08:26 AM   #1
asiangg
Member
 
Location: New York

Join Date: Dec 2008
Posts: 44
Unhappy Can anyone explain BAM indexing algorithm to me?

I have been trying to understand how the BAM index works by reading the SAM specification (v1.4, the newest version) and even samtools' source code. However, I failed to understand it fully. I'm posting here and hopefully somebody may clear my mind.

The BAM index contains two types of indices: bin index and linear index. The bin size goes from 2^29 to 2^14, so the smallest bin represents a 16kb region on the genome. While the linear index contains the file offsets for each 16kb tiling window region on the genome, which coincides with the smallest bin size.

Inside each bin, there is also a smaller unit called chunk whose definition is missing in SAM specification. Can anybody explain to me how the chunk is defined?

It seems the BAM index uses a strategy to cluster short reads whose start coordinates are close to each other. Therefore, it does not have to index each read but rather a 16kb tiling window or a chunk. This results in a very small index file: most of my ".bai" files are smaller than 10MB. Therefore, we can comfortably hold the entire indexing info in primary memory.

Besides that, I really don't understand why the binning strategy is required here and how it compensate the linear index. Given a query region, we can easily figure out the start and end file offsets of the tiling windows, read all the data into primary memory and figure out the rest in memory. This can work because we can always assume the BAM file is sorted according to the begin coordinates. (We need two linear indices for this to work, one for begin coordinate of first and one for end coordinate of the last alignment)

We should note that the BAM index is different from the UCSC genome browser index. In UCSC GB, the index is built on the genomic features such as genes. The index is potentially very large and may reside on secondary storage. Therefore, using the binning strategy can help to shrink the search scope rapidly, resulting in less disk seeks and reads. However, for BAM index, the binning really looks like an overkill.

Can anyone explain?

Thanks!
asiangg is offline   Reply With Quote
Old 08-23-2013, 10:01 AM   #2
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Each "bin" is associated with one or more chunks. A chunk is an interval in BAM. Given a chunk with bin $b, most reads in this interval have the same bin $b. None of reads outside the interval have bin $b. By going through the chunks in the bin $b, you will get all the reads with bin $b.

UCSC is the first using the binning strategy, but it does not have the concept of "index file". Its index is just bare mysql index, which is by design sitting on disk not in memory. If we migrate the BAM-like index files for UCSC tables, it will be much smaller than the BAM index because UCSC uses fewer bins. I have not read the bigwig/bigbed paper about their index.

Binning index is not overkilling. You cannot assume BAM always keep short genomic reads. We also put contig and mRNA alignments in BAMs, which sometimes span >100kb. In that case, binning index will play an important role.
lh3 is offline   Reply With Quote
Old 08-23-2013, 11:08 AM   #3
asiangg
Member
 
Location: New York

Join Date: Dec 2008
Posts: 44
Default

Quote:
Originally Posted by lh3 View Post
Each "bin" is associated with one or more chunks. A chunk is an interval in BAM. Given a chunk with bin $b, most reads in this interval have the same bin $b. None of reads outside the interval have bin $b. By going through the chunks in the bin $b, you will get all the reads with bin $b.

UCSC is the first using the binning strategy, but it does not have the concept of "index file". Its index is just bare mysql index, which is by design sitting on disk not in memory. If we migrate the BAM-like index files for UCSC tables, it will be much smaller than the BAM index because UCSC uses fewer bins. I have not read the bigwig/bigbed paper about their index.

Binning index is not overkilling. You cannot assume BAM always keep short genomic reads. We also put contig and mRNA alignments in BAMs, which sometimes span >100kb. In that case, binning index will play an important role.
Thank you for explanation! However, the definition of chunk is still not clear. It seems a chunk is a smaller unit than bin but according to your description, the chunk may not necessary be within a bin. So what kind of criteria do you use to determine the interval for a chunk? If a chunk cross the boundary between two bins, do you keep the chunk in the bin where its start coordinate is contained?

I was making an assumption that samtools load the entire BAM index into primary memory in my previous post. I have not fully read and understood the source code of samtools so maybe I was wrong. But this does seem to be the case since the ".bai" file is so small. If this assumption is true, then I still do not understand why binning is necessary. With all the file offsets of tiling windows and trunks in primary memory, you can simply create two linear indices for the begin of the first alignment and the end of the last alignment for tiling windows or chunks. Then you can determine the start and end file offsets that enclose any query region and read through it. This has nothing to do with the length of an alignment. Am I right? Can you please offer some insight? Thx!
asiangg is offline   Reply With Quote
Old 08-23-2013, 11:33 AM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Say you have 3 reads, 100bp read1 at pos 1000, 100kb read2 at 1001 and 100bp read3 at 1002. Read1 and read3 are in the same bin $a, but read2 is in the parent bin of $a. The straightforward implementation will put two chunks in bin $a, the first chunk for read1 and the second for read2. Samtools is likely to use one chunk containing all the 3 reads. When you pull reads with bin $a, you go through the chunk and exclude read2 that has a different bin. It does not matter if a chunk contain a few more reads with different bins. By merging chunks close to each other, you get a smaller index file.

UCSC invented the binning index primarily because genes vary greatly in lengths which may make linear index inefficient. That is exactly the reason why BAM also uses binning index (and also why I talked about alignment lengths). In the extreme case, suppose you have a false RNA-seq alignment that spans the entire chromosome. Once there is a single alignment like this in your BAM, linear index will fail completely for that chr, as you always need to start from the beginning of the chromosome to seek to a position. With the binning index, such a problematic alignment only has limited effect.
lh3 is offline   Reply With Quote
Old 08-23-2013, 05:56 PM   #5
asiangg
Member
 
Location: New York

Join Date: Dec 2008
Posts: 44
Default

Quote:
Originally Posted by lh3 View Post
Say you have 3 reads, 100bp read1 at pos 1000, 100kb read2 at 1001 and 100bp read3 at 1002. Read1 and read3 are in the same bin $a, but read2 is in the parent bin of $a. The straightforward implementation will put two chunks in bin $a, the first chunk for read1 and the second for read2. Samtools is likely to use one chunk containing all the 3 reads. When you pull reads with bin $a, you go through the chunk and exclude read2 that has a different bin. It does not matter if a chunk contain a few more reads with different bins. By merging chunks close to each other, you get a smaller index file.

UCSC invented the binning index primarily because genes vary greatly in lengths which may make linear index inefficient. That is exactly the reason why BAM also uses binning index (and also why I talked about alignment lengths). In the extreme case, suppose you have a false RNA-seq alignment that spans the entire chromosome. Once there is a single alignment like this in your BAM, linear index will fail completely for that chr, as you always need to start from the beginning of the chromosome to seek to a position. With the binning index, such a problematic alignment only has limited effect.
Thank you again! I think I start to appreciate the use of binning. I forgot to consider the alignment that may spend a long distance on the genome. And I can also understand how the linear index acts to prevent reading those chunks whose end file offsets are smaller than the rbeg (first alignment).

However, I still have a concern. What about those chunks whose start file offsets are larger than rend (last alignment). You do not need to read those chunks either. But it seems BAM indexing only considers rbeg in linear index. Can you explain? Thx!
asiangg is offline   Reply With Quote
Old 08-26-2013, 09:28 AM   #6
asiangg
Member
 
Location: New York

Join Date: Dec 2008
Posts: 44
Default

Is the last question going to be answered? Or this is simply ignored in BAM indexing? I think the chance for a long alignment to be before the query start is as high as be after the query end. Does that mean quite a bit of disk seeks and reads are wasted?
asiangg is offline   Reply With Quote
Reply

Tags
bam file, bam index, binning

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:06 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO