Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Can anyone explain BAM indexing algorithm to me?

    I have been trying to understand how the BAM index works by reading the SAM specification (v1.4, the newest version) and even samtools' source code. However, I failed to understand it fully. I'm posting here and hopefully somebody may clear my mind.

    The BAM index contains two types of indices: bin index and linear index. The bin size goes from 2^29 to 2^14, so the smallest bin represents a 16kb region on the genome. While the linear index contains the file offsets for each 16kb tiling window region on the genome, which coincides with the smallest bin size.

    Inside each bin, there is also a smaller unit called chunk whose definition is missing in SAM specification. Can anybody explain to me how the chunk is defined?

    It seems the BAM index uses a strategy to cluster short reads whose start coordinates are close to each other. Therefore, it does not have to index each read but rather a 16kb tiling window or a chunk. This results in a very small index file: most of my ".bai" files are smaller than 10MB. Therefore, we can comfortably hold the entire indexing info in primary memory.

    Besides that, I really don't understand why the binning strategy is required here and how it compensate the linear index. Given a query region, we can easily figure out the start and end file offsets of the tiling windows, read all the data into primary memory and figure out the rest in memory. This can work because we can always assume the BAM file is sorted according to the begin coordinates. (We need two linear indices for this to work, one for begin coordinate of first and one for end coordinate of the last alignment)

    We should note that the BAM index is different from the UCSC genome browser index. In UCSC GB, the index is built on the genomic features such as genes. The index is potentially very large and may reside on secondary storage. Therefore, using the binning strategy can help to shrink the search scope rapidly, resulting in less disk seeks and reads. However, for BAM index, the binning really looks like an overkill.

    Can anyone explain?

    Thanks!

  • #2
    Each "bin" is associated with one or more chunks. A chunk is an interval in BAM. Given a chunk with bin $b, most reads in this interval have the same bin $b. None of reads outside the interval have bin $b. By going through the chunks in the bin $b, you will get all the reads with bin $b.

    UCSC is the first using the binning strategy, but it does not have the concept of "index file". Its index is just bare mysql index, which is by design sitting on disk not in memory. If we migrate the BAM-like index files for UCSC tables, it will be much smaller than the BAM index because UCSC uses fewer bins. I have not read the bigwig/bigbed paper about their index.

    Binning index is not overkilling. You cannot assume BAM always keep short genomic reads. We also put contig and mRNA alignments in BAMs, which sometimes span >100kb. In that case, binning index will play an important role.

    Comment


    • #3
      Originally posted by lh3 View Post
      Each "bin" is associated with one or more chunks. A chunk is an interval in BAM. Given a chunk with bin $b, most reads in this interval have the same bin $b. None of reads outside the interval have bin $b. By going through the chunks in the bin $b, you will get all the reads with bin $b.

      UCSC is the first using the binning strategy, but it does not have the concept of "index file". Its index is just bare mysql index, which is by design sitting on disk not in memory. If we migrate the BAM-like index files for UCSC tables, it will be much smaller than the BAM index because UCSC uses fewer bins. I have not read the bigwig/bigbed paper about their index.

      Binning index is not overkilling. You cannot assume BAM always keep short genomic reads. We also put contig and mRNA alignments in BAMs, which sometimes span >100kb. In that case, binning index will play an important role.
      Thank you for explanation! However, the definition of chunk is still not clear. It seems a chunk is a smaller unit than bin but according to your description, the chunk may not necessary be within a bin. So what kind of criteria do you use to determine the interval for a chunk? If a chunk cross the boundary between two bins, do you keep the chunk in the bin where its start coordinate is contained?

      I was making an assumption that samtools load the entire BAM index into primary memory in my previous post. I have not fully read and understood the source code of samtools so maybe I was wrong. But this does seem to be the case since the ".bai" file is so small. If this assumption is true, then I still do not understand why binning is necessary. With all the file offsets of tiling windows and trunks in primary memory, you can simply create two linear indices for the begin of the first alignment and the end of the last alignment for tiling windows or chunks. Then you can determine the start and end file offsets that enclose any query region and read through it. This has nothing to do with the length of an alignment. Am I right? Can you please offer some insight? Thx!

      Comment


      • #4
        Say you have 3 reads, 100bp read1 at pos 1000, 100kb read2 at 1001 and 100bp read3 at 1002. Read1 and read3 are in the same bin $a, but read2 is in the parent bin of $a. The straightforward implementation will put two chunks in bin $a, the first chunk for read1 and the second for read2. Samtools is likely to use one chunk containing all the 3 reads. When you pull reads with bin $a, you go through the chunk and exclude read2 that has a different bin. It does not matter if a chunk contain a few more reads with different bins. By merging chunks close to each other, you get a smaller index file.

        UCSC invented the binning index primarily because genes vary greatly in lengths which may make linear index inefficient. That is exactly the reason why BAM also uses binning index (and also why I talked about alignment lengths). In the extreme case, suppose you have a false RNA-seq alignment that spans the entire chromosome. Once there is a single alignment like this in your BAM, linear index will fail completely for that chr, as you always need to start from the beginning of the chromosome to seek to a position. With the binning index, such a problematic alignment only has limited effect.

        Comment


        • #5
          Originally posted by lh3 View Post
          Say you have 3 reads, 100bp read1 at pos 1000, 100kb read2 at 1001 and 100bp read3 at 1002. Read1 and read3 are in the same bin $a, but read2 is in the parent bin of $a. The straightforward implementation will put two chunks in bin $a, the first chunk for read1 and the second for read2. Samtools is likely to use one chunk containing all the 3 reads. When you pull reads with bin $a, you go through the chunk and exclude read2 that has a different bin. It does not matter if a chunk contain a few more reads with different bins. By merging chunks close to each other, you get a smaller index file.

          UCSC invented the binning index primarily because genes vary greatly in lengths which may make linear index inefficient. That is exactly the reason why BAM also uses binning index (and also why I talked about alignment lengths). In the extreme case, suppose you have a false RNA-seq alignment that spans the entire chromosome. Once there is a single alignment like this in your BAM, linear index will fail completely for that chr, as you always need to start from the beginning of the chromosome to seek to a position. With the binning index, such a problematic alignment only has limited effect.
          Thank you again! I think I start to appreciate the use of binning. I forgot to consider the alignment that may spend a long distance on the genome. And I can also understand how the linear index acts to prevent reading those chunks whose end file offsets are smaller than the rbeg (first alignment).

          However, I still have a concern. What about those chunks whose start file offsets are larger than rend (last alignment). You do not need to read those chunks either. But it seems BAM indexing only considers rbeg in linear index. Can you explain? Thx!

          Comment


          • #6
            Is the last question going to be answered? Or this is simply ignored in BAM indexing? I think the chance for a long alignment to be before the query start is as high as be after the query end. Does that mean quite a bit of disk seeks and reads are wasted?

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 06:37 PM
            0 responses
            8 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 06:07 PM
            0 responses
            8 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            49 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            67 views
            0 likes
            Last Post seqadmin  
            Working...
            X