Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Possible samtools view bug

    I'm using samtools to extract some regions from a BAM file, and in some situations, I don't get back all of the reads that overlap the region.

    For example:
    Looking for all reads that overlap the region 2L:1376266-1376270 from this bam file (61MB). Using this command:

    samtools view 801.2L.bam 2L:1376266-1376270

    Samtools finds these these 14 reads with the command. However, looking at the area around that region (here's a slightly bigger area), I can see several more reads that overlap this region that aren't printed out by samtools.

    The samtools view --region appears to work fine for most other regions in the file; only a small minority exhibit this behaviour.

    Is this a bug in samtools, or am I missing something?

  • #2
    I concur that with your sample file
    Code:
    samtools view 801.2L.bam 2L:1376266-1376270
    only returns 14 reads (using samtools 0.1.18).

    Could you quote one or two specific reads from that slightly bigger area which you think are missing?

    Comment


    • #3
      I was looking at this too and here is one read that gets excluded but seems it should not be excluded:

      USI-EAS376_0001:1:80:851:1058#0 147 2L 1376256 29 95M

      Comment


      • #4
        So that maps to position 1376256 (1 based SAM counting) with a CIGAR of 95M, so it ends at 1376256 + 95 - 1 = 1376350 (again, 1 based counting).

        That means USI-EAS376_0001:1:80:851:1058#0 spans 1376256-1376350, and so does appears to cover in the window you requested, 2L:1376266-1376270, in fact it fully contains that region.

        The FLAG is 0x1 + 0x2 + 0x10 + 0x80 = 0x93 = 147. That means multi-part/paired (0x1), pair properly mapped (0x2), partner on reverse strand (0x10), and second read in the pair (0x80). That seems fine.

        You seem to be right, something odd here...

        Comment


        • #5
          On the off chance it was a problem in the sorting, I tried this:
          Code:
          $ samtools sort 801.2L.bam 801.2L.resort
          $ samtools index 801.2L.resort.bam
          $ samtools view 801.2L.resort.bam 2L:1376266-1376270 | wc -l
          14
          No change. Then, in case there was a problem in the binary data, I tried doing BAM to SAM to BAM,

          Code:
          $ samtools view -h 801.2L.bam | samtools view -b -S - > 801.2L.rebam.bam
          [samopen] SAM header is present: 14 sequences.
          $ samtools index 801.2L.rebam.bam 
          $ samtools view 801.2L.rebam.bam 2L:1376266-1376270 | wc -l
                63
          $ samtools view 801.2L.rebam.bam 2L:1376266-1376270 | grep "USI-EAS376_0001:1:80:851:1058#0"
          USI-EAS376_0001:1:80:851:1058#0	147	2L	1376256	29	95M	=	1376143	-208	TTCTAGGAAATAAATGTATTTCTTCGTCTAATGCCCCACATTATCCGTCACCTTTCTTTGATTTATAGCTGGCTAATCAAAGGACTCAATCAGCC	##BGB..>@DGF?GEAA;@>=FA=FEB@FF3BFHHEH?HGFFGAGHEFEBDCEHHGDGD=GHEHFHHGHHF@@FFEHDHH55555BHEHHEFGBH	X0:i:1	X1:i:0	MD:Z:95	RG:Z:foo	XG:i:0	AM:i:29	NM:i:0	SM:i:29	XM:i:0	XO:i:0	XT:A:U
          That seems to have fixed something... where did you get this BAM file from? Perhaps the upstream tool is making a subtly broken BAM file?

          Comment


          • #6
            Originally posted by maubp View Post
            in case there was a problem in the binary data, I tried doing BAM to SAM to BAM,that seems to have fixed something.
            Maubp, good idea. I ran the same commands, gunzip'ed the resulting BAM files, and then compared the binary data inside. The only differences between the working and nonworking files was that the bin field was different in some reads, which would cause this problem. Thanks for your help!

            Comment


            • #7
              That makes sense - it was one of the possible explanations I had in mind, but I didn't bother to examine the binary data to check. Well done

              The SAM file format doesn't have a BIN field, it is automatically calculated from the alignment position when converting to BAM. The BIN field is used as part of the indexing, so if your original file has some bad BIN values, those reads would not appear correctly via the index.

              So again, where did the bad BAM file come from? It seems likely that the tool that created it is generating bad BIN values. This should be fixed to prevent other people suffering silent data loss.

              Perhaps also samtools could verify the BIN values as some point (e.g. in 'samtools index'), but I suspect that could make it too slow.

              Comment


              • #8
                The BAM file came from an unreleased development version of OpenGE. We have been developing faster multithreaded BAM IO code, and evidently still have a bug or two left to resolve. This branch hasn't been released, and won't be until more testing has been performed.

                It would be pretty easy to verify the bin number when indexing a new file- the calculations are pretty simple, and should be quick to execute. Implementing this should be a fairly simple patch for samtools. On the other hand, how often do you run into a BAM file with bad BIN numbers?

                Thanks again for your help!

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin


                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                  Yesterday, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                39 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                41 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                35 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                55 views
                0 likes
                Last Post seqadmin  
                Working...
                X