  • Compare two BAM files

    Hello everyone,

    I have some questions about BAM files.
    I used samtools to convert a BAM file to CRAM, then used samtools to convert the CRAM file back to BAM. The new BAM file is larger than the old one. Why did the file size increase? I would also like to compare the two BAM files to see whether they differ. Thank you.

  • #2
    Have you tried comparing md5sums (after freshly sorting the files)?

    Are both versions on the same storage partition?



    • #3
      Originally posted by GenoMax
      Have you tried comparing md5sums (after freshly sorting the files)?

      Are both versions on the same storage partition?

      I hadn't checked the md5sums when I posted the question, but the size clearly increased.
      For example:
      Old BAM size (bytes) | New BAM size (bytes) | Increase (bytes) | Increase (%)
      8398197 | 8442685 | 44488 | 0.53
      1595537237 | 1605593870 | 10056633 | 0.63
      3944613 | 4031703 | 87090 | 2.21
      1975987646 | 1989339186 | 13351540 | 0.68

      Following your suggestion, I checked the md5sums and they are different. The old and new BAM files are in the same directory.



      • #4
        Try piping both as SAM to md5sum.
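
One way to do this from Python is sketched below. It assumes `samtools` is on the PATH and hashes only the alignment records: `samtools view` without `-h` omits the header, so header lines that may legitimately differ between the two files (such as @PG) do not affect the checksum. The function names are illustrative, not part of samtools.

```python
import hashlib
import subprocess


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sam_md5(path, samtools="samtools"):
    """MD5 of the alignment records of a BAM file, decoded to SAM text.

    Runs `samtools view <path>` (no header) and hashes its stdout in
    1 MiB chunks, so the whole file is never held in memory.
    """
    proc = subprocess.Popen([samtools, "view", path], stdout=subprocess.PIPE)
    with proc.stdout:
        checksum = md5_of_chunks(iter(lambda: proc.stdout.read(1 << 20), b""))
    if proc.wait() != 0:
        raise RuntimeError(f"samtools view failed for {path}")
    return checksum
```

Calling `sam_md5` on the old and the new BAM file and comparing the two strings tells you whether the decoded records match, independent of how each file was compressed.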



        • #5
          If I understand correctly, the md5sums would differ anyway, even if the two BAM files were the same size, because they were created by different programs?



          • #6
            Originally posted by gandalf886
            If I understand correctly, the md5sums would differ anyway, even if the two BAM files were the same size, because they were created by different programs?

            Do you have any idea why the two BAM files are different? I found a blog post that mentions something similar about BAM file size: http://davetang.org/muse/2014/09/26/bam-to-cram/



            • #7
              As Devon said, compare them as SAM files.

              Convert both of them to SAM and compare them. You can't get trustworthy results from comparing compressed files, since the same records can compress to different bytes. If they both yield identical SAM files, they are identical. If not, they differ.
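
If the SAM files do differ, it helps to know where. Below is a minimal sketch that walks two sequences of records in parallel and reports the first mismatch. It assumes both files are in the same order; the function name is made up for this example.

```python
from itertools import zip_longest


def first_difference(records_a, records_b):
    """Return (record_number, a, b) for the first mismatch, or None if identical.

    Record numbers start at 1. If one input is shorter, the missing
    record is reported as None on that side.
    """
    pairs = zip_longest(records_a, records_b)
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return number, a, b
    return None
```

Any two iterables of lines work, including two open SAM text files, so only one pair of records is held in memory at a time.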



              • #8
                There is a compare_sam (not bam, but that's a trivial conversion) script in the htslib test harness.



                It's pretty crufty and not really designed for use outside the test system, but if you're familiar with perl then you should be able to drive the different options easily enough.

                As for why the size is larger, it may come down to NM and MD auxiliary tags. These are not stored verbatim in CRAM, but worked out on-the-fly from the sequence delta to the reference. It doesn't store a flag to indicate whether they were present in the original input data, so by default you'll get them whether you wanted them or not.
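
To check whether NM and MD are the only difference, one option is to compare records with those tags removed. The sketch below is not the compare_sam script. It relies only on the SAM text layout: 11 mandatory tab-separated fields, followed by optional TAG:TYPE:VALUE fields. The helper names are hypothetical.

```python
def strip_tags(record, tags=("NM", "MD")):
    """Drop the named optional tags from one tab-separated SAM record."""
    fields = record.rstrip("\n").split("\t")
    mandatory, optional = fields[:11], fields[11:]
    kept = [f for f in optional if f.split(":", 1)[0] not in tags]
    return "\t".join(mandatory + kept)


def extra_tags(record_a, record_b):
    """Return the sorted tag names present in record_b but not in record_a."""
    def names(record):
        optional = record.rstrip("\n").split("\t")[11:]
        return {f.split(":", 1)[0] for f in optional}
    return sorted(names(record_b) - names(record_a))
```

If every pair of records is equal after `strip_tags`, the round trip through CRAM changed nothing except the regenerated tags.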



                • #9
                  Thank you jkbonfield and everyone. I used the script jkbonfield suggested to compare the two SAM files. The one generated from CRAM had extra MD tags, and that is what increased the file size.

                  Originally posted by jkbonfield
                  There is a compare_sam (not bam, but that's a trivial conversion) script in the htslib test harness.

                  It's pretty crufty and not really designed for use outside the test system, but if you're familiar with perl then you should be able to drive the different options easily enough.

                  As for why the size is larger, it may come down to NM and MD auxiliary tags. These are not stored verbatim in CRAM, but worked out on-the-fly from the sequence delta to the reference. It doesn't store a flag to indicate whether they were present in the original input data, so by default you'll get them whether you wanted them or not.
