Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Comparing two BAM files using SAMtools

    Hello, SEQanswers.

    This is my first time posting on this forum, I've been lurking for sometime now. I am currently attempting to compare specific chromosomes from different BAM files using the following command.
    Code:
    samtools view <file.bam>
    I'm curious as to what the .bai files are for also, and I'm unsure if I need these for accuracy purposes. In the end, all I really want to accomplish is comparing and view two different files simultaneously. Any help with this would be greatly appreciated.

  • #2
    The BAI files are optional indexes for BAM files used for efficient random access of coordinate sorted BAM files (e.g. Show me all the reads mapped to region 100 to 200 of chr7 is done via the BAI file to avoid scanning the entire BAM file).

    Be careful about the sorting when comparing two SAM or BAM files.

    Comment


    • #3
      You need to tell us what you mean by compare.

      Comment


      • #4
        Recommending Samscope

        I think I might have a toy for you! Go check out Samscope. You can then open both bam files with a command like:

        Code:
        samscope foo.bam bar.bam
        First you'll be looking at coverage data from foo.bam. Then hit w to open up a second window, which will be showing coverage data from bar.bam by default. Scroll and zoom and both windows will stay synchronized.

        Depending on what features you want to view and compare, you may or may not need the .bai index files. Samtools tview requires .bai indexes, and if you want to see individual reads, so does samscope. If you just want to see aggregate features like coverage, samscope will work fine without indexes.

        The .bai files are simply indexes of the .bam files that say basically "reads at chr1:4096 start at byte 18010 of the .bam file!" kind of thing. This allows finding all reads that map to a given position without plowing through the whole bam file searching. Note that the .bam file has to be sorted first for indexes to be created ("samtools sort foo.bam foo.sorted" will do that. Also, samscope has scripts to handle all that too).

        If you're hooked on samtools, you may want to consider the "samtools pileup" or "samtools tview" commands, but this can be hard to read/view for large regions. Tools like IGV or Tablet might also help.

        Comment


        • #5
          Originally posted by pbluescript View Post
          You need to tell us what you mean by compare.
          I apologize, like I said I'm new here. Okay, my boss (I'm a freshman at a University) asked me to learn how to use samtools to 'compare' two .bam files and analyze the indels (inserstions/deletions) of those files. I have the .bam files downloaded with their corresponding .bai files. All I'd really like to get accomplished is use something like the ./samtools view file.bam chr1:100-200 file2.bam chr1:100-200. Is there a command line to accomplish something similar to this?

          Comment


          • #6
            Originally posted by khoops66 View Post
            I apologize, like I said I'm new here. Okay, my boss (I'm a freshman at a University) asked me to learn how to use samtools to 'compare' two .bam files and analyze the indels (inserstions/deletions) of those files. I have the .bam files downloaded with their corresponding .bai files. All I'd really like to get accomplished is use something like the ./samtools view file.bam chr1:100-200 file2.bam chr1:100-200. Is there a command line to accomplish something similar to this?
            Use samtools mpileup with the -r option to specify your region. This will give you the indels and snps in that region of interest.

            From there it depends on exactly how you want to compare the two bam files. If you want to see which variants are in one bam and not the other, then you will have to run mpileup individually on each bam, then compare the output using something like bedtools.

            Comment


            • #7
              Originally posted by chadn737 View Post
              Use samtools mpileup with the -r option to specify your region. This will give you the indels and snps in that region of interest.

              From there it depends on exactly how you want to compare the two bam files. If you want to see which variants are in one bam and not the other, then you will have to run mpileup individually on each bam, then compare the output using something like bedtools.
              This will also allow me to compare the indels of two files at the same time? I'd like to look at chr1:554200-554700 of both of the following bam files for indels.

              Comment


              • #8
                Originally posted by khoops66 View Post
                This will also allow me to compare the indels of two files at the same time? I'd like to look at chr1:554200-554700 of both of the following bam files for indels.
                Specify what you mean by compare? Do you want to find indels in common between the two samples? Do you want to find indels that differ between the two samples? Do you want to simply find all indels in both files?

                Specify very clearly what you want to accomplish, exactly what sort of output you want, and its much easier.

                Comment


                • #9
                  Yeah, like another poster said, you can do this with samtools mpileup. Though you can analyze both samples together, along with the reference fasta. SAMTools mpileup will take multiple .bams as input, and the output vcf you eventually make will have one column for each sample. You can also tell samtools mpileup to only make its files for that region of that chromosome.

                  All this assumes that the .bam was made with software that will actually call indels.

                  Though with such a small project, you could also eyeball. You can use the Broad's IGV to just look at the .bams in this region. You'll see discrepancies between the read and the reference, and you can load multiple .bams at once. (espeically if you use samtools view to trim them down to just the region you want).

                  Comment


                  • #10
                    Originally posted by chadn737 View Post
                    Specify what you mean by compare? Do you want to find indels in common between the two samples? Do you want to find indels that differ between the two samples? Do you want to simply find all indels in both files?

                    Specify very clearly what you want to accomplish, exactly what sort of output you want, and its much easier.
                    I'd like to find the similarities and differences between the two files. And since I only need to use chr1 of the two files I'm attempting to cut these down in size as well.

                    Comment


                    • #11
                      Originally posted by khoops66 View Post
                      I'd like to find the similarities and differences between the two files. And since I only need to use chr1 of the two files I'm attempting to cut these down in size as well.
                      Do you want to get coverage, and error rate beween the two .bams too?

                      If you want to call SNPs and indels, you'll also need the reference sequence that was used to align the files. The exact file is best, if possible, because the names of the chromosomes need to be a match to the names in your .bam

                      You can use 'samtools view' to winnow down your .bams to just the region you want. Or, you can tell 'samtools mpileup' what region to look at.

                      Have you even looked at the command line options for those programs?

                      You will get much better responses here if you demonstrate that you've spent 5 minutes trying to do this on your own. You do that by saying "This is the command line I tried, this is what the file I generated looks like, these are the things I looked at to try and figure out why it's not working the way I want it to". Basically, you need to demonstrate that you have spent at least as much time really trying to work it out as you expect some stranger to spend helping you.

                      Even if people wanted to hold your hand and tell you exactly what to do, they can't do that if you don't say exactly what you are trying to do. "I want to find all the similarities and differences" is too vague. For all we know, you are looking for CNVs, or trying to calculate capture efficiency. "I'm looking for SNPs and small indels" is a lot more precise.

                      Comment


                      • #12
                        Originally posted by swbarnes2 View Post
                        You will get much better responses here if you demonstrate that you've spent 5 minutes trying to do this on your own. You do that by saying "This is the command line I tried, this is what the file I generated looks like, these are the things I looked at to try and figure out why it's not working the way I want it to". Basically, you need to demonstrate that you have spent at least as much time really trying to work it out as you expect some stranger to spend helping you.

                        Even if people wanted to hold your hand and tell you exactly what to do, they can't do that if you don't say exactly what you are trying to do. "I want to find all the similarities and differences" is too vague. For all we know, you are looking for CNVs, or trying to calculate capture efficiency. "I'm looking for SNPs and small indels" is a lot more precise.

                        Well, I'm a freshman CS major and I'm doing a work study. I don't mean to plead ignorance, and I apologize for being vague earlier. I am just trying to get this all figured out. I'm reading research papers and documentation for different programs that I've never used. I truly appreciate your help, though. Thank you.

                        I have a work meeting shortly and hope to get a better understanding of what it is we're looking for in specific from these files. Again, thank you for your patience, I don't mean to be a bother.

                        Comment


                        • #13
                          Parallel discussion (didn't realize reddit was turning into a place for this, hmmn):

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM
                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 06:37 PM
                          0 responses
                          8 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, Yesterday, 06:07 PM
                          0 responses
                          8 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-22-2024, 10:03 AM
                          0 responses
                          49 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-21-2024, 07:32 AM
                          0 responses
                          66 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X