Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Compare mapped reads in SAM/BAM files

    Hello,

    I have two different SAM/BAM files for a sequenced library (Illumina library sequenced paired-end) which were generated by mapping the reads (with Bowtie2) to two different reference genomes. I would like to see how many mapped reads (can be mapped concordant or discordant, doesn't matter) are shared between the two alignment files and how many mapped reads are unique to one file or the other.

    Is there a tool available to do this, possibly in samtools? If not, does anyone have some suggestions to get me started on the right track?

    Thank you!

  • #2
    If you can get the read names into two R character vectors, you could use the various set operations like setdiff.
    There's probably a much more elegant (and faster) way to do this, but quick and dirty:

    library("GenomicAlignments");
    reads_alnmt1 <- names(readGAlignments(bam_file1, use.names=TRUE));
    reads_alnmt2 <- names(readGAlignments(bam_file2, use.names=TRUE));

    unique1 <- setdiff(reads_alnmt1, reads_alnmt2);
    unique2 <- setdiff(reads_alnmt2, reads_alnmt1);

    Comment


    • #3
      It's possible to do so with BBTools like this:

      reformat.sh in=file1.sam out=mapped1.sam mappedonly
      reformat.sh in=file2.sam out=mapped2.sam mappedonly


      That gets you the mapped reads only. Then:

      filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t

      ...which gets you the set intersection;

      filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
      filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f


      ...which get you the set subtractions.

      Comment


      • #4
        This is a simple enough sort of thing to write in python with pysam or C with htslib that I'd personally just do that. For example, here is a small C program that will accept two BAM files and tell you which alignments have different mapping positions or CIGAR strings. It'd be simple to modify that to compare only mapped vs unmapped and keep a count.

        Note: That's not production code, just an ad hoc solution with no error checking.
        Last edited by dpryan; 02-11-2015, 12:18 AM.

        Comment


        • #5
          Originally posted by Brian Bushnell View Post
          It's possible to do so with BBTools like this:

          reformat.sh in=file1.sam out=mapped1.sam mappedonly
          reformat.sh in=file2.sam out=mapped2.sam mappedonly


          That gets you the mapped reads only. Then:

          filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t

          ...which gets you the set intersection;

          filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
          filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f


          ...which get you the set subtractions.
          Hi Brian,

          Thanks again for the help. I am currently running the program now and it gave me an error message:

          Code:
          java -ea -Xmx200m -cp /home/kianians/millerm2/scripts/bbmap/current/ jgi.ReformatReads in=1_trimq10_100bpminlen_PE_cs_mt.sam out=1_trimq10_100bpminlen_PE_cs_mt_mapped.sam mappedonly
          Executing jgi.ReformatReads [in=1_trimq10_100bpminlen_PE_cs_mt.sam, out=1_trimq10_100bpminlen_PE_cs_mt_mapped.sam, mappedonly]
          
          Input is being processed as unpaired
          Exception in thread "Thread-3" java.lang.AssertionError: Sam format cannot be used as input to this program when no genome build is loaded.
          Please index the reference first and rerun with e.g. 'build=1', or use a different input format.
          	at stream.SamLine.<init>(SamLine.java:138)
          	at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:117)
          	at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:19)
          The program is still running so I'm not sure if it will complete normally or not. An output file has been created but it is 0 bytes currently.

          Comment


          • #6
            Hi Marisa,

            I think you're using an old version; Reformat did not use to be able to input and output sam files, but it can now. Just kill that and rerun it with the latest version (34.46).

            -Brian
            Last edited by Brian Bushnell; 02-11-2015, 10:23 AM.

            Comment


            • #7
              Originally posted by Brian Bushnell View Post
              Hi Marisa,

              I think you're using an old version; Reformat did not use to be able to input and output sam files, but it can now. Just kill that and rerun it with the latest version (34.46).

              -Brian
              Hi Brian,

              I downloaded the new version and I was able to successfully filter out the mapped reads only.

              When I try to get the intersection between the two files I am getting this error:

              Code:
              Exception in thread "main" java.lang.AssertionError: Tried to make a SamLine from a header: @HD
              	at stream.SamLine.<init>(SamLine.java:420)
              	at stream.SamLine.<init>(SamLine.java:49)
              	at driver.FilterReadsByName.<init>(FilterReadsByName.java:144)
              	at driver.FilterReadsByName.main(FilterReadsByName.java:41)
              Can I fix this by removing the headers from the sam files containing only mapped reads?

              Thank you,
              Marisa

              Comment


              • #8
                Wow, I'm just full of fail! Yes, that would work. I will fix that bug in the next release.

                Comment


                • #9
                  Originally posted by Brian Bushnell View Post
                  Wow, I'm just full of fail! Yes, that would work. I will fix that bug in the next release.
                  Sorry to bug you with another bug! I thought filterbyname.sh was working fine but I realized made an error in my command. Due to a typing error I was using a file that doesn't exist as my "names" file for filterbyname.sh. I realized I can input any made up file name for the "names" file, but if I put in a made up or wrong file name for the "in" file I will get an error message saying no such file exists.

                  When I run the command using both the "in" and "names" sam files with no headers, I am now getting the following error:

                  Code:
                   java -ea -Xmx5534m -cp /home/kianians/millerm2/scripts/bbmap/current/ driver.FilterReadsByName in=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/mitochondria/chinese_spring/2_cs_mt_mapped_noheader.sam names=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/chloroplast/chinese_spring/2_cs_chlr_mapped_noheader.sam out=2_cs_mt_chlr_shared.sam include=t
                  Executing driver.FilterReadsByName [in=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/mitochondria/chinese_spring/2_cs_mt_mapped_noheader.sam, names=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/chloroplast/chinese_spring/2_cs_chlr_mapped_noheader.sam, out=2_cs_mt_chlr_shared.sam, include=t]
                  
                  Input is being processed as unpaired
                  Exception in thread "Thread-3" java.lang.AssertionError: Sam format cannot be used as input to this program when no genome build is loaded.
                  Please index the reference first and rerun with e.g. 'build=1', or use a different input format.
                  	at stream.SamLine.<init>(SamLine.java:93)
                  	at stream.ReadStreamByteWriter.writeSam(ReadStreamByteWriter.java:442)
                  	at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:86)
                  	at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:41)
                  	at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:27)
                  Exception in thread "main" java.lang.RuntimeException: Writing to a terminated thread.
                  	at stream.ConcurrentGenericReadOutputStream.write(ConcurrentGenericReadOutputStream.java:198)
                  	at stream.ConcurrentGenericReadOutputStream.addDisordered(ConcurrentGenericReadOutputStream.java:193)
                  	at stream.ConcurrentGenericReadOutputStream.add(ConcurrentGenericReadOutputStream.java:97)
                  	at driver.FilterReadsByName.process(FilterReadsByName.java:354)
                  	at driver.FilterReadsByName.main(FilterReadsByName.java:42)
                  Any suggestions? And thanks again for all the support you provide for your package!

                  Comment


                  • #10
                    OK, I've fixed that too, in v34.48 I had never actually run it before on sam files, just fasta and fastq, so thanks for finding those bugs!

                    FYI, the reason it does not complain if the filename is wrong for "names" is because if it is a filename, it will use the names in the file; if it is not a valid filename, it will use that as a literal list of names to filter against.

                    Comment


                    • #11
                      Originally posted by Brian Bushnell View Post
                      OK, I've fixed that too, in v34.48 I had never actually run it before on sam files, just fasta and fastq, so thanks for finding those bugs!

                      FYI, the reason it does not complain if the filename is wrong for "names" is because if it is a filename, it will use the names in the file; if it is not a valid filename, it will use that as a literal list of names to filter against.
                      Thank you! Just tested the new version and it's working now. And that makes sense about the "names" input. I will keep that in mind for the future.

                      Comment


                      • #12
                        Hello Brian,

                        Thanks for making this tool available. I am trying to use it to compare two sam files and find the reads that are unique to one of the bam files. However, I encountered the below error. Could you possibly point out the problem here?

                        Your help is highly appreciated.

                        Thank you,
                        SSC

                        Command:
                        bbmap/filterbyname.sh in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f

                        Console:
                        java -ea -Xmx55181m -cp /Align2MouseGenome/bbmap/current/ driver.FilterReadsByName in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f
                        Executing driver.FilterReadsByName [in=HumReads_BBv4.sam, names=MouReads_BBv4.sam, out=OnlyHumReads_BBv4.sam, include=f]

                        java.lang.AssertionError: 24056740, 24056739
                        at stream.SamLine.toRead(SamLine.java:1490)
                        at stream.SamLine.toRead(SamLine.java:1451)
                        at stream.SamReadInputStream.toReadList(SamReadInputStream.java:125)
                        at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:98)
                        at stream.SamReadInputStream.nextList(SamReadInputStream.java:81)
                        at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:667)
                        at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:646)
                        Time: 1900.461 seconds.
                        Reads Processed: 17281200 9.09k reads/sec
                        Bases Processed: 1732191375 0.91m bases/sec
                        Reads Out: 0
                        Bases Out: 0
                        Exception in thread "main" java.lang.RuntimeException: driver.FilterReadsByName terminated in an error state; the output may be corrupt.
                        at driver.FilterReadsByName.process(FilterReadsByName.java:363)
                        at driver.FilterReadsByName.main(FilterReadsByName.java:42)

                        Comment


                        • #13
                          @sschavan: It appears that you are trying to separate human/mouse data. While Brian responds to your specific question you can do the separation using BBSplit with original reads: http://seqanswers.com/forums/showthread.php?t=41288

                          Comment


                          • #14
                            Hello GenoMax,

                            Thank you for the suggestion, yes I am trying to separate Human and Mouse reads, I will look into BBSplit as well, however, I was wondering about the alignment stringency that is considered by BBSplit to call a read as mapped to Human or Mouse reference, and how much that would differ as compared to BWA/Stampy.

                            SSC

                            Comment


                            • #15
                              By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags "minid=0.76 maxindel=16k minhits=1". With those parameters it is extremely sensitive.

                              BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, it will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). But when a read or pair has two identically-scoring mapping locations, on different genomes, the behavior is controlled by the "ambig2" flag - "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps).

                              As for your specific error - that seems to be possibly related to a read with an invalid cigar string with more clipped bases than actual bases, or a read of zero length, or something like that. The program may have problems with secondary alignments in which the bases are suppressed (with a * symbol instead) - I have not extensively tested it on sam files, just fasta/fastq. I'll see if I can replicate and fix it (if it's a bug), and at least make it print out the actual line that causes the problem, since the current error message is not very useful. In the meantime, the error message appears to be harmless and you can skip it by disabling assertions with the -da flag:

                              bbmap/filterbyname.sh in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f -da

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              29 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X