Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Duplicates in samtools view when using ranges

    Well, I continue to ask newbie questions, but hopefully they're getting at least a little more interesting over time....

    The docs for "samtools view" say quite clearly that "An alignment may be given multiple times if it is overlapping several regions", and indeed that's true. Even when the regions don't themselves overlap, as in this example:

    samtools view -b full.bam chr1:23995228-23995342 chr1:23995353-23995376 > filtered.bam

    ...if the ranges are close enough together, the space between them will be spanned by certain reads. Any read that touches both ranges gets included twice.

    In this particular case, that's 25 duplicated reads. The entire sorted block of 25 reads gets included twice, which means that the resulting BAM has out-of-order reads. If it's out of order, it can't be indexed until it's resorted, which takes as much time as the filtering did in the first place. Even once you re-sort, you've still got duplicates. Alas, rmdup doesn't seem to have a mode for removing completely identical reads!

    So how do you eliminate true read duplicates in your BAMs? I could "just" stream it through a UNIX-level command, but it would have to go through SAM and back to BAM, which requires a reference list and, presumably, a lot more time.

  • #2
    Originally posted by jdrum00 View Post
    Well, I continue to ask newbie questions, but hopefully they're getting at least a little more interesting over time....

    The docs for "samtools view" say quite clearly that "An alignment may be given multiple times if it is overlapping several regions", and indeed that's true. Even when the regions don't themselves overlap, as in this example:

    samtools view -b full.bam chr1:23995228-23995342 chr1:23995353-23995376 > filtered.bam

    ...if the ranges are close enough together, the space between them will be spanned by certain reads. Any read that touches both ranges gets included twice.

    In this particular case, that's 25 duplicated reads. The entire sorted block of 25 reads gets included twice, which means that the resulting BAM has out-of-order reads. If it's out of order, it can't be indexed until it's resorted, which takes as much time as the filtering did in the first place. Even once you re-sort, you've still got duplicates. Alas, rmdup doesn't seem to have a mode for removing completely identical reads!

    So how do you eliminate true read duplicates in your BAMs? I could "just" stream it through a UNIX-level command, but it would have to go through SAM and back to BAM, which requires a reference list and, presumably, a lot more time.
    A solution would be to write a program that reads in BAMs using the samtools library. If you assume the ranges are non-overlapping and in ascending order, you can store the chr/pos of the previous alignment outputted by the above and then discard any with a smaller chr/pos than the previous, otherwise update the previous. For an example in C (you can use the Java library, Perl library, python library, etc):

    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <zlib.h>
    
    // Change these to point to your the samtools source files
    #include "../samtools/bam.h"
    #include "../samtools/sam.h"
    
    #define Name "dbamtest"
    
    int main(int argc, char *argv[])
    {
        samfile_t *fp_in = NULL, *fp_out = NULL;
        bam1_t *b=NULL;
    
        if(2 == argc) {
            fp_in = samopen(argv[1], "rb", 0); 
            if(NULL == fp_in) {
                perror("Could not open file for reading");
                return 1;
            }   
            fp_out = samopen("-", "wb", fp_in->header);
            if(NULL == fp_out) {
                perror("Could not open stdout for writing");
                return 1;
            }
    
            int32_t prev_tid=-1, prev_pos=-1;
            b = bam_init1();
            while(0 < samread(fp_in, b)) {
                if(prev_tid < b->core.tid || (prev_tid == b->core.tid && prev_pos <= b->core.pos)) {
                    if(samwrite(fp_out, b) <= 0) {
                        perror("Could not write sam entry");
                        return 1;
                    }
                    prev_tid = b->core.tid;
                    prev_pos = b->core.pos;
                }
     // Destroy bam
                bam_destroy1(b);
                // Reinitialize
                b = bam_init1();
            }
            // Free
            bam_destroy1(b);
    
            /* Close the file */
            samclose(fp_in);
            samclose(fp_out);
        }
        else {
            fprintf(stderr, "Usage: %s [OPTIONS]\n", Name);
            fprintf(stderr, "\t<bam file>\n");
        }
    
        return 0;
    }
    You will have to compile against the samtools source code (here's a snippet from my Makefile.am):
    Code:
                            
                            ../samtools/bam_import.c \
                            ../samtools/bam_aux.c \
                            ../samtools/bam_pileup.c \
                            ../samtools/bam.c ../samtools/bam.h \
                            ../samtools/sam.c ../samtools/sam.h \
                            ../samtools/faidx.c ../samtools/faidx.h \
                            ../samtools/razf.c ../samtools/razf.h \
                            ../samtools/bgzf.c    ../samtools/bgzf.h \
                            ../samtools/kstring.c ../samtools/kstring.h \
                            ../samtools/bam_color.c
    Note: The above code has not been tested and there is no warranty either explicit or implied.

    Comment


    • #3
      Originally posted by nilshomer View Post
      A solution would be to write a program that reads in BAMs using the samtools library.
      Wow; yes, that's definitely taking it to the next level! I've seen that there are Python bindings; I'll give that a shot, using your code as a guide.

      Originally posted by nilshomer View Post
      Note: The above code has not been tested and there is no warranty either explicit or implied.
      Understood. I'll post my version, if I manage to make it work, to see if I can help the next guy. Thanks a bunch for the pointer; I have a feeling I'm going to need this sort of thing more and more, so it's probably good to get to know the libraries sooner rather than later.

      Comment


      • #4
        The simpler way is to merge adjacent regions close to each other. This will grab a little bit more reads in the output, but I believe this does not matter too much.

        Comment


        • #5
          In addition, here shows the preferred way to link against samtools:

          Download SAM tools for free. SAM (Sequence Alignment/Map) is a flexible generic format for storing nucleotide sequence alignment. SAMtools provide efficient utilities on manipulating alignments in the SAM format.

          Comment


          • #6
            Originally posted by lh3 View Post
            The simpler way is to merge adjacent regions close to each other. This will grab a little bit more reads in the output, but I believe this does not matter too much.
            I started to do this, but I wasn't sure how rigorous I needed to be with the data, and the person I would need to ask is out of town.

            Though, actually.... If two ranges are close enough to be spanned by a read, then there's no possible read that could fall between them and be considered outside the set of ranges.

            So, closing gaps of the read length or less between ranges can't actually introduce extraneous reads, only eliminate duplicates. Thanks for helping me think that through!

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Essential Discoveries and Tools in Epitranscriptomics
              by seqadmin




              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
              04-22-2024, 07:01 AM
            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 11:49 AM
            0 responses
            15 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-24-2024, 08:47 AM
            0 responses
            16 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            61 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            60 views
            0 likes
            Last Post seqadmin  
            Working...
            X