Seqanswers Leaderboard Ad

**nilshomer** · 12-17-2009, 10:12 AM

Originally posted by jdrum00 View Post

Well, I continue to ask newbie questions, but hopefully they're getting at least a little more interesting over time....

The docs for "samtools view" say quite clearly that "An alignment may be given multiple times if it is overlapping several regions", and indeed that's true. Even when the regions don't themselves overlap, as in this example:

samtools view -b full.bam chr1:23995228-23995342 chr1:23995353-23995376 > filtered.bam

...if the ranges are close enough together, the space between them will be spanned by certain reads. Any read that touches both ranges gets included twice.

In this particular case, that's 25 duplicated reads. The entire sorted block of 25 reads gets included twice, which means that the resulting BAM has out-of-order reads. If it's out of order, it can't be indexed until it's resorted, which takes as much time as the filtering did in the first place. Even once you re-sort, you've still got duplicates. Alas, rmdup doesn't seem to have a mode for removing completely identical reads!

So how do you eliminate true read duplicates in your BAMs? I could "just" stream it through a UNIX-level command, but it would have to go through SAM and back to BAM, which requires a reference list and, presumably, a lot more time.

A solution would be to write a program that reads in BAMs using the samtools library. If you assume the ranges are non-overlapping and in ascending order, you can store the chr/pos of the previous alignment outputted by the above and then discard any with a smaller chr/pos than the previous, otherwise update the previous. For an example in C (you can use the Java library, Perl library, python library, etc):

Code:

#include <stdlib.h>
#include <stdio.h>
#include <zlib.h>

// Change these to point to your the samtools source files
#include "../samtools/bam.h"
#include "../samtools/sam.h"

#define Name "dbamtest"

int main(int argc, char *argv[])
{
    samfile_t *fp_in = NULL, *fp_out = NULL;
    bam1_t *b=NULL;

    if(2 == argc) {
        fp_in = samopen(argv[1], "rb", 0); 
        if(NULL == fp_in) {
            perror("Could not open file for reading");
            return 1;
        }   
        fp_out = samopen("-", "wb", fp_in->header);
        if(NULL == fp_out) {
            perror("Could not open stdout for writing");
            return 1;
        }

        int32_t prev_tid=-1, prev_pos=-1;
        b = bam_init1();
        while(0 < samread(fp_in, b)) {
            if(prev_tid < b->core.tid || (prev_tid == b->core.tid && prev_pos <= b->core.pos)) {
                if(samwrite(fp_out, b) <= 0) {
                    perror("Could not write sam entry");
                    return 1;
                }
                prev_tid = b->core.tid;
                prev_pos = b->core.pos;
            }
 // Destroy bam
            bam_destroy1(b);
            // Reinitialize
            b = bam_init1();
        }
        // Free
        bam_destroy1(b);

        /* Close the file */
        samclose(fp_in);
        samclose(fp_out);
    }
    else {
        fprintf(stderr, "Usage: %s [OPTIONS]\n", Name);
        fprintf(stderr, "\t<bam file>\n");
    }

    return 0;
}

You will have to compile against the samtools source code (here's a snippet from my Makefile.am):

Code:

                        
                        ../samtools/bam_import.c \
                        ../samtools/bam_aux.c \
                        ../samtools/bam_pileup.c \
                        ../samtools/bam.c ../samtools/bam.h \
                        ../samtools/sam.c ../samtools/sam.h \
                        ../samtools/faidx.c ../samtools/faidx.h \
                        ../samtools/razf.c ../samtools/razf.h \
                        ../samtools/bgzf.c    ../samtools/bgzf.h \
                        ../samtools/kstring.c ../samtools/kstring.h \
                        ../samtools/bam_color.c

Note: The above code has not been tested and there is no warranty either explicit or implied.

**jdrum00** · 12-17-2009, 10:47 AM

Originally posted by nilshomer View Post

A solution would be to write a program that reads in BAMs using the samtools library.

Wow; yes, that's definitely taking it to the next level! I've seen that there are Python bindings; I'll give that a shot, using your code as a guide.

Originally posted by nilshomer View Post

Note: The above code has not been tested and there is no warranty either explicit or implied.

Understood. I'll post my version, if I manage to make it work, to see if I can help the next guy. Thanks a bunch for the pointer; I have a feeling I'm going to need this sort of thing more and more, so it's probably good to get to know the libraries sooner rather than later.

**lh3** · 12-17-2009, 11:52 AM

The simpler way is to merge adjacent regions close to each other. This will grab a little bit more reads in the output, but I believe this does not matter too much.

**lh3** · 12-17-2009, 11:53 AM

In addition, here shows the preferred way to link against samtools:

SAM tools

http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol#C_APIs_.28SAMtools-C.29

Download SAM tools for free. SAM (Sequence Alignment/Map) is a flexible generic format for storing nucleotide sequence alignment. SAMtools provide efficient utilities on manipulating alignments in the SAM format.

**jdrum00** · 12-17-2009, 12:48 PM

Originally posted by lh3 View Post

The simpler way is to merge adjacent regions close to each other. This will grab a little bit more reads in the output, but I believe this does not matter too much.

I started to do this, but I wasn't sure how rigorous I needed to be with the data, and the person I would need to ask is out of town.

Though, actually.... If two ranges are close enough to be spanned by a read, then there's no possible read that could fall between them and be considered outside the set of ranges.

So, closing gaps of the read length or less between ranges can't actually introduce extraneous reads, only eliminate duplicates. Thanks for helping me think that through!

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Duplicates in samtools view when using ranges

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News