Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • pybedtools - a Python wrapper for BEDTools

    [cross-posted to BEDTools mailing list]

    I use both BEDTools and Python a lot. So I've posted a new Python package, pybedtools, which wraps BEDTools programs in an easy-to-use Pythonic interface. The GitHub repo is here:

    Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic") - daler/pybedtools


    and you can view the docs here:



    Note this is different from the recently-released "bedtools-python", which is an API for low-level access to interval files. In contrast, "pybedtools" tries to give you all the functionality of BEDTools from your Python code.

    This is still very much in development so not all parts of BEDTools are fully supported yet. I've tried to focus on documentation and doctests for this release; future releases will be more code-centric.

    For bug reports/feature requests, please use either this thread or the BEDTools mailing list.

    -Ryan

  • #2
    Version 0.5 of pybedtools has now been released. There have been many, many improvements since the last version, thanks to collaboration with Brent Pedersen and Aaron Quinlan.

    In addition to wrapping all the BEDTools programs (including the latest multiBamCov, tagBam, and nucBed programs) and making them accessible from within Python, pybedtools now extends BEDTools by allowing feature-by-feature manipulation of BED/GFF/GTF/BAM/SAM/VCF files.

    There's lots more that pybedtools provides . . . as a brief example, here's the complete code that identifies genes that are <5kb from intergenic SNPs, given a file of genes and a file of SNPs:

    Code:
    from pybedtools import BedTool
    snps = BedTool('snps.bed.gz')
    genes = BedTool('hg19.gff')
    intergenic_snps = (snps - genes)
    nearby = genes.closest(intergenic_snps, d=True, stream=True)
    for gene in nearby:
        if int(gene[-1]) < 5000:
            print gene.name
    In contrast, here's how to do the same analysis in a shell script, which requires knowledge of bash, awk, and Perl:

    Code:
    snps=../test/data/snps.bed.gz
    genes=../test/data/hg19.gff
    intergenic_snps=/tmp/intergenic_snps
    
    snp_fields=`zcat $snps | awk '(NR == 2){print NF; exit;}'`
    gene_fields=9
    distance_field=$(($gene_fields + $snp_fields + 1))
    
    intersectBed -a $snps -b $genes -v > $intergenic_snps
    
    closestBed -a $genes -b $intergenic_snps -d \
    | awk '($'$distance_field' < 5000){print $9;}' \
    | perl -ne 'm/[ID|Name|gene_id]=(.*?);/; print "$1\n"'
    
    rm $intergenic_snps
    (You can read notes on this comparison at http://packages.python.org/pybedtool...omparison.html)


    Full documentation: http://packages.python.org/pybedtools/
    Source (contributions welcome): https://github.com/daler/pybedtools

    Or you can get a brief overview of pybedtools in:

    Pybedtools: a flexible Python library for manipulating genomic datasets and annotations
    Ryan K. Dale, Brent S. Pedersen, and Aaron R. Quinlan
    Bioinformatics (2011) first published online September 23, 2011
    doi:10.1093/bioinformatics/btr539

    Comment


    • #3
      ftp

      Thanks for pybedtools. It has been very helpful. Do you know if I can use remote bam files (FTP) with pybedtools yet? I know this feature was just added to Bedtools, but I can't seem to figure out how to use it.

      Thanks,
      Mary

      Comment


      • #4
        Thanks for pointing this out. I've just added support for remote BAMs in the development version (https://github.com/daler/pybedtools).

        When creating a new BedTool object, you can now use the remote=True keyword argument. Here's an example that grabs the first 10 lines of a remote BAM and converts to BED:

        Code:
        import pybedtools
        url = 'ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/exome_alignment/HG00096.chrom11.ILLUMINA.bwa.GBR.exome.20120522.bam'
        x = pybedtools.BedTool(url, remote=True)
        y = x.bam_to_bed(stream=True)
        for i, feature in enumerate(y):
            if i == 9:
                break
            print feature,

        this prints:

        Code:
         11	60636	60736	SRR081241.13799221/1	0	+
         11	60674	60774	SRR077487.5548889/1	0	+
         11	60684	60784	SRR077487.12853301/1	0	+
         11	60789	60889	SRR077487.5548889/2	0	-
         11	60950	61050	SRR077487.13826494/1	0	+
         11	60959	61059	SRR081241.13799221/2	0	-
         11	61052	61152	SRR077487.12853301/2	0	-
         11	61548	61648	SRR081241.16743804/2	0	+
         11	61665	61765	SRR081241.16743804/1	0	-

        Keep in mind that doing more extensive work that may need to read large parts of the remote file (like intersections) may take a long time.

        Comment


        • #5
          implementation question

          UPDATE: Please ignore! I had a problem with my bedtools installation, fixed that and now this is no longer an issue! -_-

          Hello, pybedtools looks absolutely perfect for what I need to do, but I have a newbie implementation question.

          I'm trying to get the simple examples working, but I'm having trouble with the subtract function.
          I've just installed pybedtools-0.6.7 on a mac and i'm running:

          Code:
          import pybedtools
          
          # set up 3 different bedtools
          a = pybedtools.BedTool(pybedtools.example_filename('a.bed'))
          b = pybedtools.BedTool(pybedtools.example_filename('a.bed'))
          c = pybedtools.BedTool(pybedtools.example_filename('a.bed'))
          print a.count()
          print a
          print a.subtract(b)
          the result is:
          Code:
          4
          chr1	1	100	feature1	0	+
          chr1	100	200	feature2	0	+
          chr1	150	500	feature3	0	-
          chr1	900	950	feature4	0	+
          
          Traceback (most recent call last):
            File "test.py", line 9, in <module>
              print a.subtract(b)
            File "/Library/Python/2.7/site-packages/pybedtools-0.6.7-py2.7-macosx-10.8-intel.egg/pybedtools/bedtool.py", line 664, in decorated
              result = method(self, *args, **kwargs)
            File "/Library/Python/2.7/site-packages/pybedtools-0.6.7-py2.7-macosx-10.8-intel.egg/pybedtools/bedtool.py", line 146, in not_implemented_func
              raise NotImplementedError(help_str)
          NotImplementedError: "subtractBed" does not appear to be installed or on the path, so this method is disabled.  Please install a more recent version of BEDTools and re-import to use this method.
          I have the same issue when I try version 0.6.6 on a mac or 0.6.7 on linux. Have I missed something basic? Neither (a-b-c).count() nor (a-b).count() work for me either

          Thanks!!
          Last edited by agroff11; 09-30-2014, 11:04 AM. Reason: answered own question

          Comment


          • #6
            pybedtools wraps and extends BEDTools, so you need to make sure BEDTools is available on your path.

            What version of BEDTools are you running? The easiest way to find out is by typing

            Code:
            bedtools --version
            in the terminal.

            Comment


            • #7
              thanks

              Thank you! I had not updated my path post installation

              Comment


              • #8
                OK, good -- glad it works.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                31 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X