  • BEDTools Version 2.1

    Hi all,

    I updated the BEDTools utility "coverageBed" so that it now reports the density/breadth of coverage for a given interval. Specifically, for each interval in B, it reports:
    1) the number of overlapping features in A.
    2) the number of bases in B that had non-zero coverage from A.
    3) the fraction (density) of bases in B that had non-zero coverage from A.


    An example (note that start coordinates are UCSC-style 0-based, so the actual first base is 1 greater than the start value):

    > cat A.bed
    chr1 10 20
    chr1 11 21
    chr1 12 25

    > cat B.bed
    chr1 0 50

    > coverageBed -a A.bed -b B.bed
    chr1 0 50 3 15 50 0.3

    where:
    column 4 is the number of intervals in A that overlap with B.
    column 5 is the number of bases in B with non-zero "coverage" from A.
    column 6 is the length of the interval in B.
    column 7 is the fraction of bases in B that have non-zero coverage from A.

    In essence, column 7 describes the "breadth" of coverage, whereas column 4 describes the "depth". In this case, B is overlapped by 3 features in A, and these features cover 30% (15 of 50 bp) of the interval in B.
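For anyone who wants to check the numbers by hand, here is a minimal Python sketch of the calculation described above. It is not the coverageBed source, just an illustration under the assumption that intervals are 0-based, end-exclusive, and that B has non-zero length; the function name `coverage` is made up for this example.

```python
def coverage(a_intervals, b_interval):
    """Return (n_overlaps, covered_bases, b_length, fraction) for one B interval.

    Intervals are (chrom, start, end) tuples with a 0-based start and an
    exclusive end. Assumes the B interval has non-zero length.
    """
    chrom, b_start, b_end = b_interval
    # Keep only the A features that overlap B, clipped to the bounds of B.
    clipped = sorted(
        (max(start, b_start), min(end, b_end))
        for a_chrom, start, end in a_intervals
        if a_chrom == chrom and start < b_end and end > b_start
    )
    # Merge the clipped features so that each covered base is counted once.
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    length = b_end - b_start
    return len(clipped), covered, length, covered / length


A = [("chr1", 10, 20), ("chr1", 11, 21), ("chr1", 12, 25)]
B = ("chr1", 0, 50)
print(*B, *coverage(A, B))  # prints: chr1 0 50 3 15 50 0.3
```

The three A features merge into a single covered block from 10 to 25, which is where the 15 in column 5 comes from.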

    The newest version is here:
    http://people.virginia.edu/~arq5x/bedtools.html


    All the best,
    Aaron

  • #2
    BEDTools 2.1 fails to compile under Linux because of the following line in bedFile.cpp:

    bedEntry.minOverlapStart = INT_MAX;


    • #3
      Thanks for finding this, pko. Apparently my system and those of other users let me get away with omitting limits.h from that source file. I'll post a new version as soon as I get back from vacation. In the interim, if others face this problem, add the following to bedFile.h on line 12:

      #include <limits.h>

      Save, re-make and you should be good to go.

      Apologies and thanks much for pointing this out.

      Best,
      Aaron


      • #4
        I'm having some strange issues with complementBed - it appears to be highly sensitive to the naming convention used in the chromosome field. For example, this works:

        Bed file:

        chr21 32345 65443

        genome file:

        chr21 48099781

        but this gives no output:

        Bed file:

        hr21 32345 65443

        genome file:

        hr21 48099781


        Maybe there are some restrictions in the bed format that I'm unaware of? Haven't tested any of the other tools.

        Thanks,

        Dion


        • #5
          complementBed

          Hi dlepp,
          Thanks for your post; this is a strange problem. I was able to recreate it as well. There is nothing that explicitly limits what can be used for the "chrom" field. The intent is that any string could be used. Oddly, it seems to be a problem with the C++ string tokenizing function I wrote, which is basically just lifted from a "best practices" book. To make things more odd, the following works (note h21 instead of hr21):

          Bed file:

          h21 32345 65443

          genome file:

          h21 48099781

          I tried using other tokenizing methods and the problem persists. I am on vacation until early August and will fix it when I return. In the meantime, if you just use chr21 or 21, all should be well.
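For reference, the behaviour complementBed is meant to have (report every stretch of each chromosome that no BED feature covers) can be sketched in a few lines of Python. This is not the BEDTools C++ code and says nothing about the cause of the tokenizing bug; it is only an illustration that splits each line on whitespace and treats the chrom field as an opaque string. The function names are hypothetical.

```python
def parse_fields(line):
    """Split a BED or genome-file line on any run of whitespace."""
    return line.split()


def complement(bed_lines, genome_lines):
    """Return (chrom, start, end) intervals NOT covered by any BED feature."""
    # Genome file: one "chrom size" pair per line.
    sizes = {}
    for line in genome_lines:
        chrom, size = parse_fields(line)[:2]
        sizes[chrom] = int(size)

    # Group features by chrom; the chrom name is never inspected, only compared.
    features = {}
    for line in bed_lines:
        chrom, start, end = parse_fields(line)[:3]
        features.setdefault(chrom, []).append((int(start), int(end)))

    gaps = []
    for chrom, size in sizes.items():
        pos = 0  # first base not yet accounted for
        for start, end in sorted(features.get(chrom, [])):
            if start > pos:
                gaps.append((chrom, pos, start))
            pos = max(pos, end)
        if pos < size:
            gaps.append((chrom, pos, size))
    return gaps


for name in ("chr21", "hr21", "h21"):
    print(complement([f"{name} 32345 65443"], [f"{name} 48099781"]))
```

With this approach all three names give the same two gaps, 0 to 32345 and 65443 to 48099781, which is the result one would expect from the tool regardless of how the chromosome is named.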

          Thanks for pointing this out as it is a strange error that needs to be addressed.

          Best,
          Aaron


          • #6
            BEDTools v2.1.1

            Hi,
            I have posted a new version (2.1.1) that addresses the issues that dlepp and pko have so kindly pointed out.

            I've posted it to http://people.virginia.edu/~arq5x/bedtools.html and will update sourceforge soon.

            Thanks again for letting me know of these problems.

            Best,
            Aaron


            • #7
              Aaron,


              I've been trying BEDTools for mapping SNPs to genomic features, which often overlap. How does 'closestBed' handle these cases? E.g., two genes that overlap, and a SNP in the overlap region: does it pick one gene at random? The amount of overlap is going to be identical in these cases.

              Thanks!


              • #8
                closestBed

                Hi ohofmann,

                Currently, in such situations, closestBed will return the first feature that occurs in the feature file. This works well for larger intervals (e.g. genes, not SNPs), but in the case you describe, it really isn't ideal.

                My guess is that in this case, you'd prefer more control. For example:
                a) return _all_ features that overlap with the SNP.
                b) return the largest feature that overlaps with the SNP.
                c) return the smallest feature that overlaps with the SNP.
                d) randomly select a feature.

                All of these options are quite easy to implement. I can likely implement them this week or early next week if it helps you. To be precise, cases a-d will only be invoked when there are multiple features in B that have 100% overlap with the interval in A (in your case, a SNP). Otherwise, only the closest (i.e. closest non-overlapping or most overlapping) feature will be reported.
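To make options a-d concrete, here is a rough Python sketch of how such tie handling could look. None of this exists in closestBed as of this post; the function names, the `mode` values, and the gene coordinates are all made up for illustration. It assumes the query (the SNP) has non-zero length and that everything is on the same chromosome.

```python
import random


def overlap(a, b):
    """Number of bases shared by two intervals given as (start, end, ...)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def resolve_ties(query, features, mode="all", rng=random):
    """Choose among the features that overlap 100% of `query`.

    mode is one of "all", "largest", "smallest" or "random" (hypothetical
    names for options a-d). Always returns a list, possibly empty.
    """
    q_len = query[1] - query[0]
    tied = [f for f in features if overlap(query, f) == q_len]
    if not tied or mode == "all":
        return tied
    if mode == "largest":
        return [max(tied, key=lambda f: f[1] - f[0])]
    if mode == "smallest":
        return [min(tied, key=lambda f: f[1] - f[0])]
    if mode == "random":
        return [rng.choice(tied)]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical SNP and genes; geneA and geneB both contain the SNP.
snp = (150, 151)
genes = [(100, 500, "geneA"), (120, 200, "geneB"), (300, 400, "geneC")]
print(resolve_ties(snp, genes, "all"))
print(resolve_ties(snp, genes, "largest"))
print(resolve_ties(snp, genes, "smallest"))
```

Here "all" returns geneA and geneB, "largest" returns geneA and "smallest" returns geneB; geneC never enters the tie because it doesn't contain the SNP.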

                Thanks for pointing this out.
                Aaron


                • #9
                  Aaron,


                  not sure it's worth the hassle -- just adding the information to the man page should be more than enough. My current workflow, using the mapping of SNPs to genes within a 25kb window as an example:

                  * Run windowBed on all SNPs (streamed) vs. a gene file, +/- 25kb, printing out all hits.
                  * Cut the overlapping gene regions out of the result file.
                  * Sort/uniq to remove duplicate genes (not sure how closestBed handles those, just in case), and likewise for SNPs. This also drops SNPs that have no gene within 25kb, which otherwise might end up mapped to genes a few megabases away.

                  Take those files as input for closestBed. If a SNP actually overlaps more than one gene it probably makes sense to return all of them, since "closest" really isn't defined. Closest to what: the start of a gene (which depends on strand)? The UTR? Etc.

                  All features is quite likely the only alternative that makes sense in this context.
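The windowing and de-duplication steps above boil down to something like the following Python sketch. It is a brute-force illustration rather than what windowBed does internally, and the SNP and gene records are invented for the example.

```python
def window_hits(snps, genes, window=25_000):
    """Map each SNP to every gene within `window` bases on the same chromosome.

    snps:  list of (chrom, start, end, snp_id)
    genes: list of (chrom, start, end, gene_id)
    Returns {snp_id: sorted unique gene_ids}. SNPs with no gene in range are
    left out, so they cannot be matched to a distant gene later on.
    """
    hits = {}
    for s_chrom, s_start, s_end, snp_id in snps:
        lo = max(0, s_start - window)
        hi = s_end + window
        for g_chrom, g_start, g_end, gene_id in genes:
            if g_chrom == s_chrom and g_start < hi and g_end > lo:
                # A set removes duplicate genes, like the sort/uniq step.
                hits.setdefault(snp_id, set()).add(gene_id)
    return {snp_id: sorted(ids) for snp_id, ids in hits.items()}


# Hypothetical records: snpA sits inside two overlapping genes, snpB is
# megabases away from everything.
snps = [("chr1", 1000, 1001, "snpA"), ("chr1", 5_000_000, 5_000_001, "snpB")]
genes = [
    ("chr1", 500, 2000, "gene1"),
    ("chr1", 1500, 3000, "gene2"),
    ("chr1", 20_000, 30_000, "gene3"),
    ("chr1", 100_000, 110_000, "gene4"),
]
print(window_hits(snps, genes))
```

snpA comes back with gene1, gene2 and gene3, while snpB is dropped, which is the filtering effect described in the third step.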

                  Best, Oliver


                  • #10
                    Hi Oliver,
                    I agree that returning all features, either optionally or by default, is best in this case. Such behavior would allow the user to "pipe" the output to a downstream Perl/AWK/Python/Ruby/VogueLanguageOfTheMonth script in order to choose max, min, random, etc.

                    I'll try to knock this out in the next couple of weeks. Not hard, just difficult to find time at the moment.

                    Aaron


                    • #11
                      No rush at all, and thanks!

                      -- Oliver
