
  • Fastest way to find overlaps between multiple regions and a bed file

    Dear experts,

    I have a sorted large BED file (main.bed) and several regions defined in a file (regions.bed) which can be overlapping. I am trying to find a quick way to overlap the main BED file with the regions defined in regions.bed, and create a separate BED output file for each of these regions.

    I can do it with a bash loop that goes through each line of main.bed and uses an awk command to find the regions in regions.bed that overlap it, but this is extremely slow. I can also do it with intersectBed from bedtools, but I still need a loop to specify each region... Is there a better way to do this?

    Thank you!
    Francesca
    Last edited by francy; 09-20-2015, 05:18 PM.

  • #2
    Just directly intersect the two BED files rather than looping over anything. If you need the results in different files for each region then it'd be faster to pipe the output to awk and have it place things in different files as appropriate (yes, you can do redirection within awk...this is useful in such cases).


    • #3
      Thank you dpryan. Is there a way to intersect the BED files so that I get all the entries in main.bed that fall within each region in regions.bed? In other words, I want to subset main.bed and split the result by region, keeping repeated entries, since the regions can overlap. What I am looking for is not the overlap between the two BED files, but the subset of main.bed entries contained in each region defined in regions.bed. The output file for each region should contain many entries from main.bed, and an entry can appear in more than one output file because the regions can overlap. Is this possible?

      cat main.bed
      chr11 13302 13303 1
      chr11 13980 13981 1
      chr11 51476 51477 1

      cat regions.bed
      chr11 13202 14981 2
      chr11 13980 51477 2

      And the output should be:
      cat res.region1
      chr11 13302 13303 1
      chr11 13980 13981 1

      cat res.region2
      chr11 13980 13981 1
      chr11 51476 51477 1
      Last edited by francy; 09-21-2015, 05:21 AM.


      • #4
        Sure, but the output file names will probably need to be something like "chr11_10000_14000" and "chr11_13500_55000". You could use awk to do that. Have it generate the file name based on the regions.bed file entry and then only print the columns from main.bed to it (make sure to use ">>" rather than ">").
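
The file-naming idea above can be sketched in awk alone. Note that this is a swapped-in approach, not what the posters ran: it skips bedtools and does the overlap test inside awk, assuming regions.bed is small enough to hold in memory and that both files are tab-separated. The sample data is copied from post #3, and the `.bed` suffix on the output names is a hypothetical choice.

```shell
#!/bin/sh
# Sample inputs copied from post #3 (written tab-separated here).
printf 'chr11\t13302\t13303\t1\nchr11\t13980\t13981\t1\nchr11\t51476\t51477\t1\n' > main.bed
printf 'chr11\t13202\t14981\t2\nchr11\t13980\t51477\t2\n' > regions.bed

# Start clean, because the awk script appends with ">>".
rm -f chr11_13202_14981.bed chr11_13980_51477.bed

awk 'BEGIN { FS = OFS = "\t" }
NR == FNR {                      # first file (regions.bed): remember every region
    n++
    rchrom[n] = $1; rbeg[n] = $2 + 0; rend[n] = $3 + 0
    next
}
{                                # second file (main.bed): read once, line by line
    for (i = 1; i <= n; i++) {
        # Half-open BED intervals overlap when each starts before the other ends.
        if ($1 == rchrom[i] && ($2 + 0) < rend[i] && ($3 + 0) > rbeg[i]) {
            out = rchrom[i] "_" rbeg[i] "_" rend[i] ".bed"   # name built from the region
            print >> out         # append the whole main.bed line to that region file
            # With very many regions, some awk versions hit an open-file limit;
            # calling close(out) here avoids that at some cost in speed.
        }
    }
}' regions.bed main.bed
```

Each main.bed line is tested against every region, so an entry that falls inside two overlapping regions is written to both output files, which matches the res.region1 and res.region2 contents requested in post #3.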


        • #5
          Ok... but I don't know what to pipe. Is there software that will give me all the entries in main.bed that are included in regions.bed, including repeated entries, so that I could then easily pipe the resulting BED file and split it by another column, for example? Or is the only solution to subset main.bed with a loop over the lines of regions.bed, running intersectBed for each one? Thank you
          Last edited by francy; 09-21-2015, 05:52 AM.


          • #6
            Pipe the output of bedtools intersect:

            Code:
            $ cat regions.bed 
            chr1	100	200
            chr1	250	300

            Code:
            $ cat main.bed 
            chr1	0	10
            chr1	1	100
            chr1	100	150
            chr1	110	150
            chr1	150	200
            chr1	150	200
            chr1	150	300

            Code:
            $ bedtools intersect -wao -a regions.bed -b main.bed 
            chr1	100	200	chr1	100	150	50
            chr1	100	200	chr1	110	150	40
            chr1	100	200	chr1	150	200	50
            chr1	100	200	chr1	150	200	50
            chr1	100	200	chr1	150	300	50
            chr1	250	300	chr1	150	300	50

            Note that with -wao, regions that overlap nothing in main.bed are also reported, with an overlap of 0 in the last column, so they can easily be filtered out. The last column is the number of overlapping bases, in case that's useful for filtering too. If both files are position-sorted, you can specify -sorted and things will run quicker.
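
A minimal sketch of the remaining awk step, assuming the 7-column -wao layout shown above (region in columns 1-3, main.bed entry in columns 4-6, overlap in the last column). To keep the example runnable without bedtools, `intersect.tsv` is a hypothetical stand-in file holding the rows copied from the output above; in practice the bedtools command would be piped straight into awk. The `.bed` suffix on the output names is also an assumption.

```shell
#!/bin/sh
# Stand-in for: bedtools intersect -wao -a regions.bed -b main.bed
# Rows copied from the output in this post; tr turns the spaces into tabs.
tr ' ' '\t' > intersect.tsv <<'EOF'
chr1 100 200 chr1 100 150 50
chr1 100 200 chr1 110 150 40
chr1 100 200 chr1 150 200 50
chr1 100 200 chr1 150 200 50
chr1 100 200 chr1 150 300 50
chr1 250 300 chr1 150 300 50
EOF

# Start clean, because the awk script appends with ">>".
rm -f chr1_100_200.bed chr1_250_300.bed

awk 'BEGIN { FS = OFS = "\t" }
$NF > 0 {                              # skip regions reported with 0 overlapping bases
    out = $1 "_" $2 "_" $3 ".bed"      # one output file per region
    print $4, $5, $6 >> out            # keep only the main.bed columns
}' intersect.tsv
```

With real data the same awk body can sit at the end of the pipeline, e.g. `bedtools intersect -wao -a regions.bed -b main.bed | awk '...'`. If main.bed carries more than three columns, the column positions in the `print` statement would need adjusting.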


            • #7
              I see, thank you!
              Using your command line doesn't give me any unmatched entries, which is perfect. Thanks!!!
