SEQanswers

Old 09-20-2015, 04:33 PM   #1
francy
Member
 
Location: London

Join Date: Jun 2011
Posts: 19
Fastest way to find overlaps between multiple regions and a bed file

Dear experts,

I have a large, sorted BED file (main.bed) and several regions, which may overlap one another, defined in a second file (regions.bed). I am looking for a quick way to intersect main.bed with the regions in regions.bed and write a separate BED output file for each region.

I can do it in bash by looping over each line of main.bed and finding the overlapping regions in regions.bed with an awk command, but it is extremely slow. I can also do it with intersectBed from bedtools, but I am still using a loop to specify each region... Is there a better way to do this?

Thank you!
Francesca

Last edited by francy; 09-20-2015 at 05:18 PM.
Old 09-21-2015, 04:20 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Just directly intersect the two BED files rather than looping over anything. If you need the results in different files for each region then it'd be faster to pipe the output to awk and have it place things in different files as appropriate (yes, you can do redirection within awk...this is useful in such cases).
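
As a small sketch of the redirection-within-awk idea mentioned above: awk can send each `print` to a file whose name is built from the fields of the current line. The input lines and the `demo_` file prefix below are made up purely for illustration.

```shell
# Three made-up, tab-separated lines standing in for BED-like input.
printf 'chr1\t100\t150\nchr2\t5\t10\nchr1\t150\t200\n' > demo_input.bed

# awk redirection: the target file name is computed per line from column 1.
# Within one awk run, ">" truncates a file the first time it is opened and
# then keeps writing to it, so both chr1 lines end up in demo_chr1.bed.
awk 'BEGIN { FS = OFS = "\t" } { print > ("demo_" $1 ".bed") }' demo_input.bed

cat demo_chr1.bed
```

The same pattern should apply to the output of an intersect command: only the expression that builds the file name changes.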
Old 09-21-2015, 05:11 AM   #3
francy
Member
 
Location: London

Join Date: Jun 2011
Posts: 19

Thank you dpryan. Is there a way to intersect the BED files so that I get all the entries in main.bed that fall within the regions in regions.bed, split by region? In effect this subsets main.bed, and because the regions overlap, an entry in main.bed can be repeated. What I am looking for is not the overlap between the two BED files, but the subset of main.bed entries contained in each region defined by regions.bed. The output file for each region should contain many entries from main.bed, and an entry can appear in more than one output file since the regions can overlap... Is this possible to do?

cat main.bed
chr11 13302 13303 1
chr11 13980 13981 1
chr11 51476 51477 1

cat regions.bed
chr11 13202 14981 2
chr11 13980 51477 2

And the output should be:
cat res.region1
chr11 13302 13303 1
chr11 13980 13981 1

cat res.region2
chr11 13980 13981 1
chr11 51476 51477 1

Last edited by francy; 09-21-2015 at 05:21 AM.
Old 09-21-2015, 05:31 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Sure, but the output file names will probably need to be something like "chr11_10000_14000" and "chr11_13500_55000". You could use awk to do that. Have it generate the file name based on the regions.bed file entry and then only print the columns from main.bed to it (make sure to use ">>" rather than ">").
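
A minimal sketch of that suggestion, assuming the intersect output puts the regions.bed entry in columns 1-4 and the matching main.bed entry in columns 5-8. The rows are typed in by hand from the example in post #3, and the `pairs.tsv` and `res.` names are hypothetical.

```shell
# Rows as they might come out of intersecting regions.bed (columns 1-4)
# with main.bed (columns 5-8); typed in by hand here, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr11 13202 14981 2 chr11 13302 13303 1 \
  chr11 13202 14981 2 chr11 13980 13981 1 \
  chr11 13980 51477 2 chr11 13980 13981 1 \
  chr11 13980 51477 2 chr11 51476 51477 1 > pairs.tsv

# ">>" appends, so clear out results from any earlier run first.
rm -f res.chr11_*.bed

awk 'BEGIN { FS = OFS = "\t" }
{
    out = "res." $1 "_" $2 "_" $3 ".bed"   # file name from the region
    print $5, $6, $7, $8 >> out            # only the main.bed columns
    close(out)                             # keep the number of open files small
}' pairs.tsv

cat res.chr11_13202_14981.bed
```

Closing the file after each write keeps awk from running out of open files when there are many regions. Because the file is then reopened for every line, ">>" is needed here: ">" would truncate it on each reopen.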
Old 09-21-2015, 05:46 AM   #5
francy
Member
 
Location: London

Join Date: Jun 2011
Posts: 19

Ok... but I don't know what to pipe. Is there software that will give me all the entries in main.bed that are included in regions.bed, including repeated entries, so that I could then easily pipe the resulting BED output and split it by another column, for example? Or is the only solution to get the subset of main.bed included in regions.bed by looping over each line of regions.bed with the intersectBed command, for example? Thank you

Last edited by francy; 09-21-2015 at 05:52 AM.
Old 09-21-2015, 05:56 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Pipe the output of bedtools intersect:

Code:
$ cat regions.bed 
chr1	100	200
chr1	250	300
Code:
$ cat main.bed 
chr1	0	10
chr1	1	100
chr1	100	150
chr1	110	150
chr1	150	200
chr1	150	200
chr1	150	300
Code:
$ bedtools intersect -wao -a regions.bed -b main.bed 
chr1	100	200	chr1	100	150	50
chr1	100	200	chr1	110	150	40
chr1	100	200	chr1	150	200	50
chr1	100	200	chr1	150	200	50
chr1	100	200	chr1	150	300	50
chr1	250	300	chr1	150	300	50
Note that regions with no overlap are also included in the output (with 0 in the last column) and can easily be ignored. The last column is the number of overlapping bases, in case that's useful for filtering too. If both files are sorted, you can specify -sorted and things will run quicker.
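
One possible way to ignore the no-overlap rows is to filter on that last column with awk. This is only a sketch: the first three rows are copied by hand from the output above, and the fourth row (a region chr1 400-500 with no overlap, with `.` and `-1` placeholders) is a made-up illustration of what an unmatched region is assumed to look like. The file names are hypothetical.

```shell
# Hand-typed stand-in for the intersect output. With bedtools installed,
# this file could instead be produced with:
#   bedtools intersect -wao -a regions.bed -b main.bed > overlaps.tsv
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 200 chr1 100 150 50 \
  chr1 100 200 chr1 110 150 40 \
  chr1 250 300 chr1 150 300 50 \
  chr1 400 500 . -1 -1 0 > overlaps.tsv

# Keep only rows whose last column (number of overlapping bases) is positive.
awk -F'\t' '$NF > 0' overlaps.tsv > overlapping_only.tsv

cat overlapping_only.tsv
```

The threshold can be raised (for example `$NF >= 10`) if only larger overlaps are of interest.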
Old 09-21-2015, 07:20 AM   #7
francy
Member
 
Location: London

Join Date: Jun 2011
Posts: 19

I see, thank you!
Using your command line doesn't give me any unmatched entries, which is perfect. Thanks!!!