SEQanswers

07-07-2010, 06:28 AM   #1
simonvh
Member
 
Location: NCMLS, Nijmegen

Join Date: Jul 2010
Posts: 12
GimmeMotifs: a ChIP-seq motif prediction pipeline

Hello all,

As we're working with a lot of ChIP-seq data in our lab, we needed a tool to reliably predict motifs de novo from our peaks. The approach we developed might be useful to others, so I'd like to point you to the website:
http://www.ncmls.eu/bioinfo/gimmemotifs/

Basically, the approach is to run several different algorithms (as was suggested in some benchmark studies and reviews), and combine the output into a non-redundant list of motifs. Long-time favorites such as MEME and MotifSampler are included, as well as some more recent tools developed for ChIP-seq (or ChIP-chip) data including trawler and MoAn.
To rank and evaluate the motifs we predict motifs on a part of the dataset, and use the rest for evaluation (enrichment, ROC curve, MNCP score).
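
The hold-out idea described above can be sketched in a few lines of Python. This is only an illustration and not the actual GimmeMotifs code: the function names, the 20% prediction fraction and the toy scores are all assumptions made for the example.

```python
import random


def split_peaks(peaks, predict_fraction=0.2, seed=42):
    """Shuffle peaks and split them into a prediction set and a validation set.

    predict_fraction is an assumed value, chosen for illustration only.
    """
    shuffled = list(peaks)
    random.Random(seed).shuffle(shuffled)
    n_predict = int(len(shuffled) * predict_fraction)
    return shuffled[:n_predict], shuffled[n_predict:]


def roc_auc(fg_scores, bg_scores):
    """ROC AUC as the probability that a random foreground sequence
    scores higher than a random background sequence (ties count half)."""
    wins = 0.0
    for fg in fg_scores:
        for bg in bg_scores:
            if fg > bg:
                wins += 1.0
            elif fg == bg:
                wins += 0.5
    return wins / (len(fg_scores) * len(bg_scores))


if __name__ == "__main__":
    peaks = ["peak_%d" % i for i in range(100)]
    predict_set, validation_set = split_peaks(peaks)
    print(len(predict_set), len(validation_set))  # 20 80

    # Toy best-match motif scores, made up for the example
    fg = [0.9, 0.8, 0.7, 0.4]
    bg = [0.6, 0.5, 0.3, 0.2]
    print(roc_auc(fg, bg))  # 0.875
```

Motifs would be predicted on the first set only, and the scores used for the ROC curve would come from scanning the second set against a background set.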

You can see an example of the output here (this is for a ChIP-seq experiment with the transcription factor p63):
http://www.ncmls.eu/bioinfo/gimmemot...if_report.html

The package is implemented in Python, and can be freely downloaded. Installation is somewhat of a hassle as all the different tools need to be installed and configured separately, but other than that I hope that the installation procedure is smooth and documented.

Please let me know if you find GimmeMotifs useful, have any questions or notice any bugs or omissions in the documentation.

Simon

07-07-2010, 08:45 PM   #2
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136

Hi Simon - this looks pretty neat. I'm installing it now and will pester you with questions!

07-07-2010, 11:15 PM   #3
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136

The first problem I've overcome is a strange incompatibility between parallel python (python-pp version 1.5.7-1) and numpy when using the Ubuntu 10.04 repository versions. I solved this by installing version 1.6.0-RC5 of parallel python from here, and I am now up and running the included example using meme, Weeder, MDmodule and gadem.

Which version of parallel python are you developing with? It could be a bug specific to my system, as it hasn't had a clean install since Ubuntu 8.04.

07-08-2010, 12:08 AM   #4
simonvh
Member
 
Location: NCMLS, Nijmegen

Join Date: Jul 2010
Posts: 12

Hmm that's strange. I'm using version 1.5.7 of pp in combination with numpy version 1.4.1, and that works fine. Which version of numpy is in the Ubuntu repositories? Are you running Python 2.6?
Was it similar to this bug: http://www.parallelpython.com/compon...9/topic,413.0?

Let me know if using pp 1.6.0 resolves the issue.

07-08-2010, 12:20 AM   #5
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136

Yeah, that link is where I got the idea to install pp 1.6.0. Ubuntu's numpy is only version 1.3.0; if I have more trouble I'll try upgrading that next. This is all with Python 2.6.

I've run into a few bugs in gimmemotifs that I'm fixing along the way, you should see a pull request on your github soon! (though I'm no python developer)

07-08-2010, 02:51 AM   #6
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136

OK, I've gotten it to run the included example successfully. What I had to do was remove the Ubuntu versions of numpy (and therefore matplotlib), scipy and parallel python, and install these from source:

numpy-1.4.1
scipy-0.8.0rc1
pp-1.5.7 (doesn't work with pp-1.6.0rc5)
matplotlib-0.99.3

It's now running on one of my .bed files output from MACS. I had to trim it down to a 3-column BED to get it to work. What does gimmemotifs use the 4th column for?
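
A minimal sketch of that trimming step, assuming a tab-separated BED file. The function names and any file names passed to them are hypothetical and not part of GimmeMotifs.

```python
def trim_bed_line(line):
    """Return only chrom, start and end from a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:3])


def trim_bed(in_path, out_path):
    """Write a three-column copy of a BED file, skipping header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Skip blank lines and the usual BED header/comment lines
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            dst.write(trim_bed_line(line) + "\n")
```

For example, `trim_bed("macs_peaks.bed", "macs_peaks.3col.bed")` would write the reduced file next to the original.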

But so far this looks pretty useful, thanks for releasing it.

07-08-2010, 03:20 AM   #7
simonvh
Member
 
Location: NCMLS, Nijmegen

Join Date: Jul 2010
Posts: 12

Thanks for finding and fixing some of the bugs

I will have a look at the input format. I should fix it so that any file in valid BED format is accepted. The fourth column is used to sort the peaks (we usually have the number of reads in there). This is for the benefit of MDmodule, which actually uses the ranking of the sequences in the motif search. However, if there is no numerical value in the fourth column, it should simply be ignored instead of making the program choke on that input.
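
The intended behaviour could look roughly like the sketch below. It is not the GimmeMotifs implementation; the function names and the "highest value first" ordering are assumptions for illustration.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end, value) tuples.

    value is a float when the fourth column is numeric, otherwise None.
    """
    regions = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue  # blank, header or comment line
        value = None
        if len(fields) > 3:
            try:
                value = float(fields[3])
            except ValueError:
                value = None  # e.g. a peak name such as "MACS_peak_1"
        regions.append((fields[0], int(fields[1]), int(fields[2]), value))
    return regions


def rank_regions(regions):
    """Sort by the fourth-column value, highest first (an assumed order),
    but only when every region has a numeric value; otherwise keep the
    input order unchanged."""
    if regions and all(r[3] is not None for r in regions):
        return sorted(regions, key=lambda r: r[3], reverse=True)
    return list(regions)
```

With this approach a three-column BED file, or one with peak names in the fourth column, passes through in its original order instead of raising an error.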

07-08-2010, 06:27 AM   #8
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

Please add an entry in the software wiki; otherwise you're stuck with what I put there!

07-08-2010, 11:09 PM   #9
simonvh
Member
 
Location: NCMLS, Nijmegen

Join Date: Jul 2010
Posts: 12

Ah, yes, that was on my to-do list, it's good to be reminded. Done

11-18-2010, 04:51 AM   #10
simonvh
Member
 
Location: NCMLS, Nijmegen

Join Date: Jul 2010
Posts: 12

I just wanted to let you know that GimmeMotifs has been accepted for publication in Bioinformatics:
doi: 10.1093/bioinformatics/btq636.

The installation procedure has been simplified, and packages for Ubuntu, Debian and Fedora are now available. If you need motif prediction for ChIP-seq data, give it a try and let me know what you think: http://www.ncmls.nl/bioinfo/gimmemotifs/.

05-07-2014, 12:05 AM   #11
krespim
Member
 
Location: Dresden

Join Date: Jul 2012
Posts: 49

Hi Simon,

First of all, thank you for the tool. I am preparing to try it out, but since my data is a tad tricky I was wondering if you could give some hints on how best to set up the run.

The issue is that the peaks are not from ChIP-seq but from DamID-seq. This means that the motif might not necessarily be located in the middle of the peak, and the peaks (if one can call them that) can be quite broad, from 100 bp to >5 kb. This is for a transcription factor, by the way.

So the question is: do you have any recommendations for analysing data from this type of experiment (or similar)? At the moment I am selecting peaks shorter than 1 kb to use as input.

05-12-2014, 11:10 PM   #12
simonvh
Member
 
Location: NCMLS, Nijmegen

Join Date: Jul 2010
Posts: 12

This is indeed trickier than a typical ChIP-seq run, but most likely not impossible. Basically, there are two important things here. The first is that the motif is not necessarily located in the center of the peak. Most motif programs that are run by GimmeMotifs do not take the location of the motif in the sequence into account. However, by default GimmeMotifs truncates the input sequences to 200 bp. This is probably too strict in your case, so I would change the -w parameter to 1000 to use 1 kb sequences for searching. Otherwise, even if your input sequences are 1 kb, only 200 bp would be used as input.
The second is the "peak" size. If you have enough regions smaller than 1 kb, I would indeed use these for motif searching. You can always check for the presence of the motif in the larger sequences later. Otherwise you can just use all regions as input, as GimmeMotifs will truncate the larger sequences. If there are enough sequences that contain a motif, this should not be much of a problem.
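
To make the effect of the width setting concrete, here is a small sketch of filtering regions by length and cutting them to a fixed width. It is not taken from the GimmeMotifs source: the function names are hypothetical, and centering the window on the region midpoint is an assumption of this example.

```python
def filter_by_length(regions, max_length=1000):
    """Keep (chrom, start, end) regions no longer than max_length bp."""
    return [r for r in regions if r[2] - r[1] <= max_length]


def resize_to_width(region, width=200):
    """Return a window of `width` bp centered on the region midpoint.

    Centering on the midpoint is an assumption made for this sketch.
    """
    chrom, start, end = region
    center = (start + end) // 2
    new_start = max(0, center - width // 2)  # do not run off the chromosome start
    return chrom, new_start, new_start + width


if __name__ == "__main__":
    region = ("chr1", 1000, 3000)
    print(resize_to_width(region, 200))   # ('chr1', 1900, 2100)
    print(resize_to_width(region, 1000))  # ('chr1', 1500, 2500)
```

With a width of 200, only a tenth of this 2 kb region is searched; with 1000, half of it is, which is why a larger width matters when the motif may lie away from the center.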

05-13-2014, 07:15 AM   #13
krespim
Member
 
Location: Dresden

Join Date: Jul 2012
Posts: 49

Thanks a lot for the suggestions. After posting the question, I selected regions up to 500 bp and also up to 1 kb (always setting the -w parameter) and got similar motifs with both, which is comforting. pwmscan.py also came in handy.

Just another couple of things:

1. I looked at the manual but could not find a description of the output of pwmscan.py.

2. The results I have for my best motif look good from my interpretation of the report. Is this correct? Here are the results:
random
enrichment 6.00
p-value 0.00
ROC_AUC 0.703
MNCP 4.116

genomic_matched
enrichment 2.25
p-value 0.00
ROC_AUC 0.695
MNCP 1.808


The p-value=0 is the one that is bugging me.

07-21-2016, 07:21 AM   #14
dzavallo
Member
 
Location: argentina

Join Date: Apr 2011
Posts: 16

Dear Simon,

We are contacting you as users of your GimmeMotifs pipeline.
We are trying to use the roc.py and cluster.py scripts with a file (PWMFILE) that is not derived from GimmeMotifs. Instead, the matrix I am trying to run is composed of results I got from other predictor scripts. The error message I get when trying to run the ROC script is:

command:
gimme roc -o kentaro_roc.pdf kentaro2_julio2016 nuevalista_junio2016.fasta 10000_random_promoters_1500pb_masked_not_E011.fasta

error:
failed to initialize cache
global name 'make_region' is not defined
Traceback (most recent call last):
  File "/tools/anaconda2/bin/gimme", line 469, in <module>
    args.func(args)
  File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/commands/roc.py", line 40, in roc
    for scores in s.best_score(fg_file):
  File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 270, in best_score
    for matches in self.scan(seqs, 1, scan_rc, cutoff=0):
  File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 355, in scan
    for result in it:
  File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 418, in _scan_sequences
    motif_digest = self.checksum[motif_file]
KeyError: 'kentaro2_julio2016.txt'


In a previous version of GimmeMotifs I was able to do this, but I noticed that after updating GimmeMotifs the input file (PWMFILE) is no longer recognized. I have attached the matrix here so you can see where the error might be.

To work around this problem, I started from the very beginning and ran the whole GimmeMotifs pipeline (including all predictors). However, at the step where I have to provide a FASTA file with the whole genome sequence to use as background, I failed to index the whole tomato genome (my samples are from this species). The error message I got this time is:

command:
gimme background -i SL_todoscrom.fa -f SL.fa -g 2.3 -n 1

error:
background: error: too few arguments


Thank you very much in advance for your help with this. Your comments and suggestions are more than welcome.

Best wishes,

All times are GMT -8.

