  • simonvh
    Member
    • Jul 2010
    • 12

    GimmeMotifs: a ChIP-seq motif prediction pipeline

    Hello all,

    As we're working with a lot of ChIP-seq data in our lab, we needed a tool to reliably predict motifs de novo from our peaks. The approach we developed might be useful to others, so I'd like to point you to the website:


    Basically, the approach is to run several different algorithms (as was suggested in some benchmark studies and reviews), and combine the output into a non-redundant list of motifs. Long-time favorites such as MEME and MotifSampler are included, as well as some more recent tools developed for ChIP-seq (or ChIP-chip) data including trawler and MoAn.
    To rank and evaluate the motifs we predict motifs on a part of the dataset, and use the rest for evaluation (enrichment, ROC curve, MNCP score).
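The hold-out idea described above can be sketched in a few lines of Python. This is only an illustration, not the GimmeMotifs implementation: the function names, the 20% prediction fraction and the score cutoff are all assumptions, and the MNCP score is left out. It takes per-sequence motif scores for the held-out peaks and for a background set, and computes an enrichment and a ROC AUC.

```python
import random

def split_peaks(peaks, predict_fraction=0.2, seed=0):
    """Shuffle peaks and split them into a prediction set and an evaluation set.

    predict_fraction is a hypothetical default, not a value taken from the thread.
    """
    shuffled = list(peaks)
    random.Random(seed).shuffle(shuffled)
    n_predict = int(len(shuffled) * predict_fraction)
    return shuffled[:n_predict], shuffled[n_predict:]

def roc_auc(pos_scores, neg_scores):
    """ROC AUC: probability that a random positive outscores a random negative.

    Ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def enrichment(pos_scores, neg_scores, cutoff):
    """Fraction of positives at or above cutoff divided by the same fraction for negatives."""
    pos_frac = sum(s >= cutoff for s in pos_scores) / len(pos_scores)
    neg_frac = sum(s >= cutoff for s in neg_scores) / len(neg_scores)
    return float("inf") if neg_frac == 0 else pos_frac / neg_frac
```

The pairwise AUC is quadratic in the number of sequences, which is fine for a sketch but would be replaced by a rank-based calculation for large peak sets.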

    You can see an example of the output here (this is for a ChIP-seq experiment with the transcription factor p63):


    The package is implemented in Python, and can be freely downloaded. Installation is somewhat of a hassle as all the different tools need to be installed and configured separately, but other than that I hope that the installation procedure is smooth and documented.

    Please let me know if you find GimmeMotifs useful, have any questions or notice any bugs or omissions in the documentation.

    Simon
  • frozenlyse
    Senior Member
    • Sep 2008
    • 135

    #2
    Hi Simon - this looks pretty neat. I'm installing it now and will pester you with questions!

    Comment

    • frozenlyse
      Senior Member
      • Sep 2008
      • 135

      #3
      The first problem I've overcome is a strange incompatibility between parallel python (python-pp version 1.5.7-1) and numpy when using the Ubuntu 10.04 repository versions. I solved this by installing version 1.6.0-RC5 of parallel python from here, and I am now up and running the included example using meme, Weeder, MDmodule and gadem.

      Which version of parallel python are you developing with? It could be a bug specific to my system, as it hasn't had a clean install since Ubuntu 8.04.

      Comment

      • simonvh
        Member
        • Jul 2010
        • 12

        #4
        Hmm that's strange. I'm using version 1.5.7 of pp in combination with numpy version 1.4.1, and that works fine. Which version of numpy is in the Ubuntu repositories? Are you running Python 2.6?
        Was it similar to this bug: http://www.parallelpython.com/compon...9/topic,413.0?

        Let me know if using pp 1.6.0 resolves the issue.

        Comment

        • frozenlyse
          Senior Member
          • Sep 2008
          • 135

          #5
          Yeah that link is where I got the idea to install pp 1.6.0 (ubuntu numpy is only version 1.3.0, if I have more troubles I'll try upgrading that next), all using python 2.6

          I've run into a few bugs in gimmemotifs that I'm fixing along the way, you should see a pull request on your github soon! (though I'm no python developer)

          Comment

          • frozenlyse
            Senior Member
            • Sep 2008
            • 135

            #6
            OK, I've gotten it to run the included example successfully. What I had to do was remove the Ubuntu versions of numpy (and therefore matplotlib), scipy and parallel python, and install these from source:

            numpy-1.4.1
            scipy-0.8.0rc1
            pp-1.5.7 (doesn't work with pp-1.6.0rc5)
            matplotlib-0.99.3
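For anyone comparing their own setup against the versions listed above, a small check like the following may help. It is a generic sketch rather than anything shipped with GimmeMotifs; the helper name is made up, and it assumes each package exposes a `__version__` attribute, which not every package does.

```python
import importlib

def installed_versions(module_names):
    """Return {module name: version string, "unknown", or None if not importable}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # module is not installed
            continue
        # Assumption: the module exposes __version__; fall back to "unknown" otherwise.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling `installed_versions(["numpy", "scipy", "matplotlib", "pp"])` would then report what Python actually imports, which can differ from what the package manager lists when source and repository installs are mixed.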

            It's now running on one of my .bed files output from MACS. I had to trim it down to a 3-column BED file to get it to work. What does gimmemotifs use the 4th column for?

            But so far this looks pretty useful, thanks for releasing it.

            Comment

            • simonvh
              Member
              • Jul 2010
              • 12

              #7
              Thanks for finding and fixing some of the bugs

              I will have a look at the input format. I should fix it, so that any file in valid BED format is accepted. The fourth column is used to sort the peaks (we usually have the nr of reads in there). This is for the benefit of MDmodule, which actually uses the ranking of the sequences in the motif search. However, if there is no numerical value in the fourth column, it should just be left unused, instead of choking on that input.
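The behaviour described above (sort on the fourth column when it is numeric, ignore it otherwise) could look roughly like this. It is a sketch of the idea only, not the actual GimmeMotifs code; the function names are hypothetical, and the rule that a single non-numeric value disables sorting for the whole file is an assumption.

```python
def read_bed(lines):
    """Parse BED lines into (chrom, start, end, score) tuples.

    score is None when column 4 is missing or is not a number
    (for example a peak name such as "MACS_peak_1").
    """
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        score = None
        if len(fields) > 3:
            try:
                score = float(fields[3])
            except ValueError:
                score = None
        regions.append((chrom, start, end, score))
    return regions

def rank_regions(regions):
    """Sort by score, highest first, only if every region has a numeric score.

    Otherwise the input order is kept unchanged.
    """
    if regions and all(r[3] is not None for r in regions):
        return sorted(regions, key=lambda r: r[3], reverse=True)
    return list(regions)
```

With this approach a MACS file with peak names in column 4 is read without errors and simply keeps its original order.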

              Comment

              • krobison
                Senior Member
                • Nov 2007
                • 734

                #8
                Please add an entry in the software wiki; otherwise you're stuck with what I put there!

                Comment

                • simonvh
                  Member
                  • Jul 2010
                  • 12

                  #9
                  Ah, yes, that was on my to-do list, it's good to be reminded. Done

                  Comment

                  • simonvh
                    Member
                    • Jul 2010
                    • 12

                    #10
                    I just wanted to let you know that GimmeMotifs has been accepted for publication in Bioinformatics:
                    doi: 10.1093/bioinformatics/btq636.

                    The installation procedure has been simplified, and packages for Ubuntu, Debian and Fedora are now available. If you need motif prediction for ChIP-seq data, give it a try and let me know what you think: http://www.ncmls.nl/bioinfo/gimmemotifs/.

                    Comment

                    • krespim
                      Member
                      • Jul 2012
                      • 49

                      #11
                      Hi Simon,

                      First of all, thank you for the tool. I am now preparing to try it out, but since my data is a tad tricky I was wondering if you could give some hints on how best to set up the run.

                      The issue is that the peaks are not from ChIP-seq but from DamID-seq. This means that the motif is not necessarily located in the middle of the peak, and the peaks - if one can call them that - can be quite broad (from 100bp to >5kb). This is for a transcription factor, btw.

                      So the question is, do you have any recommendations for analysing data from this type of experiment (or similar)? At the moment I am selecting peaks smaller than 1kb to use as input.

                      Comment

                      • simonvh
                        Member
                        • Jul 2010
                        • 12

                        #12
                        This is indeed trickier than a typical ChIP-seq run, but most likely not impossible. Basically there are two important things here. First is the fact that the motif is not located in the center of the peak. Most motif programs that are run by GimmeMotifs do not take the location of the motif in the sequence into account. However, by default GimmeMotifs truncates the input sequences to 200 basepairs. This is probably too strict in your case. So I would change the -w parameter to 1000 to use 1kb sequences for searching. Otherwise, even if your input sequences are 1kb, only 200bp would be used as input.
                        Second is the "peak" size. If you have enough regions smaller than 1kb, I would indeed use these for motif searching. You can always check the presence of the motif in the larger sequences later. Otherwise you can just use all regions as input, as GimmeMotifs will truncate the larger sequences. If there are enough sequences that contain a motif, this should not be that big of a problem.
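To make the two suggestions above concrete, here is a rough sketch of selecting regions by size and truncating the larger ones to a fixed width. It is not how GimmeMotifs does it internally: the function name is hypothetical, and centering the window on the region midpoint is an assumption, since the thread only says that sequences are truncated to the width given by -w.

```python
def select_and_truncate(regions, width=1000, max_length=None):
    """Filter (chrom, start, end) regions by length and truncate long ones.

    Regions longer than max_length are dropped (if max_length is given).
    Regions longer than width are cut to a window of exactly `width` bp,
    centered on the midpoint (an assumption). Shorter regions are kept as-is.
    """
    selected = []
    for chrom, start, end in regions:
        length = end - start
        if max_length is not None and length > max_length:
            continue
        if length > width:
            mid = (start + end) // 2
            start = max(0, mid - width // 2)
            end = start + width
        selected.append((chrom, start, end))
    return selected
```

For example, `select_and_truncate(regions, width=1000, max_length=1000)` keeps only the regions of at most 1kb, while leaving out `max_length` keeps every region and trims the broad ones.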

                        Comment

                        • krespim
                          Member
                          • Jul 2012
                          • 49

                          #13
                          Thanks a lot for the suggestions. After posing the question, I selected regions up to 500bp and also up to 1kb (always setting the -w parameter), and got similar motifs with both, which is comforting. The pwmscan.py script also came in handy.

                          Just another couple of things:

                          1. I looked at the manual, but could not find a description of the output of pwmscan.py.

                          2. The results I have for my best motif look good from my interpretation of the report. Is this correct? Here are the results:
                          random
                          enrichment 6.00
                          p-value 0.00
                          ROC_AUC 0.703
                          MNCP 4.116

                          genomic_matched
                          enrichment 2.25
                          p-value 0.00
                          ROC_AUC 0.695
                          MNCP 1.808


                          The p-value=0 is the one that is bugging me.
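One possible reading, which nobody in the thread confirms, is that the report simply shows the p-value with two decimals, so that any sufficiently small value is displayed as 0.00. The snippet below only demonstrates that formatting effect; the p-value used in the example is made up and the helper is hypothetical.

```python
def format_p(p_value, style="fixed"):
    """Format a p-value with two decimals ("fixed") or in scientific notation ("sci")."""
    if style == "fixed":
        return "%.2f" % p_value
    return "%.2e" % p_value

# Hypothetical value, chosen only to illustrate the rounding.
example_p = 3.2e-12
fixed = format_p(example_p)           # two decimals hide the value
scientific = format_p(example_p, "sci")  # scientific notation keeps it visible
```

If that is what happens here, a displayed p-value of 0.00 would just mean "smaller than 0.005" rather than exactly zero.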

                          Comment

                          • dzavallo
                            Member
                            • Apr 2011
                            • 16

                            #14
                            Dear Simon


                            We are contacting you as users of your GimmeMotifs pipeline.
                            We are trying to use the roc.py and cluster.py scripts with a file (PWMFILE) which is not derived from GimmeMotifs. Instead, the matrix I am trying to run is composed of results I got from other predictor scripts. The error message I get when trying to run the ROC script is:

                            command:
                            gimme roc -o kentaro_roc.pdf kentaro2_julio2016 nuevalista_junio2016.fasta 10000_random_promoters_1500pb_masked_not_E011.fasta

                            error:
                            failed to initialize cache
                            global name 'make_region' is not defined
                            Traceback (most recent call last):
                              File "/tools/anaconda2/bin/gimme", line 469, in <module>
                                args.func(args)
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/commands/roc.py", line 40, in roc
                                for scores in s.best_score(fg_file):
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 270, in best_score
                                for matches in self.scan(seqs, 1, scan_rc, cutoff=0):
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 355, in scan
                                for result in it:
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 418, in _scan_sequences
                                motif_digest = self.checksum[motif_file]
                            KeyError: 'kentaro2_julio2016.txt'


                            In a previous version of GimmeMotifs I was able to do this, but I noticed that after updating GimmeMotifs the input file (PWMFILE) is no longer recognized. I have attached the matrix in question so you can see where the error might be.

                            To try to bypass this problem, I started from the very beginning and ran the whole GimmeMotifs pipeline (including all predictors). However, in the step where I have to provide a FASTA file with the whole genome sequence to use as background, I failed to index the whole tomato genome (my samples are from this species). The error message I got this time is:

                            command:
                            gimme background -i SL_todoscrom.fa -f SL.fa -g 2.3 -n 1

                            error:
                            background: error: too few arguments


                            Thank you very much in advance for your help with this. Your comments and suggestions are more than welcome.

                            Best wishes,

                            Comment
