Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • FindPeaks (ChIP-Seq) update

    Hi All,

    I haven't posted anything here on FindPeaks since the early alpha testing days, so I thought I'd give a quick update.

    First of all, the application is now in a "final" stage, as FindPeaks 3.1.9.2. It can be downloaded here. It comes with manuals and several useful tools for dealing with Short reads, and creating UCSC viewable tracks for genomic data.

    Additionally, it has now been published as a bioinformatics application note (here) The article is open access, so no subscription is required to see it.

    There are several things in the works as well. Features currently in development include: Paired End Tag-ChIP-Seq analysis, processing several new types of aligner data (MAQ .map files, SOAP, etc.), and an improved False Discovery Rate technique.

    Finally, the next version of FindPeaks will also be released as an open source package (tentatively named the Vancouver Short Read Analysis Package) through Sourceforge. The package will also include several other programs that use the same infrastructure for SNP detection, non-synonymous SNP detection, exon/transcript/gene coverage and alternative splicing information, etc. I'll post the link to the full source, once it's available.

    While I also watch this forum and take part in discussions here, I've also set up a mailing list for the FindPeaks application through Sourceforge. If you have any questions about the application, you can continue to email me, or you can now also write to the community of users. (Subscribe here.)

    Happy ChIP-Seq-ing. (=

    Anthony
    The more you know, the more you know you don't know. —Aristotle

  • #2
    A couple algorithm questions

    Hello Anthony,

    Thanks for posting your package. I've read your Applications Note and the User manual and was wondering if you would elaborate a bit on your Monte Carlo based FDR estimation. How does this work. Are you randomizing positions of reads? If so do you avoid repositioning to repetitive regions. How do you calculate the effective genome size?

    Have you characterized the accuracy and robustness of your FDR estimation using different spike-in (or pseudo spike-in) data.

    Lastly, are methods built into FindPeaks to control for false positives due to bias in DNA fragmentation, near repetitive read mapping, PCR amplification, etc? We are seeing lots of "significant" peaks when sequencing just input DNA prior to chIP.

    -Thanks, David

    Comment


    • #3
      Hi David,

      Sorry about the delay in responding.

      I'd be happy to explain the Monte Carlo sampling used in FindPeaks 3.1.9.2 (FP3.1), though I suspect your questions are leading towards where I'm heading with FindPeaks 4.0. I think our understanding of the underlying model is improving rapidly, and has definitely improved from the time that 3.1 was released, until now. With that said, FP3.1 uses a much more Naive model that we currently believe is necessary. Improving this has been a high priority for FP4.0.

      Before I go any further, I should also mention that the "FDR" we implemented in FP 3.1 is really more of a thresholding algorithm than a true FDR - though I believe they can be interpreted the same way.

      Have you characterized the accuracy and robustness of your FDR estimation using different spike-in (or pseudo spike-in) data.
      Unfortunately, no, I've suggested the same thing (testing out internal controls) to the users of the application here at the Genome Science Centre, but haven't gotten any traction. I think it's a useful experiment, but not one we've had a lot of chance to test out, yet. I suspect that it would tell us what we already know: FP 3.1's threshold is very conservative.

      Lastly, are methods built into FindPeaks to control for false positives due to bias in DNA fragmentation, near repetitive read mapping, PCR amplification, etc? We are seeing lots of "significant" peaks when sequencing just input DNA prior to chIP.
      As I mentioned, the thresholding used in FP3.1 only took these into account in a very naive way. We had calculated the "effective length" of the genomes in which we were interested (i.e. what fraction of each chromosome is mappable), and used that as a model for our Monte Carlo. This only assumes that some portion of the genome is "beyond reach", and should not be used in our simulation.

      As for the other problems you're suggesting, I don't think anyone understands them well enough to really model them in any form of simulation, just yet. (i.e. Gottardo and Robertson's PIC's code, which shows that bayesian modeling couldn't make any significant improvements over FindPeaks 2 - I don't believe it was even tested against FP3.1) Certainly, to get any sort of handle on bias in DNA fragmentation or near repetitive read mapping, you would have to do a full simulation, which includes generation of random sequences, a full alignment to the reference genome, and then a FindPeaks alignment. At this point, you no longer have a Monte Carlo simulation, but a full control experiment.

      Indeed, that's one of the major issues with ChIP-Seq: we need to do more controls to better understand the underlying processes which give us noise in the data. This will, in turn, help us create better models for the FDR simulations. For the moment, we handle most of our controls with post-processing scripts, comparing the peaks files between samples and controls, where we can identify many of the biases (eg. "significant peaks" that show up in controls) that you're pointing out.

      Since I'm now working on FP4, I'd be happy to include new models for FDR calculations, if anyone would like to suggest them. (Since I will be open-sourcing the whole code in the fall, I'm happy to accept code contributions or otherwise, which will all be inserted under the GPL.) For the moment, I do have a new, less conservative version, in which I have moved away from the "equal number of reads" model, towards an "equal coverage" model. While it appears to be a significant step forward, it's still not perfect.

      We are seeing lots of "significant" peaks when sequencing just input DNA prior to chIP.
      Yes, we see this in our controls. There are no models that can currently account for this, although control experiments can (and are) used to determine which peaks are "spurious". This is an example of some of the post-processing we've been doing, using the peaks file from FindPeaks.

      I hope that answers your questions.

      Anthony
      The more you know, the more you know you don't know. —Aristotle

      Comment


      • #4
        Empirical data

        Thanks for the info. Consider using input sequencing data to control for systematic bias in picking peaks. No need to model the actual process when you can measure the bias directly. I'd also like to think that the control input data can be reused provided the prep methods are similar.

        Comment


        • #5
          Hi David,

          Sorry if I wasn't clear. We're already doing that, but as a post-processing step after FindPeaks, not as part of the FindPeaks package itself. Once you have the peaks file from a control, it can be used against the peaks files from many other samples.

          Unfortunately, the code is part of a separate set of applications over which I have no control, but there are currently discussions within the GSC on open sourcing these programs as well.

          Anthony
          The more you know, the more you know you don't know. —Aristotle

          Comment


          • #6
            Hi Anthony,
            I really like your software! It has found beautiful peaks for me! But I did not understand what your threshold on minimum peak height means. If it is minimal number of reads per window what is the window length?

            Comment


            • #7
              the threshold on the minimum peak height is used to filter out peaks which do not have a height value crossing the given threshold. For instance, if there is a peak observed with a height of 3, and the threshold minimum was given as 4, this peak would not be included in the peaks file.

              FindPeaks doesn't use a windowing algorithm at all - it considers the coverage at every given base pair in the genome, and identifies peaks (or subpeaks) by identifying every point at which there is a zero slope, and values to either side are smaller. This allows for single base pair accuracy, which you can't get in a window algorithm. (The data itself may not give you single base resolution, but FindPeaks won't cause you to lose any information.)

              The coverage at any point is a function of which distribution type is used - so non-integer numbers are allowed when a weighted length distribution is applied. (The contribution of the read is scaled by the distance from the start point according to the minimum, maximum and average lengths provided.)

              Let me know if anything needs further clarification.
              The more you know, the more you know you don't know. —Aristotle

              Comment


              • #8
                Hi Anthony,
                thanks for your answer! Yes, it became clearer now how you calculate coverage at any given point. And for the whole peaks? Is it a simple sum?

                Last time I forgot to sign

                Very best,
                Valentina

                Comment


                • #9
                  Hi Valentina,

                  I'm not sure what you mean by coverage for a whole peak. the maximum height for a peak is just that - the highest point of a peak. If you are using subpeaks, then the maximum height is for the local maximum, and each local maximum is treated separately.

                  The coverage for each base is a sum over each fragment that overlaps that position. When weighted reads (probability based) are used, this is a float value. If a fixed length (XSET) is used, this is an integer value.

                  Cheers,
                  Anthony
                  The more you know, the more you know you don't know. —Aristotle

                  Comment


                  • #10
                    the highest point of a peak! I should have guessed. Thanks!

                    Cheers,
                    Valentina

                    Comment


                    • #11
                      Hi Anthony,

                      May I ask you to add some options in the new FindPeaks? I could change your code myself it you say that it would take long.

                      What I need is:
                      - to be able to change max and min DNA fragment length, not only the average length
                      - when I filter out duplicates I want to keep two codirectional reads as two copies (I'm not going into details, but I think that it is more fair for ChIP-Seq experiments to do like that. After immunoprecipitation we have double stranded DNA, so one DNA fragment can give two identical codirectional reads with 1/2 probability and two different reads with p=1/2. Since we cannot filter out the second case (we don't know the exact size of DNA fragment) it is unfair to filter out the first case. However, if there are more than 2 copies or they have different direction I want to trow them out.


                      Cheers,
                      Valentina

                      Comment


                      • #12
                        Hi Valentina,

                        The first change you've asked for has already been implemented in FindPeaks 3.2. Unfortunately, the documentation didn't get upgraded to reflect this. I just added this in to the wiki. Sorry about that. (It's part of the -dist_type 1 flag)

                        For your second question, If I understand correctly, you'd like to keep two reads that start and end at the same place, but are on opposite strands. That makes sense. It's not a high priority for me, but I'd be happy to do that. If you get impatient, you're more than welcome to create a patch that does it and submit it to me. If not, it might take me a week or two to get around to it, or if stuff gets in the way, until after I compete my comprehensive exams in mid-December.

                        Cheers,
                        Anthony
                        Last edited by apfejes; 11-03-2008, 12:30 PM. Reason: extra word that made no sense (=
                        The more you know, the more you know you don't know. —Aristotle

                        Comment


                        • #13
                          Hi Anthony! Thanks for your quick reply!

                          I was misled about these duplicate reads.. You are right that it may make sense to keep reads which are on the opposite strands. But I was speaking about co-directional reads. And it seems that I was wrong.. They can not be explained as I proposed.

                          So that means that FindPeaks is fine for me! (I'm incorporating control data just filtering out peaks found in control and it seems that it works fine)

                          Best!
                          Valentina

                          Comment


                          • #14
                            Hey Anthony!

                            Where can I download FindPeaks 3.2? What I've managed to find is FindPeaks 3.1.9.2 and FindPeaks 3.1.9. Both of them output:

                            Ignoring parameter 285
                            Ignoring parameter 315
                            invalid input for -aligner parameter, contininuing using "m"
                            Broken Flag: -maq_read_size

                            for

                            java -jar FindPeaks.jar -name PGU6_Q10 -dist_type 1 300 285 315 -minimum 7 -aligner maq -maq_read_size 64 -directional -eff_frac .8 -duplicatefilter -qualityfilter 10 -hist_size 100 -iterations 20 -landerwaterman -prepend chr -output /xxx/ -input xxx.map

                            Best,
                            Valentina

                            Comment


                            • #15
                              Originally posted by valeu View Post
                              Where can I download FindPeaks 3.2? What I've managed to find is FindPeaks 3.1.9.2 and FindPeaks 3.1.9. Both of them output:
                              Findpeaks3.2 is released? Hurray!


                              Anthony,

                              Is there anything you could recommend to use FindPeaks on for someone who isn't familiar with Linux? We have no UNIX machines in our lab, but I am playing around with VirtualPC, Red Hat Linux, and other options (or at least trying to) in order to test out FindPeaks. It's been very frustrating. Do you know of any simple Linux OS that FindPeaks would work sufficiently on?

                              Best,
                              Kevin

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X