Community ChIP-Seq Challenge 1.0
Hello Folks,
We need your help! Yes you!
Here is an experiment in open community development. We are not sure it will work, but we hope it will help with a growing problem.
Given the dozen or so ChIP-Seq analysis applications currently available, we would like to know which algorithms are the best with respect to 1) identifying real ChIP-Seq peaks and 2) estimating confidence in them with a false discovery rate.
We propose a series of tests using spike-in datasets where known truth can be used to objectively measure which methods work well under different conditions.
Towards this end, we have created a spike-in dataset where simulated ChIP-Seq reads were added to experimentally derived input Illumina Genome Analyzer sequence data. Additional input data without spike-ins is also available for use as an input control.
We ask that users (and developers) of particular ChIP-Seq packages download the data, analyze it, and post their lists of ChIP-Seq peaks alongside a detailed description of how they processed the data.
Multiple submissions using the same analysis package from multiple users are encouraged.
It is our hope that this open community experiment will help clarify which analysis packages work well under different conditions and foster continued development of ChIP-Seq algorithms.
So download the data, run it through your favorite ChIP-Seq detector, and publicly post and/or privately submit your lists to us by March 2nd.
Best regards,
David Nix
The Huntsman Cancer Institute and
University of Utah Bioinformatics Shared Resource Center
http://bioserver.hci.utah.edu [email protected]
Details:
1) Mapped sequencing data from human Jurkat T-cell input chromatin DNA from Valouev et al (Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008 Aug 17. http://mendel.stanford.edu/sidowlab/...ign_25.hg18.gz) and from the Graves’ lab (Hollenhorst, P and Graves, B unpublished) were pooled, randomized, and split into 1/3rd and 2/3rd samples of 10 million and 20 million reads, respectively.
2) The USeq Simulator application (http://useq.sourceforge.net/cmdLnMenus.html#Simulator) was used to generate simulated spike-ins. The reads were aligned to the genome using the stand-alone ELAND aligner. From spike-in regions where > ¼ of the reads mapped, a specific number of reads was randomly selected to represent spikes of different concentrations. These were added to the 1/3rd input sample and constitute the ChIP-Seq sample.
3) A few comments: hundreds of spikes have been added, and their size range was selected to closely approximate a real size-selected ChIP-Seq experiment. The strand of the mapped reads has been preserved. The read positions have not been shifted to compensate for the length of the fragments but simply assigned to the center of the 26bp mapped read. Reads that mapped to multiple locations were handled according to the ELAND aligner’s default parameters.
4) The key will be made immediately available to anyone who submits lists of ChIP-Seq peaks and promises not to distribute the key until after April 1.
5) Seven lists should be provided, each ranked best to worst and generated by setting FDR thresholds of 20%, 10%, 5%, 1%, 0.1%, 0.05%, and 0.01%. These should be in BED file format (tab delimited: chrom, start, stop, name, score; e.g. ‘chrX 3599643 3599943 peak37 3219’). Additionally (or alternatively if FDRs cannot be estimated), provide three ranked lists containing the top 500, 1000, and 1500 putative ChIP-Seq peaks. Multiple list sets are acceptable (e.g. one set with strand skew filtering, one without).
6) A description should be provided with the lists explaining how the data was processed, in sufficient detail for someone else to replicate your results (e.g. command lines and/or all application parameters).
7) The key will be publicly released on April 1st.
8) Submissions should be made to [email protected] or publicly posted by March 2nd for inclusion in a summary report. Multiple submissions are encouraged, both before and after the key is released.
9) The data can be downloaded from http://bioserver.hci.utah.edu/ChIPSeqSpikeIns . It is split by sample, strand, and chromosome. Each text file contains a column of base positions (H_sapiens_Mar_2006, hg18) representing the center of each mapped read. See the CCS1.0_Text.zip file. (The data is also available in PointData bar format for direct visualization in the Integrated Genome Browser and for use in USeq applications. See the CCS1.0_PointData_ForUSeq.zip file.)
10) Let us know if you need help reformatting the data for analysis.
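For anyone scripting their submission, the list format requested in item 5 could be produced along the lines of the Python sketch below. It is only an illustration under stated assumptions: the Peak record, its per-peak fdr field, and the rank-based peak names are hypothetical and would need to be adapted to whatever your ChIP-Seq package actually reports.

```python
from typing import Dict, List, NamedTuple, Union


class Peak(NamedTuple):
    """Hypothetical peak record; adapt the fields to your caller's output."""
    chrom: str
    start: int
    stop: int
    score: Union[int, float]  # higher score = better peak (assumption)
    fdr: float                # estimated FDR as a fraction, e.g. 0.05 for 5%


# Thresholds requested in item 5: 20%, 10%, 5%, 1%, 0.1%, 0.05%, 0.01%
FDR_THRESHOLDS = [0.20, 0.10, 0.05, 0.01, 0.001, 0.0005, 0.0001]
TOP_N = [500, 1000, 1500]


def rank(peaks: List[Peak]) -> List[Peak]:
    """Order peaks best to worst by score."""
    return sorted(peaks, key=lambda p: p.score, reverse=True)


def to_bed_lines(peaks: List[Peak]) -> List[str]:
    """Format ranked peaks as tab-delimited lines: chrom, start, stop, name, score."""
    return [
        f"{p.chrom}\t{p.start}\t{p.stop}\tpeak{i}\t{p.score}"
        for i, p in enumerate(rank(peaks), start=1)
    ]


def fdr_lists(peaks: List[Peak]) -> Dict[float, List[str]]:
    """One ranked list per FDR threshold, keeping peaks at or below it."""
    return {
        t: to_bed_lines([p for p in peaks if p.fdr <= t])
        for t in FDR_THRESHOLDS
    }


def top_n_lists(peaks: List[Peak]) -> Dict[int, List[str]]:
    """Ranked lists of the top 500, 1000, and 1500 peaks."""
    ranked = rank(peaks)
    return {n: to_bed_lines(ranked[:n]) for n in TOP_N}
```

Each returned list can be written to its own file, one line per peak. If your package cannot estimate FDRs, only top_n_lists applies.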