SEQanswers
Old 01-13-2009, 11:13 AM   #1
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default ChIP-Seq Challenge

Community ChIP-Seq Challenge 1.0

Hello Folks,

We need your help! Yes you!

Here is an experiment in open community development. We are not sure it will work, but we hope it will help with a growing problem…

Given the dozen or so ChIP-Seq analysis applications currently available, we would like to know which algorithms are the best with respect to 1) identifying real ChIP-Seq peaks and 2) estimating confidence in them with a false discovery rate.

We propose a series of tests using spike-in datasets where known truth can be used to objectively measure which methods work well under different conditions.

Towards this end, we have created a spike-in dataset where simulated ChIP-Seq reads were added to experimentally derived input Illumina Genome Analyzer sequence data. Additional input data without spike-ins is also available for use as an input control.

It is our request that users (and developers) of particular ChIP-Seq packages download the data, analyze it, and post their lists of ChIP-Seq peaks alongside a detailed description of how they processed the data.

Multiple submissions using the same analysis package from multiple users are encouraged.

It is our hope that this open community experiment will help clarify which analysis packages work well under different conditions and foster continued development of ChIP-Seq algorithms.

So download the data, run it through your favorite ChIP-Seq detector, and publicly post and/or privately submit your lists to us by March 2nd.

Best regards,

David Nix

The Huntsman Cancer Institute and
University of Utah Bioinformatics Shared Resource Center
http://bioserver.hci.utah.edu [email protected]


Details:

1) A combined pool of mapped sequencing data from human Jurkat T-cell input chromatin DNA from Valouev et al (Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008 Aug 17. http://mendel.stanford.edu/sidowlab/...ign_25.hg18.gz) and from the Graves’ lab (Hollenhorst, P and Graves, B, unpublished) was merged, randomized, and split into 1/3 and 2/3 samples of 10 million and 20 million reads, respectively.

2) The USeq Simulator application (http://useq.sourceforge.net/cmdLnMenus.html#Simulator) was used to generate simulated spike-ins. The reads were aligned to the genome using the stand-alone ELAND aligner. Spike-in regions where > ¼ of the reads mapped were used to randomly select a specific number of reads to represent spikes of different concentrations. These were added to the 1/3rd input sample and constitute the ChIP-Seq sample.

3) A few comments: hundreds of spikes have been added, and their size range was selected to closely approximate a real, size-selected ChIP-Seq experiment. The strand of the mapped reads has been preserved. The read positions have not been shifted to compensate for the length of the fragments but are simply assigned to the center of the 26 bp mapped read. Reads that mapped to multiple locations were placed following the ELAND aligner’s default parameters.

4) The key will be made immediately available to anyone who submits lists of ChIP-Seq peaks and promises not to distribute the key until after April 1.

5) Seven lists should be provided, each ranked best to worst and generated by setting FDR thresholds of 20%, 10%, 5%, 1%, 0.1%, 0.05%, and 0.01%. These should be in bed file format (tab delimited: chrom, start, stop, name, score; e.g. ‘chrX 3599643 3599943 peak37 3219’); a minimal formatting sketch follows this list. Additionally (or alternatively, if FDRs cannot be estimated), provide three ranked lists containing the top 500, 1000, and 1500 putative ChIP-Seq peaks. Multiple list sets are acceptable (e.g. one set with strand-skew filtering, one without).

6) A description should accompany the lists explaining how the data was processed, in sufficient detail for someone else to replicate your results (e.g. command lines and/or all application parameters).

7) The key will be publicly released on April 1st.

8) Submissions should be made to [email protected] or publicly posted by March 2nd for inclusion in a summary report. Multiple submissions are encouraged, both pre and post key.

9) The data can be downloaded from http://bioserver.hci.utah.edu/ChIPSeqSpikeIns . It is split by sample, strand, and chromosome. Each text file contains a column of base positions (H_sapiens_Mar_2006, hg18) representing the center of each mapped read. See the CCS1.0_Text.zip file. (The data is also available in PointData bar format for direct visualization in the Integrated Genome Browser and for use in USeq applications. See the CCS1.0_PointData_ForUSeq.zip file.)

10) Let us know if you need help reformatting the data for analysis.
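
Regarding the bed files in item 5, here is a minimal sketch in Python of writing one thresholded list in the requested five-column format (the input peak structure and function name are my own assumptions, not part of the challenge tooling):

Code:

    def write_bed(peaks, path):
        """Write peaks, ranked best to worst, as a tab-delimited bed file:
        chrom, start, stop, name, score. peaks is an iterable of
        (chrom, start, stop, score) tuples, best peak first."""
        with open(path, "w") as out:
            for i, (chrom, start, stop, score) in enumerate(peaks, 1):
                fields = (chrom, start, stop, "peak%d" % i, score)
                out.write("\t".join(map(str, fields)) + "\n")

    # e.g. write_bed(peaks_at_5pct_fdr, "myMethod_fdr5.bed")
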
Old 01-14-2009, 11:18 AM   #2
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236
Default

Cool - I won't have time till February, but this sounds neat.
__________________
The more you know, the more you know you don't know. —Aristotle
Old 01-14-2009, 11:57 AM   #3
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 287
Default

Interesting, will definitely try it at some point. I did not understand the positions; should I shift each read +/- 13 bases to get the fragment ends?

It would also be interesting to study how the different programs score peaks that contain multiple binding sites (or spikes) at a short distance, that is, how close the reported center positions come to the simulated peak centers under different conditions. Is this something you have considered doing also?
Old 01-16-2009, 06:23 AM   #4
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default

Yes, if you want the coordinates for a particular read, subtract 13 from and add 13 to the given position (interbase coordinates).
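
A minimal sketch of that conversion in Python (the 13 bp offset is half the 26 bp read length; the function name is my own, not from the challenge tooling):

Code:

    READ_LENGTH = 26

    def center_to_read(center):
        """Given the interbase center of a 26 bp mapped read,
        return its (start, stop) interbase coordinates."""
        half = READ_LENGTH // 2  # 13 bp
        return center - half, center + half

    # e.g. center_to_read(3599656) -> (3599643, 3599669)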

We do have the exact center position from which the randomized fragments were generated and could calculate how close a particular call comes to that center.

Chipper, would you mind running this analysis when the lists are in?
Old 01-19-2009, 12:33 PM   #5
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236
Default

Will it be possible to get the aligned reads in some raw aligned form (Eland, MAQ, exonerate)? I haven't looked at the files posted yet in any detail, but I don't want to have to write an interpreter for whatever format is being used by USeq. (-:
__________________
The more you know, the more you know you don't know. —Aristotle
Old 01-20-2009, 07:12 AM   #6
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default ELAND Sorted and Export data files

Yes, there are many different formats. We hoped that by providing the simplest format (just a position), folks could parse it into something suitable for their favorite application. (USeq uses a binary bar format.)

I have added both ELAND xxx_sorted.txt and xxx_export.txt formatted data sets to the http://bioserver.hci.utah.edu/ChIPSeqSpikeIns directory. Only the chromosome, strand, and position columns have any meaning; the others are identical across rows. The alignment score was set to 74 and the quality boolean to Y. Note, the position was derived by subtracting 12 from the middle positions in the original data files to convert their values into the ELAND coordinate system.
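
For what it's worth, a one-line sketch of that relationship (reading the ELAND position as the read's 1-based start coordinate is my inference from the 26 bp read length, not something stated above):

Code:

    def center_to_eland(center):
        """Interbase center of a 26 bp read -> ELAND position.
        For an interbase read start s: center = s + 13, so
        center - 12 = s + 1, i.e. the read's 1-based start."""
        return center - 12
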
Old 01-23-2009, 06:42 AM   #7
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default Peaks or center base? FDRs? Input libraries?

On 1/20/09 11:35 PM, "XU Han" <[email protected]> wrote:

Hi, David:

It’s an interesting challenge. May I ask you two questions regarding the submission?

1. Should the predicted peak be a single base (i.e., start = end) or a region (start < end)?

2. Does the FDR refer to the global FDR or the local FDR (q-value)?

Also, I noticed that you used an input library to generate the spike-in data, and another input as the control for prediction. Are these two libraries biological replicates or technical replicates?

Han


A response:

Just the ChIP regions, not the base.

Hmm, as for the FDRs, it is probably better to tell you what we want to do with the FDR-thresholded lists.

For each FDR-thresholded list provided, it will be intersected with the key and the real FDR for the list calculated (# non-intersecting false positives / # regions in the provided list). A comparison can then be made between your estimated FDR and the real FDR.

For example, let's say you threshold your binding peaks at a 5% FDR to generate a list of 1500 regions. Of the 1500, only 1000 intersect with the key; the other 500 are false positives. Thus your real FDR for the list would be 500/1500, or 33%. The closer your estimated FDR is to the real FDR, the better.
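
A minimal sketch of that calculation in Python (the region representation and overlap rule are my assumptions; the actual intersection code was not posted):

Code:

    def real_fdr(submitted, key):
        """Fraction of submitted regions that intersect no key region.
        Regions are (chrom, start, stop) tuples; any base of overlap
        counts as a hit. Assumes submitted is non-empty."""
        def overlaps(a, b):
            return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]
        false_pos = sum(1 for s in submitted
                        if not any(overlaps(s, k) for k in key))
        return float(false_pos) / len(submitted)

    # 1500 submitted regions, 1000 hitting the key -> 500/1500 = 33%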

The input data was pooled, then randomly split into thirds; the simulated ChIP-Seq data was added to one of the thirds, and the other two thirds were joined to constitute the input sample. So the two libraries are neither biological nor technical replicates.
Old 02-23-2009, 10:27 AM   #8
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default Prizes (iPods!) and ChIP-Seq Categories

Hello Folks,

Both ABI and Illumina have offered prizes to the winners of the contest; see below. Many thanks to these good folks for supporting the community development of bioinformatics.

Get your lists in ASAP, 7 days and counting...

Here are the categories:

1) Best true positive vs. false positive discriminator. The winning method returns the most spike-ins from the contestant's top 500, 1000, and 1500 best hit lists.

2) Best confidence estimator, derived from the contestant's 10%, 5%, 1%, and 0.1% FDR-thresholded lists. The method with the least cumulative sum of fold differences from the actual FDRs will be the winner.
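
A minimal sketch of how that score might be computed (the exact fold-difference formula is my assumption; the announcement only specifies "least cumulative sum of fold differences"):

Code:

    def confidence_score(estimated, actual):
        """Cumulative sum of fold differences between the estimated
        FDR thresholds (10%, 5%, 1%, 0.1%) and the actual FDRs measured
        against the key. Lower is better; assumes non-zero FDRs."""
        return sum(max(e, a) / min(e, a) for e, a in zip(estimated, actual))

    # e.g. confidence_score([0.10, 0.05, 0.01, 0.001],
    #                       [0.12, 0.06, 0.03, 0.004])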

In the event of a tie in a particular category, the associated prize will be awarded to the person who first submitted their lists. Only one prize per contestant.

Prizes! An iPod Shuffle to the winner of each category, with additional items (water bottles, tee shirts, coffee mugs) to second and third place winners.
Old 02-23-2009, 11:51 AM   #9
chipmaster
Junior Member
 
Location: Mid-atlantic

Join Date: Feb 2009
Posts: 3
Question

Quote:
Originally Posted by Nix View Post
Hello Folks,

For each FDR-thresholded list provided, it will be intersected with the key and the real FDR for the list calculated (# non-intersecting false positives / # regions in the provided list). A comparison can then be made between your estimated FDR and the real FDR.

For example, let's say you threshold your binding peaks at a 5% FDR to generate a list of 1500 regions. Of the 1500, only 1000 intersect with the key; the other 500 are false positives. Thus your real FDR for the list would be 500/1500, or 33%. The closer your estimated FDR is to the real FDR, the better.

1) Best true positive vs. false positive discriminator. The winning method returns the most spike-ins from the contestant's top 500, 1000, and 1500 best hit lists.

2) Best confidence estimator, derived from the contestant's 10%, 5%, 1%, and 0.1% FDR-thresholded lists. The method with the least cumulative sum of fold differences from the actual FDRs will be the winner.

In the event of a tie in a particular category, the associated prize will be awarded to the person who first submitted their lists. Only one prize per contestant.

Prizes! An iPod Shuffle to the winner of each category, with additional items (water bottles, tee shirts, coffee mugs) to second and third place winners.

If I understand it correctly, I am not 100% confident that the outlined evaluation criteria make for an accurate evaluation of submitted lists. It is mentioned that each submitted list "will be intersected with the key and the real FDR for the list calculated (# non-intersecting false positives / # regions in the provided list)." What if I just submit the following as my list? Would the real FDR be zero?

chr1:1-lengthOfChr1
chr2:1-lengthOfChr2
...
...

How is the resolution of the submitted binding regions taken into account during evaluation? To make sure the submitted sites intersect the sites in the key, one can just make the submitted sites longer.

How was the key determined? Were experiments conducted to verify each site in the key, just to make sure the sites in the key are indeed true positives? And just because a submitted site does not intersect any of the key sites, how do we know it is a false positive? One could argue that the key is not complete.

Without proper controls, it just may not be right to decide which method is discriminative with simple evaluation criteria outlined in the challenge.
Old 02-23-2009, 11:54 AM   #10
chipmaster
Junior Member
 
Location: Mid-atlantic

Join Date: Feb 2009
Posts: 3
Default

David, you mentioned that you plan to include the submitted lists as part of a report. By report, do you mean a paper that may be submitted to a journal? If yes, will the participants be included as co-authors?
Old 02-23-2009, 03:05 PM   #11
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default

Hello ChipMaster,

Yes, submission of huge regions would be one way to cheat. It is also very easy to spot and disqualify. Given the number of spike-ins and their random distribution across the genome, the chance of two spike-ins landing next to one another is very slim, so even if you submitted regions in the 5-10 kb range I doubt it would help. That said, let's say that each region should be < 1 kb. Much more than that and your list will get flagged.

Regarding the key, this is a simulation so we know exactly what was added to the experimentally derived input sequencing data. No need for validation. Anything not added is by definition a false positive (the input data was pooled and randomly split).

We're trying to keep the analysis and the criteria for ranking the methods quite simple.

Regarding the initial report, yes, anyone who submits a list or makes a substantial contribution would be an author. Whether the report rises to the level of a publication remains to be seen.
Old 02-24-2009, 08:33 AM   #12
jsp
Junior Member
 
Location: USA

Join Date: Nov 2008
Posts: 5
Default

Hello David,

It's hard to judge the performance of a method for FDR, because most methods can identify the top peaks (say 100) with relatively low FDR, and then I can generate the other FDR lists by replacing some of these peaks (at the bottom) with trash. For example, I could generate the 5% (or 10%) FDR list by replacing 5 (or 10) of these top 100 peaks with 5 (or 10) "trash" regions.
Will I get an iPod from this?

John
Old 02-24-2009, 08:55 AM   #13
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default

Don't know if I can help you here. The FDR estimations I've used are typically tied to a threshold and can be used to filter a list of putative peaks. Relaxing the threshold increases the FDR.

Regarding the iPods, ya can't win if ya don't play so be sure to submit some lists! -cheers, D
Old 02-24-2009, 09:27 AM   #14
chipmaster
Junior Member
 
Location: Mid-atlantic

Join Date: Feb 2009
Posts: 3
Default

Quote:
Originally Posted by Nix View Post
Hello ChipMaster,
That said, let's say that each region should be < 1 kb.
The only reason people use ChIP-Seq over ChIP-chip is that it provides higher resolution; a 1 kb upper limit does not make sense. Since the sequenced DNA fragments are ~200-500 bp in most cases, one should be able to pinpoint the peaks (enriched regions) with ~200-500 bp resolution without using any program at all. Any program that improves upon this should narrow the region (based on the tag directions) down to a few tens of base pairs. Given this, I would argue that the regions cannot be more than 100 or 200 bp.
Old 02-24-2009, 10:01 AM   #15
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 481
Default

Quote:
Originally Posted by chipmaster View Post
The only reason people use ChIP-Seq over ChIP-chip is that it provides higher resolution; a 1 kb upper limit does not make sense. Since the sequenced DNA fragments are ~200-500 bp in most cases, one should be able to pinpoint the peaks (enriched regions) with ~200-500 bp resolution without using any program at all. Any program that improves upon this should narrow the region (based on the tag directions) down to a few tens of base pairs. Given this, I would argue that the regions cannot be more than 100 or 200 bp.
Yes, the initial fragments are 200-500 bp, but peak calling from the depth of coverage does yield wider peaks, which can be narrowed with parameters like tapering off the shoulders, etc.
Old 03-03-2009, 08:27 AM   #16
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default Deadline extension

Hello Folks,

Several groups have asked for an extension; thus, the official submission deadline is now next Tuesday, the 10th of March. Many thanks to those who have submitted processed data and to our sponsors, Illumina and Applied Biosystems.

We still need your help (yes you) to make this a success. You might even win an iPod and get published for your efforts.

If you would, download the data and run it through one of the missing ChIP-Seq packages below.

-cheers, David



Current submissions:

MACS - Mali Salmon-Divon, Tao Liu
Cisgenome - Hongkai Ji
Novel Method - John (Shouyong) Peng
Genomatix Genome Analyzer - Nancy Bretschneider
SWEMBL - Steven Wilder
ERANGE3.0 - Shirley Pepke
Partek - Justin Brown
USeq - David Nix


Missing submissions:

QuEST - http://mendel.stanford.edu/sidowlab/downloads/quest/ - [email protected]

WTD/MSP/MTC - http://compbio.med.harvard.edu/Supplements/ChIP-seq - [email protected] joe[email protected]

F-Seq - http://www.genome.duke.edu/labs/furey/software/fseq - [email protected]

SISSRs - http://www.rajajothi.com/sissrs/ - [email protected] [email protected]

ChromaSig - http://bioinformatics-renlab.ucsd.ed...wiki/ChromaSig - [email protected] [email protected]

ChIPDiff - http://bioinformatics.oxfordjournals...act/24/20/2344 - [email protected]

PoissonMixtureModel - http://www.biomedcentral.com/1471-2164/9/S2/S23 [email protected]

ChIP-Seq - http://www.isrec.isb-sib.ch/chipseq/ - sourceforge, contact through web form

chip-seq - R library - [email protected] [email protected]

Illumina Genome Studio - [email protected]

FindPeaks - http://vancouvershortr.sourceforge.net/ - [email protected]
Old 03-16-2009, 08:12 AM   #17
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default Submissions

Posted all of the submissions to http://bioserver.hci.utah.edu/ChIPSeqSpikeIns/ . Thirteen in total, a great response. Will post the key on April 1st alongside a preliminary analysis summary.
Old 04-01-2009, 01:55 PM   #18
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default Results of the Contest

Hello Folks,

I've posted some draft results as well as the key to http://bioserver.hci.utah.edu/ChIPSeqSpikeIns/ . Comments and constructive criticism are appreciated. Should finalize this in a week.

Congrats and thanks to all the participants. -cheers, David
Old 04-02-2009, 07:48 AM   #19
jsp
Junior Member
 
Location: USA

Join Date: Nov 2008
Posts: 5
Default

Hello David,

I saw that a top 500 peak list can identify 501 key regions, and this doesn't make any sense to me. The reason is either that two key regions overlap too much or that an identified peak region is too big. So I propose the following suggestions:

1. Cleaner key regions -- neighboring key regions with too much overlap (for example, more than 40%) should be merged into a single key region. (A good method should be able to identify key regions with some limited amount of overlap, and that might be the theme for Community ChIP-Seq Challenge 2.0?)

2. A more objective criterion (related to the resolution of the submitted binding regions) -- take the midpoint of each identified peak region and check whether it falls within a key region (a minimal sketch follows these points). ChipMaster raised a question about submitting a list with "chr1:1-lengthOfChr1" before, and the "1 kb rule" still favors results with larger peak regions.

3. The above two suggestions avoid cases where one peak covers two key regions; we also need to avoid cases where a single key region is identified multiple times by small peaks (I don't know whether this has been taken care of already).
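
A minimal sketch of the midpoint criterion from point 2 (the region representation as (chrom, start, stop) tuples is my assumption; this is not part of the challenge code):

Code:

    def midpoint_hits(peaks, key):
        """Count peaks whose midpoint falls within a key region."""
        hits = 0
        for chrom, start, stop in peaks:
            mid = (start + stop) // 2
            if any(k[0] == chrom and k[1] <= mid < k[2] for k in key):
                hits += 1
        return hits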

It will be interesting to see the distribution of distances between the identified peak centers and their corresponding key region centers.

Please change "ParkLab" to "BPC" (which stands for binding profile construction) in the report. My labmate published a package (spp: http://compbio.med.harvard.edu/Supplements/ChIP-seq/) for ChIP-seq peak detection, and it has performed really well on many published real ChIP-seq data sets. I hope that my participation in this challenge with my beta version of BPC won't mislead people into thinking it's the best method from the Park Lab.

Thanks for putting all these together.

Looking forward to challenge 2.0
Old 04-02-2009, 08:13 AM   #20
Nix
Member
 
Location: SLC, Utah

Join Date: Jun 2008
Posts: 60
Default

JSP, you are correct: there are a couple of key regions in close proximity that can be intersected by one candidate region, thus it is possible to hit 501 key regions in the top 500 list.

As far as I am aware, folks' candidate regions aren't excessively large; all are under 500 bp.

The number of double hits is minor and won't affect the overall results.

And no, multiple hits to the same key region only count once.

I'll put together a list of the actual centers used to generate the random fragments and let those interested calculate the intersections. There are several problems with this approach, namely that the observed center is not the same as the actual center, since the read distribution is skewed by the presence of poorly alignable repeats and low-complexity regions. Which do you use? Again, I very much doubt it will change the overall results.

As for additional methods, by all means run them using the simulated data and I can add them to the charts.

Tags
chip-seq, spike-in
