
**apfejes** · 01-14-2009, 12:18 PM

Cool - I won't have time till february, but this sounds neat.

**Chipper** · 01-14-2009, 12:57 PM

Interesting, will definitely try it at some point. I did not understand the positions: should I shift each read +/- 13 bases to get the fragment ends?

It would also be interesting to study how the different programs score peaks that contain multiple binding sites (or spikes) at a short distance, that is, how close the called center positions come to the simulated peak centers under different conditions. Is this something you have considered doing as well?

**Nix** · 01-16-2009, 07:23 AM

Yes, if you want the coordinates for a particular read, subtract 13 from and add 13 to the given position (interbase coordinates).
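A minimal sketch of that arithmetic, assuming the posted value is the read's center in interbase (0-based, end-exclusive) coordinates. The function name `read_interval` is made up for illustration and is not part of USeq.

```python
READ_HALF_LENGTH = 13  # from the post above: subtract 13 and add 13


def read_interval(center: int, half: int = READ_HALF_LENGTH) -> tuple[int, int]:
    """Return (start, end) of a read in interbase coordinates."""
    return center - half, center + half


start, end = read_interval(1000)
print(start, end, end - start)  # 987 1013 26
```

With interbase coordinates the length is simply `end - start`, so the +/- 13 rule implies 26-base reads.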

We do have the exact center position from which the randomized fragments were generated and could calculate how close a particular call comes to that center.

Chipper, would you mind running this analysis when the lists are in?

**apfejes** · 01-19-2009, 01:33 PM

Will it be possible to get the aligned reads in some raw aligned form (Eland, MAQ, exonerate)? I haven't looked at the posted files in any detail yet, but I don't want to have to write an interpreter for whatever format is being used by USeq. (-:

**Nix** · 01-20-2009, 08:12 AM

ELAND Sorted and Export data files

Yes, there are many different formats. We hoped that by providing the simplest (just a position), folks could parse it into something suitable for their favorite application. (USeq uses a binary bar format.)

I have added both ELAND xxx_sorted.txt and xxx_export.txt formatted data sets to the http://bioserver.hci.utah.edu/ChIPSeqSpikeIns directory. Only the chromosome, strand, and position columns have any meaning; the others are identical across rows. The alignment score was set to 74 and the quality boolean to Y. Note that the position was derived by subtracting 12 from the middle positions in the original data files to convert their values into the ELAND coordinate system.
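One way to see where the 12 comes from, as a sketch only: it assumes ELAND positions are 1-based and name the leftmost base of the read, and that the reads are 26 bases long (implied by the +/- 13 rule earlier in the thread). The function names are hypothetical.

```python
READ_LENGTH = 26   # assumption: implied by the +/- 13 rule
ELAND_OFFSET = 12  # stated in the post above


def middle_to_eland(middle: int) -> int:
    """Convert an interbase middle position to an ELAND-style position."""
    return middle - ELAND_OFFSET


def interbase_start_to_one_based(start: int) -> int:
    """The first base after interbase boundary `start` is base `start + 1`."""
    return start + 1


middle = 1000
interbase_start = middle - READ_LENGTH // 2  # 987
# Both routes give the same position under the assumptions above.
assert middle_to_eland(middle) == interbase_start_to_one_based(interbase_start)
print(middle_to_eland(middle))  # 988
```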

**Nix** · 01-23-2009, 07:42 AM

Peaks or center base? FDRs? Input libraries?

On 1/20/09 11:35 PM, "XU Han" wrote:

Hi, David:

It’s an interesting challenge. May I ask you two questions regarding the submission?

1. Should the predicted peak be a single base (i.e., start=end) or a region (start<end)?

2. Does the FDR refer to the global FDR or the local FDR (q-value)?

Also, I noticed that you used an input library to generate the spike-in data, and another input as the control for prediction. Are these two libraries biological replicates or technical replicates?

Han

A response:

Just the ChIP regions, not the base.

Hmm, as far as the FDRs go, it is probably better to tell you what we want to do with the FDR thresholded lists.

Each FDR thresholded list provided will be intersected with the key and the real FDR for the list calculated (# non-intersecting false positives / # regions in the provided list). A comparison can then be made between your estimated FDR and the real FDR.

For example, let's say you threshold your binding peaks at a 5% FDR to generate a list of 1500 regions. Of the 1500, only 1000 intersect with the key; the other 500 are false positives. Thus your real FDR for the list would be 500/1500, or 33%. The closer your estimated FDR is to the real FDR, the better.

The input data was pooled and then randomly split into thirds. The simulated ChIP-seq data was added to one of the thirds, and the other two thirds were joined to constitute the input sample. So the replicates are neither biological nor technical.
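The real-FDR calculation described above can be sketched as follows. This is an illustration, not the organizers' scoring code: regions are assumed to be `(chromosome, start, end)` tuples in interbase coordinates, any overlap counts as an intersection, and the names `intersects` and `real_fdr` are made up here.

```python
Region = tuple[str, int, int]  # (chromosome, start, end), interbase coordinates


def intersects(a: Region, b: Region) -> bool:
    """True if two regions share a chromosome and overlap by at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def real_fdr(called: list[Region], key: list[Region]) -> float:
    """# called regions that miss every key region / # called regions."""
    if not called:
        raise ValueError("the submitted list is empty")
    false_positives = sum(
        1 for region in called if not any(intersects(region, k) for k in key)
    )
    return false_positives / len(called)


# Toy data, invented for the example.
key = [("chr1", 1000, 1200), ("chr2", 5000, 5200)]
called = [("chr1", 1100, 1300), ("chr2", 4900, 5050), ("chr3", 10, 210)]
print(real_fdr(called, key))  # one of three calls misses the key

# The worked example from the post: 500 misses out of 1500 regions.
print(round(500 / 1500, 2))  # 0.33
```

The all-pairs comparison is quadratic and only meant for small lists; a sorted sweep or an interval tree would be the usual choice for genome-scale input.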

**Nix** · 02-23-2009, 11:27 AM

Prizes (iPods!) and ChIP-Seq Categories

Hello Folks,

Both ABI and Illumina have offered prizes to the winners of the contest, see below. Many thanks to these good folks for supporting the community development of bioinformatics.

Get your lists in ASAP, 7 days and counting...

Here are the categories:

1) Best true positive vs. false positive discriminator. The winning method returns the most spike-ins from the contestant's top 500, 1000, and 1500 best hit lists.

2) Best confidence estimator derived from the contestant's 10, 5, 1, and 0.1% FDR thresholded lists. The method with the least cumulative sum of fold differences from the actual FDRs will be the winner.

In the event of a tie in a particular category, the associated prize will be awarded to the person who first submitted their lists. Only one prize per contestant.

Prizes! An iPod Shuffle to the winner of each category with additional items (water bottles, tee shirts, coffee mugs) to 2nd and third place winners. One prize per person.
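A rough sketch of how the two categories might be scored. The post does not define "fold difference" precisely, so this assumes it means the larger of estimated/actual and actual/estimated (always at least 1); that definition, the handling of zero FDRs, and all function names are assumptions for illustration only.

```python
Region = tuple[str, int, int]  # (chromosome, start, end), interbase coordinates


def intersects(a: Region, b: Region) -> bool:
    """True if two regions share a chromosome and overlap by at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def spike_ins_recovered(ranked: list[Region], key: list[Region], top_n: int) -> int:
    """Category 1: number of key spike-ins hit by the top_n best-ranked regions."""
    top = ranked[:top_n]
    return sum(1 for k in key if any(intersects(k, r) for r in top))


def fold_difference(estimated: float, actual: float) -> float:
    """Assumed definition: symmetric ratio, 1.0 means a perfect estimate."""
    if estimated <= 0 or actual <= 0:
        raise ValueError("fold difference is undefined for a zero FDR")
    return max(estimated / actual, actual / estimated)


def cumulative_fold_difference(pairs: list[tuple[float, float]]) -> float:
    """Category 2: sum of fold differences over (estimated, actual) FDR pairs."""
    return sum(fold_difference(est, act) for est, act in pairs)


# Toy data, invented for the example.
key = [("chr1", 1000, 1200), ("chr2", 5000, 5200)]
ranked = [("chr1", 1100, 1300), ("chr3", 10, 210), ("chr2", 4900, 5050)]
print(spike_ins_recovered(ranked, key, top_n=1))
print(spike_ins_recovered(ranked, key, top_n=3))
print(cumulative_fold_difference([(0.10, 0.20), (0.05, 0.05)]))
```

Under this definition a list with an actual FDR of exactly zero cannot be scored, so the real contest would need some convention for that case.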

**chipmaster** · 02-23-2009, 12:51 PM

Originally posted by Nix

Hello Folks,

Each FDR thresholded list provided will be intersected with the key and the real FDR for the list calculated (# non-intersecting false positives / # regions in the provided list). A comparison can then be made between your estimated FDR and the real FDR.

For example, let's say you threshold your binding peaks at a 5% FDR to generate a list of 1500 regions. Of the 1500, only 1000 intersect with the key; the other 500 are false positives. Thus your real FDR for the list would be 500/1500, or 33%. The closer your estimated FDR is to the real FDR, the better.

1) Best true positive vs. false positive discriminator. The winning method returns the most spike-ins from the contestant's top 500, 1000, and 1500 best hit lists.

2) Best confidence estimator derived from the contestant's 10, 5, 1, and 0.1% FDR thresholded lists. The method with the least cumulative sum of fold differences from the actual FDRs will be the winner.

In the event of a tie in a particular category, the associated prize will be awarded to the person who first submitted their lists. Only one prize per contestant.

Prizes! An iPod Shuffle to the winner of each category with additional items (water bottles, tee shirts, coffee mugs) to 2nd and third place winners. One prize per person.

If I understand it correctly, I am not 100% confident that the outlined evaluation criteria make for an accurate evaluation of submitted lists. It is mentioned that each submitted list "will be intersected with the key and the real FDR for the list calculated (#non intersecting false positives/ # regions in the provided list)." What if I just submit the following as my list? Would the real FDR be zero?

chr1:1-lengthOfChr1
chr2:1-lengthOfChr2
...
...

How is the resolution of the submitted binding regions taken into account while evaluating the submitted list of binding regions? To make sure the submitted sites intersect with the sites in the key, one can just make the submitted sites longer.

How was the key determined? Were experiments conducted to verify each site in the key, to make sure that the sites in the key are indeed true positives? And just because a submitted site does not intersect with any of the key sites, how do we know that it is a false positive? One can argue that maybe the key is not complete.

Without proper controls, it just may not be right to decide which method is most discriminative using the simple evaluation criteria outlined in the challenge.

**chipmaster** · 02-23-2009, 12:54 PM

David, You mentioned that you plan to include the submitted lists as a part of a report. By report, do you mean a paper that may be submitted to a journal? If yes, will the participants be included as co-authors?

**Nix** · 02-23-2009, 04:05 PM

Hello ChipMaster,

Yes, submission of huge regions would be one way to cheat. It is also very easy to spot and disqualify. Given the number of spike-ins and their random distribution across the genome, the chance of two spike-ins landing next to one another is very slim, so even if you submitted regions in the 5-10 kb range I doubt it would help. That said, let's say that each region should be < 1 kb. Much more than that and your list will get flagged.
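A sketch of the kind of length check this implies, assuming regions are `(chromosome, start, end)` tuples in interbase coordinates. The function name `flag_oversized` and the example chromosome length are hypothetical.

```python
Region = tuple[str, int, int]  # (chromosome, start, end), interbase coordinates

MAX_REGION_LENGTH = 1000  # from the post: each region should be < 1 kb


def flag_oversized(
    regions: list[Region], max_length: int = MAX_REGION_LENGTH
) -> list[Region]:
    """Return the regions whose length is not strictly below max_length."""
    return [r for r in regions if r[2] - r[1] >= max_length]


submitted = [
    ("chr1", 1000, 1400),    # 400 bp, acceptable
    ("chr2", 0, 1_000_000),  # whole-chromosome style entry, made-up length
]
print(flag_oversized(submitted))  # [('chr2', 0, 1000000)]
```

A whole-chromosome submission like the one proposed earlier would intersect every key site, but every one of its entries would be flagged by a check like this.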

Regarding the key, this is a simulation so we know exactly what was added to the experimentally derived input sequencing data. No need for validation. Anything not added is by definition a false positive (the input data was pooled and randomly split).

We're trying to keep the analysis and the criteria for ranking the methods quite simple.

Regarding the initial report, yes, anyone who submits a list or makes a substantial contribution would be an author. Whether the report rises to the level of a publication will need to be seen.

**jsp** · 02-24-2009, 09:33 AM

Hello David,

It's hard to judge the performance of a method by FDR, because most methods can identify the top peaks (say 100) with a relatively low FDR, and then I can generate other FDR lists by replacing some of the peaks at the bottom with trash. For example, I could generate a 5% (or 10%) FDR list by replacing 5 (or 10) of these top 100 peaks with 5 (or 10) junk regions.
Will I get an iPod from this?

John

**Nix** · 02-24-2009, 09:55 AM

Don't know if I can help you here. The FDR estimations I've used are typically tied to a threshold and can be used to filter a list of putative peaks. Relaxing the threshold increases the FDR.

Regarding the iPods, ya can't win if ya don't play so be sure to submit some lists! -cheers, D

**chipmaster** · 02-24-2009, 10:27 AM

Originally posted by Nix

Hello ChipMaster,
That said let's say that each region should be < 1kb.

The only reason people use ChIP-Seq over ChIP-chip is that it provides higher resolution. 1 Kb upper limit does not make sense. Since the sequenced DNA fragments are ~200-500 bp in most cases, without having to use any program, one should be able to pinpoint the peaks (enriched regions) with a ~200-500 bp resolution. Any program that improves upon this should at least make sure to narrow down the region (based on the tag directions) to a few tens of base pairs. Given this, I would argue that the regions cannot be more than 100 or 200bp.

**bioinfosm** · 02-24-2009, 11:01 AM

Originally posted by chipmaster

The only reason people use ChIP-Seq over ChIP-chip is that it provides higher resolution. 1 Kb upper limit does not make sense. Since the sequenced DNA fragments are ~200-500 bp in most cases, without having to use any program, one should be able to pinpoint the peaks (enriched regions) with a ~200-500 bp resolution. Any program that improves upon this should at least make sure to narrow down the region (based on the tag directions) to a few tens of base pairs. Given this, I would argue that the regions cannot be more than 100 or 200bp.

Yes, the initial fragments are 200-500 bp, but peak calling from the depth of coverage does yield wider peaks, which can be narrowed with parameters such as tapering off the shoulders.


ChIP-Seq Challenge
