SEQanswers

Old 11-28-2011, 06:07 PM   #1
Jeannine
Member
 
Location: Australia

Join Date: Sep 2009
Posts: 14
Default ChIP-seq: distance to TSS

Hi everyone,

First I have to apologize for a probably stupid question, but I'm an absolute beginner in ChIP-seq data analysis. I tried to generate a graph showing the distance to TSS of my reads. I used the ENSEMBL TSS for this analysis. But now I'm not sure if I should use all the different transcripts (and hence TSS) for each gene or how do I pick the "most expressed" or "most likely" transcript?
What is the right way to do it, or is there a way to only get the TSS of expressed/functional/main transcripts?

Thanks,
Jeannine
Old 11-28-2011, 10:59 PM   #2
mudshark
Senior Member
 
Location: Munich

Join Date: Jan 2009
Posts: 138
Default

you simply need a list of expressed genes of your model system determined using RNASeq or expression microarray (look in the databases) and filter your TSS list.
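The filtering step described above could be sketched as follows. This is only a minimal illustration: the record layout, the `min_expression` cutoff and all gene names are assumptions made for the example, not something prescribed in this thread.

```python
from typing import NamedTuple

class Tss(NamedTuple):
    gene_id: str
    chrom: str
    pos: int
    strand: str

def filter_expressed(tss_list, expression, min_expression=1.0):
    """Keep only TSSs whose gene has expression >= min_expression.

    Genes missing from the expression table are treated as not expressed.
    """
    return [t for t in tss_list
            if expression.get(t.gene_id, 0.0) >= min_expression]

# Hypothetical toy data: geneA has two annotated TSSs, geneC has no measurement.
tss_list = [
    Tss("geneA", "chr1", 1000, "+"),
    Tss("geneA", "chr1", 1800, "+"),
    Tss("geneB", "chr1", 5000, "-"),
    Tss("geneC", "chr2", 700, "+"),
]
expression = {"geneA": 12.5, "geneB": 0.2}  # one value per gene (assumed units)

expressed_tss = filter_expressed(tss_list, expression)
```

Note that with gene-level expression values (e.g. from a microarray) this keeps every annotated TSS of an expressed gene; it cannot tell which of several alternative TSSs is actually in use.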
Old 11-28-2011, 11:21 PM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

I assume you are referring to the TSS profile plots as used e.g. in Fig. 2 of Barski et al. (2007).

To get these, one usually takes a list of all TSSs and then finds the reads close to them. Conceptually, you cut out the coverage curves of your ChIP-Seq data in windows around all the TSSs, stack them on top of each other and add them up. See this thread for some options on how to do this.

If you take all TSSs, your profile will put most weight on those TSSs which have many ChIP-Seq reads close by. If you want to get a more detailed look, you typically stratify the TSSs, e.g., by expression strength (which you get from RNA-Seq or microarray assays) or any other feature you hypothesize to influence your binding.
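The "cut out, stack and add up" idea can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the code used in any of the cited work: the coverage is assumed to be available as one per-base array per chromosome, TSS positions are 0-based, and the function name and toy data are made up for the example.

```python
import numpy as np

def tss_profile(coverage, tss_sites, halfwidth):
    """Sum coverage in windows of +/- halfwidth bp around each TSS.

    coverage  : dict mapping chromosome name -> 1-D array of per-base coverage
    tss_sites : iterable of (chrom, pos, strand), pos 0-based
    Windows running off a chromosome end are skipped. Minus-strand windows
    are reversed so that index 0 is always the upstream edge.
    """
    profile = np.zeros(2 * halfwidth + 1, dtype=float)
    n_used = 0
    for chrom, pos, strand in tss_sites:
        cov = coverage[chrom]
        start, end = pos - halfwidth, pos + halfwidth + 1
        if start < 0 or end > len(cov):
            continue
        window = cov[start:end]
        if strand == "-":
            window = window[::-1]
        profile += window
        n_used += 1
    return profile, n_used

# Toy data: signal in the 5 bp immediately upstream of each TSS.
coverage = {"chr1": np.zeros(100)}
coverage["chr1"][15:20] = 1.0   # upstream of the plus-strand TSS at 20
coverage["chr1"][71:76] = 1.0   # upstream of the minus-strand TSS at 70
tss_sites = [("chr1", 20, "+"), ("chr1", 70, "-")]

profile, n_used = tss_profile(coverage, tss_sites, halfwidth=10)
```

Stratifying by expression then just means calling the same function once per group of TSSs and comparing the resulting curves (dividing by `n_used` makes groups of different size comparable).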
Old 11-29-2011, 01:50 AM   #4
ffinkernagel
Senior Member
 
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110
Default

I think Jeannine wants a plot showing the distance of each peak to the closest (annotated) TSS - i.e. a histogram.

For this she has been using the Ensembl gene starts, but is wondering whether to use the Ensembl transcript starts instead.

Personally, I've come around to using the transcript starts - there are many instances where the Ensembl genes are extended upstream from the RefSeq annotation and the 'internal' TSS is just as (or even more) sensible.
I don't select among the TSS from a single gene though - I use every single (distinct) TSS for calculating the closest one to a given binding region.
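A minimal sketch of such a distance histogram in Python, assuming each binding region has been reduced to a single summit position and the distinct TSS positions are kept as a sorted list per chromosome. Strand is ignored here for brevity, and the function name, bin size and toy coordinates are illustrative assumptions.

```python
from bisect import bisect_left
from collections import Counter

def nearest_tss_distance(summit, tss_positions):
    """Signed distance (summit - TSS) to the nearest TSS.

    tss_positions must be a sorted list of distinct positions on the
    same chromosome as the summit.
    """
    i = bisect_left(tss_positions, summit)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = tss_positions[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda t: abs(summit - t))
    return summit - nearest

# Building the list from a set drops duplicate transcript starts.
tss_by_chrom = {"chr1": sorted({100, 500, 900, 500})}
summits = [("chr1", 120), ("chr1", 480), ("chr1", 950), ("chr1", 50)]

distances = [nearest_tss_distance(pos, tss_by_chrom[chrom])
             for chrom, pos in summits]

bin_size = 100
# Floor division puts e.g. -20 into the bin starting at -100.
histogram = Counter((d // bin_size) * bin_size for d in distances)
```

To make the sign mean upstream/downstream rather than left/right on the chromosome, one would flip it for minus-strand TSSs.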
Old 11-29-2011, 02:39 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

In a way, the TSS plots that I am talking about are approximately histograms of distances to the closest TSS, with 1 bp bin size. So we are talking about the same thing. The trick is that you do not necessarily need to select one TSS per gene or take only the closest. If you take all of them, the picture hardly changes, because the contribution from the "wrong" ones just adds uniform background noise, which sort of "lifts up" the curve without changing its shape. I'd probably have to draw a few pictures to make this clearer.
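In place of a picture, here is a small numerical illustration of the "lifting up" argument. It is an idealized toy case: it assumes the "wrong" TSSs sit in perfectly flat background, which real data will only approximate, and all positions and coverage values are made up.

```python
import numpy as np

halfwidth = 50
coverage = np.full(1000, 2.0)   # flat background: 2 reads per base
coverage[190:211] += 10.0       # enrichment around the one "right" TSS at 200

def window_sum(cov, sites, halfwidth):
    """Add up the coverage windows around all given sites."""
    return sum(cov[p - halfwidth:p + halfwidth + 1] for p in sites)

right_only = window_sum(coverage, [200], halfwidth)
with_wrong = window_sum(coverage, [200, 600, 800], halfwidth)

# The two "wrong" TSSs each contribute the flat background of 2.0,
# so the curve is shifted up by a constant and keeps its shape.
offset = with_wrong - right_only
```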
Old 11-29-2011, 02:53 AM   #6
ffinkernagel
Senior Member
 
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110
Default

Simon, I don't follow. You plot coverage, which might be up to 72 in a deduplicated 36 bp ChIP-seq experiment.
But for the distance histogram plot, I'd add just one count per binding region (let's say we're using the summit).
Plus the distance histogram is essentially centered around binding regions, while your plot is centered around the TSS.
How does one degenerate into the other?

In my opinion, using fewer transcription start sites ('gene starts') will bias the histogram to be more flat - certainly the average distance to the next TSS will rise.
Old 11-29-2011, 03:00 AM   #7
mudshark
Senior Member
 
Location: Munich

Join Date: Jan 2009
Posts: 138
Default

sorry I did not get it in the first place.

I think that in order to appreciate all alternative TSS you should know which ones are actually used in the model system you are analyzing. Therefore you need, for example, RNA polymerase or RNA-Seq data to define the true transcription start(s).

On the other hand I agree with Simon, if you just go for the annotated gene start, you probably do not do much harm to the analysis. What's the fraction of wrongly annotated gene starts anyway?
Old 11-29-2011, 03:02 AM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Ok, you were talking of peaks, not reads. Sorry, I didn't read this properly.

I had histone modifications in mind, such as in the Barski et al. paper, and there, peak calling will not work well, as these marks are often quite stretched out. Still, as you can see in this and similar papers, these "TSS profile" plots are quite informative for histone marks. I even think they may still be useful in the case of more sharply peaked features such as TF binding sites, because they allow for an analysis without using a peak finder and so can sidestep issues relating to peak finding tuning parameters.
Old 11-29-2011, 03:32 AM   #9
ffinkernagel
Senior Member
 
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110
Default

@simon: no harm done - just wanted to be clear whether I'd imagined the difference.

I also agree with you, for many datasets profile plots are more informative - especially since many TF peaks will overlap the transcription start site and you'll either use an arbitrary position for each one (such as the summit) or assign them to the '0 distance' bin.

@mudshark: "What's the fraction of wrongly annotated gene starts anyway?"
I wouldn't say wrongly annotated. There are many genes showing alternative promoter use, and ignoring these promoters will lead you to conclude that fewer of your peaks are associated with TSS.

I just did a quick check. Ensembl 64, Homo sapiens.
54013 annotated genes in the database. 14423 have transcript starts that are more than 1000 bp from each other. 9817 have transcript starts that are more than 10 kb apart.

So about 27% of all genes are annotated with alternative promoter usage.
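A check of this kind could look roughly like the sketch below. It reads "transcript starts more than N bp from each other" as the span between a gene's most distant transcript starts, which is an assumption about how the numbers above were computed; the input format and toy records are likewise made up.

```python
from collections import defaultdict

def genes_with_spread_starts(transcripts, min_spread):
    """Return gene IDs whose transcript starts span more than min_spread bp.

    transcripts: iterable of (gene_id, transcript_start) pairs.
    """
    starts = defaultdict(list)
    for gene_id, tss in transcripts:
        starts[gene_id].append(tss)
    return {g for g, s in starts.items() if max(s) - min(s) > min_spread}

# Hypothetical toy annotation: only g2 has starts more than 1000 bp apart.
transcripts = [("g1", 1000), ("g1", 1200),
               ("g2", 5000), ("g2", 7500),
               ("g3", 42)]
spread = genes_with_spread_starts(transcripts, 1000)

# Arithmetic behind the percentage quoted above (Ensembl 64 counts).
fraction = 14423 / 54013   # roughly 0.267
```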
Old 11-29-2011, 03:46 AM   #10
mudshark
Senior Member
 
Location: Munich

Join Date: Jan 2009
Posts: 138
Default

what if your ChIP target is (biologically) not preferentially associated with transcription start/PIC or RNA polymerase but is e.g. associated with splicing? wouldn't the approach to take the closest TSS bias the analysis (even more)?

and as regards the 27%, the question is still up how many of the alternative TSS are really used (in your model system)
Old 11-29-2011, 04:54 AM   #11
ffinkernagel
Senior Member
 
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110
Default

I'm not sure I understand your point. If your hypothesis is 'my chip target associates with TSS', then yes, you will need to discount for the fact that 'more TSS' means 'lower average distance' - even for 'random' or non-TSS associated binding sites.
But a factor associated with splicing should have a large number of binding sites not overlapping a TSS.

Quote:
and as regards the 27%, the question is still up how many of the alternative TSS are really used (in your model system)
I'd say the question would be which of the alternative TSS of each gene is used in a given model system.

Arguably, you could use the strongest / most prominent TSS for each gene in your given model system, if you had the RNA or PolII data.
But a gene might be 'off' (for any given value of off) while still having its (inactive) TSS bound by your factor of choice.
That's still an association between the factor and the TSS in my book - and you trade information about your model organism - which hopefully integrates any number of tissues and conditions - against the current cell (population) state in your particular model system.

I can imagine situations and questions where either view is the appropriate choice.

Regarding the question 'use Ensembl gene starts' vs 'use Ensembl transcript starts',
one should keep in mind that Ensembl will generally annotate the most 5' TSS in any condition as the gene start.

Last edited by ffinkernagel; 11-29-2011 at 04:56 AM. Reason: adding a minor point
Old 11-29-2011, 06:26 AM   #12
mudshark
Senior Member
 
Location: Munich

Join Date: Jan 2009
Posts: 138
Default

it probably boils down to the question: what is your target protein?

if you have a general transcription factor you can assume a rather fixed distance to polymerase peaks (if there are any). but does that tell you anything about the TSS, what if you have one annotated TSS 500 bp upstream and another one 1500 bp downstream of your peak? do you take into account the 'type' of promoter?

if you have a specific transcription factor it already gets more problematic as you cannot assume a fixed distance to the TSS. what if the factor binds systematically to intronic enhancers?

if you are not sure if your factor is a bona fide transcription factor, you cannot assume anything.

given these uncertainties, i would settle with the idea that whatever you do is not entirely correct (as long as you don't have additional information). so why make a big fuss about alternative TSS?
Old 11-29-2011, 03:01 PM   #13
Jeannine
Member
 
Location: Australia

Join Date: Sep 2009
Posts: 14
Default

What a nice discussion, thanks a lot for that, very informative!!

Anyway, for my purpose I decided to use all the TSS for each gene, as suggested. Even though I very much liked the idea of solely using the expressed transcripts in my system, I unfortunately only have microarray data (not transcript specific), but that will hopefully change soon (RNA-seq in work).

Thanks again,
Jeannine
Old 04-13-2016, 05:45 PM   #14
liweixie
Member
 
Location: Ann Arbor

Join Date: Oct 2013
Posts: 21
Default

Quote:
Originally Posted by mudshark View Post
you simply need a list of expressed genes of your model system determined using RNASeq or expression microarray (look in the databases) and filter your TSS list.
I would like to ask: if I extract a list of genes of interest from RNA-seq and want to map their TSS regions in the ChIP-seq data to draw a TSS plot, how can I do this?
Thanks!
Old 04-13-2016, 11:38 PM   #15
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by liweixie View Post
I would like to ask: if I extract a list of genes of interest from RNA-seq and want to map their TSS regions in the ChIP-seq data to draw a TSS plot, how can I do this?
Thanks!
I would recommend using deepTools, namely the computeMatrix and plotProfile functions.
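As a rough sketch of how these two tools might be chained, the Python helper below only builds the two command lines. It assumes the ChIP-seq coverage is already available as a bigWig file and the genes of interest as a BED file; all file names, the 2 kb flank and the helper function itself are hypothetical, and the deepTools documentation should be consulted for the full set of options.

```python
def deeptools_tss_commands(bigwig, regions_bed, flank=2000,
                           matrix="tss_matrix.gz", plot="tss_profile.png"):
    """Build computeMatrix / plotProfile command lines for a TSS profile."""
    compute = [
        "computeMatrix", "reference-point",
        "--referencePoint", "TSS",
        "--scoreFileName", bigwig,
        "--regionsFileName", regions_bed,
        "--beforeRegionStartLength", str(flank),
        "--afterRegionStartLength", str(flank),
        "--outFileName", matrix,
    ]
    plot_cmd = [
        "plotProfile",
        "--matrixFile", matrix,
        "--outFileName", plot,
    ]
    return compute, plot_cmd

# Hypothetical input files.
compute, plot_cmd = deeptools_tss_commands("chip.bw", "genes_of_interest.bed")

# To actually run them (requires deepTools to be installed):
#   import subprocess
#   subprocess.run(compute, check=True)
#   subprocess.run(plot_cmd, check=True)
```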