SEQanswers

Old 12-05-2011, 07:41 AM   #1
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 112
Default What is the cuffdiff output really telling you?

I've got two different cellular fractions and I'm looking for genes that are alternatively spliced, alternatively polyadenylated, differentially expressed, etc. I'm running cufflinks/cuffdiff in Galaxy and I'm trying to grok what the different tests are doing.

Cuffdiff outputs 11 files (4 FPKM tracking files, 7 files of results). Omitting the 4 FPKM tracking files, here are the 7 results files with a snippet from the cuffdiff documentation:

1. Differential expression testing for transcripts: FPKM of one group vs FPKM of the other.
2. Differential expression testing for genes: This sums the FPKM for transcripts sharing the same gene_id.
3. Differential expression testing for coding sequence (CDS): This sums the FPKM of transcripts sharing a common p_id, which is the id of the coding sequence that this transcript contains.
4. Differential expression testing for primary transcripts: This sums FPKM of transcripts sharing a common tss_id (transcription start site).
5. Differential splicing tests: For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript.
6. Differential coding output: For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples.
7. Differential promoter use: For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.
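To make the grouping in tests #2-4 concrete, here is a minimal sketch of summing transcript FPKM over a shared gene_id, tss_id, or p_id. The transcript records, IDs, and FPKM values are made up for illustration, and `sum_fpkm_by` is a hypothetical helper, not anything cuffdiff provides.

```python
from collections import defaultdict

# Hypothetical transcripts; the IDs and FPKM values are invented for illustration.
transcripts = [
    {"transcript_id": "T1", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P1", "fpkm": 10.0},
    {"transcript_id": "T2", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P2", "fpkm": 5.0},
    {"transcript_id": "T3", "gene_id": "G1", "tss_id": "TSS2", "p_id": "P1", "fpkm": 2.5},
]


def sum_fpkm_by(records, key):
    """Sum FPKM over all transcripts that share the same value of `key`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["fpkm"]
    return dict(totals)


gene_fpkm = sum_fpkm_by(transcripts, "gene_id")  # gene level (test #2)
cds_fpkm = sum_fpkm_by(transcripts, "p_id")      # CDS level (test #3)
tss_fpkm = sum_fpkm_by(transcripts, "tss_id")    # primary transcript level (test #4)

print(gene_fpkm, cds_fpkm, tss_fpkm)
```

The same three transcripts give one total per gene, two per CDS, and two per TSS group, which is why the four differential expression files can disagree about what is changing.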

My questions are:

1. How are tests for differential splicing (#5) different from tests for differential coding output (#6)?
2. How are the tests for differential gene expression summing over gene ids (#2) different from tests for gene expression summing over CDS ids (#3)?
3. Tests #5-7 above are testing something fundamentally different than the tests for differential gene expression (tests #1-4). I'd like a good explanation of how these groups of tests differ. E.g. how does #3 (differential expression over CDS) differ from #6 (differential coding output).

Thanks very much in advance.
Old 12-05-2011, 09:20 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

1. Different splice forms may have the same coding sequence. For example, the 5' UTR may be different.

2. A single gene may produce multiple splice forms containing different coding exons, resulting in different CDS.

3. I'm not familiar with these, although #7 isn't really like the others since it's concerned with the first exon only.
Old 04-20-2012, 03:10 PM   #3
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Hey guys,

Does anyone have a good answer to turnersd's 3rd question? It does seem as if 5-7 are derived from 1-4. And it certainly seems like #3 is pretty much the same as #6.
Old 04-21-2012, 03:47 AM   #4
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 112
Default

I believe 1-4 are grouping (or not grouping, in the case of isoforms) transcripts at the level of the gene (2), coding sequence (3), and transcription start site use (4), and testing for differential expression of these groups of transcripts between conditions. And I believe #5-7 are looking at whether there is significantly different TSS usage, or an imbalance in TSS usage overall? Does this make sense?
Old 04-23-2012, 05:23 AM   #5
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Yes, it makes sense. I'm just noticing there is a lot of overlap, which is good; I get that it lets you use whichever method you're comfortable with. I guess the main issue I'm having is deciding which method to use. Does anyone know of any papers that actually use cuffdiff (that aren't published in PLOS One)?
Old 04-23-2012, 04:26 PM   #6
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

All of the *_exp.diff files cuffdiff produces are literally differential expression outputs. The remaining outputs test whether, for each pairwise comparison, there is a significant change in the balance of expression at a locus. Note that those files only include results for genes that have at least 2 splice variants.

For example, say a gene has two isoforms A and B and you have two conditions 1 and 2. In condition 1 the balance of expression at those isoforms is A=0.3 and B=0.7 where those sum up to 1 or 100% of the expression at that locus. Say cuffdiff finds that in condition 2 the balance has changed such that A=0.8 and B=0.2. Depending on the variability across replicates, of course, that change may end up being reported statistically significant. This result would be found in the splicing.diff file.
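As a rough sketch of the A/B example above: the Cufflinks manual describes the sqrt(JS) column in these files as the square root of the Jensen-Shannon divergence between the relative abundances, so the size of the shift can be computed by hand. This only measures how far the balance moved; it does not reproduce cuffdiff's significance test, which also accounts for variability across replicates. The function names are my own.

```python
import math


def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint distribution
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Relative isoform abundances (A, B) from the example above.
cond1 = [0.3, 0.7]
cond2 = [0.8, 0.2]

js = js_divergence(cond1, cond2)
sqrt_js = math.sqrt(js)
print(f"JS = {js:.3f}, sqrt(JS) = {sqrt_js:.3f}")
```

With base-2 logs the divergence runs from 0 (identical balance) to 1 (no overlap at all), so it is a measure of how much the isoform mix changed rather than how much total expression changed.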

You can apply the same thinking to make sense of the cds.diff and promoters.diff files. Again, each of these files only tests genes with more than one isoform.

*Edit*

I guess you can think of those three files as a more general result than the corresponding *_exp.diff files. Since the *_exp.diff files specifically test each gene/CDS/isoform/TSS, they don't give you a number that tells you if there's an overall change in the expression across a locus. Whether or not these files are interesting to you probably just depends on what it is you're looking for. If you're generally interested in genes that aren't necessarily differentially expressed but might be producing different amounts of the different proteins they code for, then you might start with the cds.diff file for your gene list.

Last edited by sdriscoll; 04-23-2012 at 04:30 PM.
Old 04-30-2012, 10:00 AM   #7
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Hey guys,

So I've been playing with this stuff a bit more, and I was hoping you guys could shed some light on this.

So the splicing.diff file and the tss_group file are exactly the same, except splicing.diff uses Jensen-Shannon and tss_group uses a p-value. No idea why the authors included both. But much more importantly, why is it even called splicing.diff? Splicing.diff and tss_group measure differences between samples based on transcription start site, so shouldn't that actually be differential promoter use?
Old 04-30-2012, 10:58 AM   #8
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

So splicing.diff, cds.diff and promoters.diff measure something different from each of the *_exp.diff files. Instead of differential expression, they measure the significance of the difference in expression balance at any given locus. Also, splicing.diff is telling us something about differential splicing even between multiple isoforms that have the same promoter. Therefore promoters.diff is more generalized than splicing.diff.

I can think of a possible example that makes these files make more sense. Say I have two samples, A and B, and I'm wondering if sample B tends to have different promoter usage than sample A. I could figure this out based on the output of isoform_exp.diff or tss_group_exp.diff, but the file promoters.diff tells me this directly. We get a p-value telling us if sample B has significantly different promoter usage at any gene locus relative to sample A.

The same scenario could come up for coding sequence. Is sample B producing significantly different proteins relative to sample A? The cds.diff file gives you that estimation. Now you don't have to parse the isoform_exp.diff file and figure out which ones are differentially expressed and which ones have CDS regions, etc.

As for splicing.diff, this file gives you a general measure of differential splicing between samples: in sample B, is there a significantly different balance of expression across isoforms at any gene locus relative to sample A? This isn't as specific as asking, "which isoforms are differentially expressed"; it's just a general measurement. In other words, you can very quickly have a gene list for those genes that seem to be differentially spliced in sample B relative to sample A. Again, you could probably build this list by parsing isoform_exp.diff, but you'd have to filter out single-isoform genes and you'd also be buried in a file with 90,000 rows instead of one that's already summarized into 30,000 loci (or fewer).
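For comparison, a minimal sketch of the manual route through isoform_exp.diff might look like the following. It assumes the file is tab-delimited with test_id, gene_id, and significant columns (check the header of your own file), and the rows here are made up. Note that it only flags multi-isoform genes with at least one differentially expressed isoform, which is not the same thing as the balance test reported in splicing.diff.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; the column names are an assumption about isoform_exp.diff.
EXAMPLE = (
    "test_id\tgene_id\tsample_1\tsample_2\tsignificant\n"
    "TCONS_1\tXLOC_1\tA\tB\tyes\n"
    "TCONS_2\tXLOC_1\tA\tB\tno\n"
    "TCONS_3\tXLOC_2\tA\tB\tyes\n"
    "TCONS_4\tXLOC_3\tA\tB\tno\n"
    "TCONS_5\tXLOC_3\tA\tB\tno\n"
)


def candidate_genes(handle):
    """Return multi-isoform genes with at least one significant isoform."""
    isoforms = defaultdict(set)   # gene_id -> all isoform ids seen
    sig_hits = defaultdict(set)   # gene_id -> isoform ids flagged significant
    for row in csv.DictReader(handle, delimiter="\t"):
        isoforms[row["gene_id"]].add(row["test_id"])
        if row["significant"] == "yes":
            sig_hits[row["gene_id"]].add(row["test_id"])
    # Drop single-isoform genes and genes with no significant isoform.
    return sorted(
        gene for gene, ids in isoforms.items()
        if len(ids) > 1 and sig_hits.get(gene)
    )


genes = candidate_genes(io.StringIO(EXAMPLE))
print(genes)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` wrapper.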

does that make sense?
Old 02-08-2020, 08:30 AM   #9
bbm
Member
 
Location: North Carolina

Join Date: Sep 2011
Posts: 38
Default splicing file from tophat

So in my case I looked at splicing.diff and there is no significant hit.
If I dig into isoform_exp.diff, there are isoforms that are differentially expressed. If one gene_id appears twice or more, does it mean a splicing event occurred in that gene?
Thanks
Tags
cuffdiff, cufflinks, rna-seq, rna-seq data analysis
