  • What is the cuffdiff output really telling you?

    I've got two different cellular fractions and I'm looking for genes that are alternatively spliced, alternatively polyadenylated, differentially expressed, etc. I'm running cufflinks/cuffdiff in galaxy and I'm trying to grok what the different tests are doing.

    Cuffdiff outputs 11 files (four FPKM tracking files and seven results files). Omitting the four FPKM tracking files, here are the seven results files, each with a snippet from the cuffdiff documentation:

    1. Differential expression testing for transcripts: FPKM of one group vs FPKM of the other.
    2. Differential expression testing for genes: This sums the FPKM for transcripts sharing the same gene_id.
    3. Differential expression testing for coding sequence (CDS): This sums the FPKM of transcripts sharing a common p_id, which is the id of the coding sequence that this transcript contains.
    4. Differential expression testing for primary transcripts: This sums FPKM of transcripts sharing a common tss_id (transcription start site).
    5. Differential splicing tests: For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript.
    6. Differential coding output: For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples.
    7. Differential promoter use: For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.
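The groupings behind tests 1-4 can be sketched as a sum of FPKM over a shared identifier. This is only an illustration under stated assumptions, not cuffdiff's implementation: the transcript records, ids, and FPKM values are invented, and `sum_fpkm` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical transcripts: (transcript_id, gene_id, tss_id, p_id, FPKM).
# All identifiers and FPKM values are invented for illustration.
transcripts = [
    ("T1", "G1", "TSS1", "P1", 10.0),
    ("T2", "G1", "TSS1", "P2", 5.0),
    ("T3", "G1", "TSS2", "P1", 20.0),
    ("T4", "G2", "TSS3", "P3", 8.0),
]


def sum_fpkm(records, key_index):
    """Sum FPKM over all transcripts that share the id stored at key_index."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[4]
    return dict(totals)


gene_fpkm = sum_fpkm(transcripts, 1)  # grouping by gene_id (test 2)
tss_fpkm = sum_fpkm(transcripts, 2)   # grouping by tss_id (test 4)
cds_fpkm = sum_fpkm(transcripts, 3)   # grouping by p_id (test 3)

print(gene_fpkm)
print(tss_fpkm)
print(cds_fpkm)
```

The same four transcripts give different totals depending on the key, which is why the gene-level, CDS-level, and TSS-level tests can disagree for one locus.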

    My questions are:

    1. How are tests for differential splicing (#5) different from tests for differential coding output (#6)?
    2. How are the tests for differential gene expression summing over gene ids (#2) different from the tests for gene expression summing over CDS ids (#3)?
    3. Tests #5-7 above are testing something fundamentally different than the tests for differential gene expression (tests #1-4). I'd like a good explanation of how these groups of tests differ. E.g. how does #3 (differential expression over CDS) differ from #6 (differential coding output).

    Thanks very much in advance.

  • #2
    Originally posted by turnersd
    My questions are:

    1. How are tests for differential splicing (#5) different from tests for differential coding output (#6)?
    2. How are the tests for differential gene expression summing over gene ids (#2) different from the tests for gene expression summing over CDS ids (#3)?
    3. Tests #5-7 above are testing something fundamentally different than the tests for differential gene expression (tests #1-4). I'd like a good explanation of how these groups of tests differ. E.g. how does #3 (differential expression over CDS) differ from #6 (differential coding output).

    Thanks very much in advance.

    1. Different splice forms may have the same coding sequence. For example, the 5' UTR may be different.

    2. A single gene may produce multiple splice forms containing different coding exons, resulting in different CDS.

    3. Not familiar with these. Although #7 isn't really like the others since it's concerned with the first exon only.


    • #3
      Hey guys,

      Does anyone have a good answer to turnersd's 3rd question? It does seem as if 5-7 are derived from 1-4. And it certainly seems like #3 is pretty much the same as #6.


      • #4
        I believe 1-4 are grouping (or not grouping, in the case of isoforms) transcripts at the level of the gene (2), coding sequence (3), and transcription start site use (4), and testing for differential expression of these groups of transcripts between conditions. And I believe #5-7 are looking at whether there is significantly different TSS usage, or an imbalance in TSS usage overall? Does this make sense?


        • #5
          Yes it makes sense. I'm just noticing there is a lot of overlap, which is good, I get that that allows you to use whichever method one is comfortable with. I guess the main issue I'm having is deciding which method to use. Does anyone know of any papers that actually use cuffdiff (that aren't published in PLOS One)?


          • #6
            All of the *_exp.diff files cuffdiff produces are literally differential expression outputs. The remaining outputs test, for each pairwise comparison, whether there is a significant change in the balance of expression at a locus. Note that those files only include results for genes that have at least 2 splice variants.

            For example, say a gene has two isoforms A and B and you have two conditions 1 and 2. In condition 1 the balance of expression at those isoforms is A=0.3 and B=0.7 where those sum up to 1 or 100% of the expression at that locus. Say cuffdiff finds that in condition 2 the balance has changed such that A=0.8 and B=0.2. Depending on the variability across replicates, of course, that change may end up being reported statistically significant. This result would be found in the splicing.diff file.
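The A/B example above can be turned into a number. The sketch below computes the Jensen-Shannon (JS) divergence between the two isoform balances and its square root, on the assumption that this is the kind of quantity reported in the sqrt(JS) column of splicing.diff. It is a hand-rolled illustration, not cuffdiff's code: it ignores replicate variability and says nothing about significance.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in bits)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2


# Isoform balances from the example: [A, B] in each condition.
cond1 = [0.3, 0.7]
cond2 = [0.8, 0.2]

js = js_divergence(cond1, cond2)
js_dist = math.sqrt(js)

print(f"JS divergence: {js:.4f}")
print(f"sqrt(JS):      {js_dist:.4f}")
```

With base-2 logarithms the divergence lies between 0 (identical balance) and 1 (no isoform shared), so a larger value means a bigger shift in balance between conditions.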

            You can apply the same thinking to make sense of the cds.diff and promoters.diff files. Again, each of these files only tests genes with more than one isoform.

            *Edit*

            I guess you can think of those three files as a more general result than the corresponding *_exp.diff files. Since the *_exp.diff files specifically test each gene/cds/isoform/tss, they don't give you a number that tells you if there's an overall change in the expression across a locus. Whether or not these files are interesting to you probably just depends on what it is you're looking for. If you're generally interested in genes that aren't necessarily differentially expressed but might be producing different amounts of the different proteins they code for, then you might start with the cds.diff file for your gene list.
            Last edited by sdriscoll; 04-23-2012, 04:30 PM.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */


            • #7
              Hey guys,

              So I've been playing with this stuff a bit more, and I was hoping you guys could shed some light on this.

              So the splicing.diff file and the tss_group file are exactly the same, except that splicing.diff reports a Jensen-Shannon value and the tss_group file reports a p-value. No idea why the authors included both. But much more importantly, why is it even called splicing.diff? Both splicing.diff and tss_group measure differences between samples based on transcription start site, so shouldn't that actually be differential promoter use?


              • #8
                So splicing.diff, cds.diff and promoters.diff measure something different from each of the *_exp.diff files. Instead of differential expression, they measure the significance of the difference in expression balance at any given locus. Also, splicing.diff is telling us something about differential splicing even between multiple isoforms that have the same promoter. Therefore promoters.diff is more generalized than splicing.diff.

                I can think of a possible example that makes these files make more sense. Say I have two samples, A and B, and I'm wondering if sample B tends to have different promoter usage than sample A. I could figure this out based on the output of isoform_exp.diff or tss_group_exp.diff, but the file promoters.diff tells me this directly. We get a p-value telling us if sample B has significantly different promoter usage at any gene locus relative to sample A.
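A minimal sketch of what "different promoter usage" means, assuming it refers to the share of a gene's expression coming from each TSS group. The TSS ids and FPKM values are invented, and `shares` is a hypothetical helper; the actual test in promoters.diff also accounts for variability, which this does not.

```python
def shares(fpkm_by_group):
    """Convert per-group FPKM into fractions of the gene's total expression."""
    total = sum(fpkm_by_group.values())
    if total == 0:
        return {group: 0.0 for group in fpkm_by_group}
    return {group: value / total for group, value in fpkm_by_group.items()}


# Invented FPKM per TSS group for one gene in samples A and B.
sample_a = {"TSS1": 30.0, "TSS2": 10.0}
sample_b = {"TSS1": 10.0, "TSS2": 30.0}

shares_a = shares(sample_a)
shares_b = shares(sample_b)

# Change in each promoter's share of the gene's output, B relative to A.
shift = {tss: shares_b[tss] - shares_a[tss] for tss in shares_a}

print("gene total A:", sum(sample_a.values()))
print("gene total B:", sum(sample_b.values()))
print("shift in promoter share:", shift)
```

Here the gene total is the same in both samples, so a gene-level expression test would see no change, while the promoter shares have swapped.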

                The same scenario could come up for coding sequence. Is sample B producing significantly different proteins relative to sample A? The cds.diff file gives you that estimation. Now you don't have to parse the isoform_exp.diff file and figure out which ones are differentially expressed and which ones have CDS regions, etc.

                As for splicing.diff, this file gives you a general measure of differential splicing between samples. So in sample B, is there a significantly different balance of expression across isoforms for any gene locus relative to sample A? This isn't as specific as asking, "which isoforms are differentially expressed"; it's just a general measurement. In other words, you can very quickly have a gene list for those genes that seem to be differentially spliced in sample B relative to sample A. Again, you could probably build this list by parsing isoform_exp.diff, but you'd have to filter out single-isoform genes and you'd also be buried in a file with 90,000 rows instead of one that's already summarized into 30,000 loci (or fewer).
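The manual route described above (parse the isoform-level file, drop single-isoform genes, keep genes with a significant isoform) might look roughly like this. The column names `test_id`, `gene_id`, and `significant` and the "yes"/"no" values are assumptions about the file layout, so check the header of your own output; the rows are invented and `candidate_genes` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented rows mimicking a tab-delimited isoform-level results file.
# Column names are assumptions; verify them against your own file's header.
SAMPLE = (
    "test_id\tgene_id\tsignificant\n"
    "T1\tG1\tyes\n"
    "T2\tG1\tno\n"
    "T3\tG2\tyes\n"
    "T4\tG3\tno\n"
    "T5\tG3\tno\n"
)


def candidate_genes(handle):
    """Return genes with more than one isoform and at least one significant isoform."""
    flags_by_gene = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        flags_by_gene[row["gene_id"]].append(row["significant"] == "yes")
    return sorted(
        gene
        for gene, flags in flags_by_gene.items()
        if len(flags) > 1 and any(flags)
    )


genes = candidate_genes(io.StringIO(SAMPLE))
print(genes)
```

To run this on a real file, pass an open file handle instead of the `io.StringIO` wrapper. A list built this way is cruder than splicing.diff: an isoform can be differentially expressed simply because the whole gene went up or down, without any shift in the balance between isoforms.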

                Does that make sense?
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */


                • #9
                  splicing file from tophat

                  So in my case I looked at splicing.diff, and there is no significant hit.
                  When I dug into isoform_exp.diff, there are isoforms that are differentially expressed. If one gene_id appears twice or more, does it mean a splicing event occurred in that gene?
                  Thanks
