Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Inconsistency between cuffdiff 1.1.0 and cuffdiff 1.0.2

    Hello everybody,
    I'm writing here because I've noticed that cuffdiff 1.1.0 and 1.0.2 give really different results when it comes to DE testing.

    This is what I get with 1.1.0
    Code:
    cuffdiff -p 4 -o . -N --emit-count-tables --frag-bias-correct /disk1/tl344/rep/iGenome/Mus_musculus/Ensembl/NCBIM37/Sequence/WholeGenomeFasta/genome.fa -M /disk1/tl344/rep/UCSC.chrM.renamed.gtf /disk1/tl344/cuff/illumina/cuffcompare/cuffcompare.combined.gtf /disk1/tl344/mapped_reads/sample1_properlyPaired2.bam /disk1/tl344/mapped_reads/sample2_properlyPaired2.bam/disk1/tl344/mapped_reads/sample3_properlyPaired2.bam &
    
    tl344@ram--bio:~/cuff/illumina/cuffdiff$ awk '$5=="q1" && $6=="q2"' gene_exp.diff|cut -f7 |sort |uniq -c
    434 FAIL
    1 HIDATA
    8338 LOWDATA
    34844 NOTEST
    4474 OK
    
    tl344@ram--bio:~/cuff/illumina/cuffdiff$ awk '$5=="q1" && $6=="q2" && $14=="yes"' gene_exp.diff |wc -l
    175
    While this is what I get with 1.0.2:
    Code:
    tl344@ram--bio:~/cuff/illumina/cuffdiff_1.0.2$ awk '$5=="q1" && $6=="q2"' gene_exp.diff|cut -f7 |sort |uniq -c
    379 FAIL
    32 LOWDATA
    16460 NOTEST
    31220 OK
    
    tl344@ram--bio:~/cuff/illumina/cuffdiff_1.0.2$ awk '$5=="q1" && $6=="q2" && $14=="yes"' gene_exp.diff |wc -l
    1534
    All default options for the two versions appear to be the same, the command line options are exactly the same and the files (gtf, fa, bam) are the same.

    As you can see, the number of significant genes changes quite a lot.
    I think this might be because of the fact that (from the 1.1.0 release notes)
    Cuffdiff now includes a more sophisticated check for sufficient sequencing depth prior to testing for differences, which substantially improves the accuracy of differential expression analysis in loci with low to medium depth.
    but still I wouldn't expect such a big difference.
    Is there anyone who had similar problems? Any idea if this is expected behaviour or if there's something wrong?

    Many thanks in advance for any feedback!
    tom

  • #2
    Hmm, I just saw a reply from Cole on another thread that might be relevant:

    Originally posted by Cole Trapnell View Post
    A bit more on what FAIL means, and how it can happen. We use FAIL for genes that actually throw a numerical exception during isoform abundance calculation. In Cufflinks and Cuffdiff, there's a couple of calculations that require us to build matrices with either a row per transcript and a column per read (more or less) or a square matrix with a row and column for each transcript. Some of these matrices need to be invertible or positive definite or have other properties in order for the next steps in the algorithm to succeed. However, sometimes (due to things like round-off error) they aren't. Other times, missing data causes trouble. Oddly enough, this is actually more likely to happen the more reads you get overall, because you can see that isoforms are present, but you don't actually have enough data to calculate those abundances. This is the effect you were observing above. So since we can't be sure about the values (and in fact, were we to go ahead and do the calculation anyways, they could be *wildly* off in theory, or even negative), we set them to zero and move on.

    In order to make differential expression estimates more conservative, version 1.1.0 really ramped up the checks that are done before these steps so we don't end up reporting false positives that are due to numerical exceptions. However, users (like yourselves) have been pretty frustrated by those changes, so I've spent the last few weeks going back and streamlining the overall algorithm to actually eliminate pieces that require the matrices to have some of those properties. The main offender was our "importance sampling" procedure, which tries to give us a sense (for each gene) for the accuracy for the maximum likelihood estimate of isoform abundances. This procedure was originally meant to improve the robustness of the estimate when one or more isoforms were close to zero, but in practice, we found that it actually hurts as often as it helps. Moreover, this procedure would often FAIL genes, so I removed it altogether. I've compensated on the differential expression side with some other statistical improvements and fixes, and the result is globally more accurate differential analysis (both in terms of fewer FAILs and fewer false positives than 1.1.0).

    The upcoming version 1.2.0 should drastically reduce the number of FAIL genes, though there will still be some. If we can't calculate an MLE to begin with, or if for some reason the confidence interval calculation fails, a gene will be marked as FAIL.

    Hope this sheds light on things.

    Comment


    • #3
      Thank you cw11 for your reply. I've read the post that you linked, and indeed it is relevant. What I see might in fact be a consequence of removing the "importance sampling" procedure in 1.1.0
      I was wondering if this is a problem only for me..

      Comment


      • #4
        I have the same problem. Please read my post here.

        Question on Cuffdiff’s DE results.

        Code:
        Quote:
        cuffdiff -p $NSLOTS -o cuffdiff_v110 Ensembl_59.gtf C_3732_hits.bam,C_6341_hits.bam,C_6356_hits.ba
        m,C_6124_hits.bam,C_60PO_hits.bam HD_4430_hits.bam,HD_4412_hits.bam,HD_4242_hits.bam,HD_4189_hits
        .bam,HD_3584_hits.bam
        Versions
        v091 v101 v102 v103 v110
        OK
        18112 33215 33230 33216 6765
        Fail
        3957 2524 2509 2523 1234
        LOWDATA
        NA 514 514 514 9000
        NOTEST
        29582 15408 15408 15408 34634
        Total
        51651 51661 51661 51661 51633


        We wouldn’t expect such a big difference[1] between v110 and v101(v102,v103)especially in the numbers of NOTEST and OK. Cole Trapnell explained what does “Fail” mean [2], we would like to know what “NOTEST” means here.

        All default options for the two versions appear to be the same; the command line options are exactly the same when we use different versions.
        The cuffdiff v1.1.0 indicated.
        “Cuffdiff now includes a more sophisticated check for sufficient sequencing depth prior to testingfor differences, which substantially improves the accuracy of differential expression analysis in loci with low to medium depth.”


        Question on Cuffdiff’s q-value.

        What is q-value here? Lots of evidence suggests q-value should be less than one [4], [5].


        Reference:
        (1) Inconsistency between cuffdiff 1.1.0 and cuffdiff 1.0.2

        (2) Cole Trapnell on “Fail” test

        (3) Cuffdiff producing q-values > 1
        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

        (4) Understanding q-values as a More Intuitive Alternative to p-values

        (5) P-values, False Discovery Rate (FDR) and q-values What are p-values?

        (6) Cuffdiff producing q-values > 1
        http://seqanswers.com/forums/showthread.php?t=13764

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Essential Discoveries and Tools in Epitranscriptomics
          by seqadmin




          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
          04-22-2024, 07:01 AM
        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Yesterday, 11:49 AM
        0 responses
        15 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-24-2024, 08:47 AM
        0 responses
        16 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        61 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        60 views
        0 likes
        Last Post seqadmin  
        Working...
        X