  • New differential testing of cuffdiff/cufflinks since 1.3.0

    Hi all,

    I have been trying out the new 1.3 version of the cufflinks package to see if the testing for differential splicing/expression etc is improved.

    Now I expected a more stringent analysis, as announced, but I get mostly failed tests.

    I used the same reads.bam and annotation and the same parameters (5 biological replicates and 2 groups, 2x 100 bp PE reads, ~140 million per sample).

    This is my result with cuffdiff v1.0.3 for diff. promoter usage:

    191 FAIL
    13713 NOTEST
    4745 OK

    with 1403 Genes found with sign. diff. promoter usage.

    Now this is the cuffdiff v 1.3 result:

    5719 FAIL
    35 LOWDATA
    12301 NOTEST
    594 OK

    with 20 Genes found with sign. diff. promoter usage.

    It seems that the number of false positives is really reduced. The new version finds only ~3.5% of the genes tested to be differential. However, so many fail to be tested in the first place that the result is not really usable.

    Did anyone experience something similar? Is this failing of testing maybe related to my large dataset or the replicates?

    All the best,
    Sebastian
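The status counts quoted above can be reproduced for any of the cuffdiff `*.diff` files with a small tally. This is a minimal sketch, assuming the file is tab-separated and has a header column named `status`; the function name `status_counts` and the example path are made up for illustration.

```shell
# Tally the status column (OK, NOTEST, FAIL, LOWDATA, ...) of a cuffdiff *.diff file.
# The column is located by its header name rather than by a fixed position.
status_counts() {
    awk -F'\t' '
        NR == 1 { for (c = 1; c <= NF; c++) if ($c == "status") col = c; next }
        col     { n[$col]++ }                      # count each status value
        END     { for (s in n) print n[s] "\t" s }
    ' "$1" | sort -k2,2
}

# Hypothetical usage:
#   status_counts cuffdiff_out/promoters.diff
```

Running it on the outputs of both versions gives directly comparable tables, which makes a jump in FAIL easy to spot.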

  • #2
    I am also having the same problem in both splicing.diff and promoters.diff.

    I have 5 samples in each of two groups, but only 25M reads per sample (100bp paired end).

    Promoters

    Cufflinks 1.0.3

    OK 11,309
    NOTEST 8,637
    FAIL 648

    with 7,902 significant

    Cuffdiff 1.3

    OK 1,383
    NOTEST 10,243
    FAIL 8,948

    2 significant

    Splicing

    Cufflinks 1.0.3

    OK 13,996
    NOTEST 51,240
    FAIL 2,933

    8,645 genes with sig. differential splicing

    Cufflinks 1.3

    OK 1,017
    NOTEST 28,032
    FAIL 37,736
    LOWDATA 1,401

    6 genes with sig. differential splicing.

    • #3
      Just to be clear - this is happening only for the splicing.diff, promoters.diff (and maybe cds.diff?) files? Not gene_exp.diff or isoform_exp.diff?

      Can one or both of you send us (to the support list) a gene's worth of reads in one of these loci that are failing? And a snippet of GTF to run it against? The new code uses a sampling-based approach to estimate a null distribution of relative isoform abundances in each condition, rather than an analytic null model based on the gradient. The upside of this approach is that it's more accurate and more conservative - the downside is that the sampling method can fail under some conditions. I haven't seen this happening in any of my datasets (and I have one that looks extremely similar to yours), so I probably can't do anything about it without a small test data set that reproduces the problem. It's possible it's something easy I overlooked and can fix quickly.
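One way to prepare the small test case requested above is to pull the records of a single gene out of the GTF and work out its locus. The sketch below assumes a standard 9-column, tab-separated GTF whose attribute field contains `gene_id "..."`; the function names and file names are hypothetical. The `samtools` call in the comment is not part of this thread; it is a common way to cut the matching reads from an indexed BAM.

```shell
# Print all GTF records belonging to one gene.
# gene_gtf_snippet <annotation.gtf> <gene_id>
gene_gtf_snippet() {
    awk -F'\t' -v id="$2" 'index($9, "gene_id \"" id "\";")' "$1"
}

# Print the locus of one gene as chr:start-end (smallest start, largest end).
# gene_locus <annotation.gtf> <gene_id>
gene_locus() {
    awk -F'\t' -v id="$2" '
        index($9, "gene_id \"" id "\";") {
            s = $4 + 0; e = $5 + 0; chr = $1
            if (!seen || s < min) min = s
            if (!seen || e > max) max = e
            seen = 1
        }
        END { if (seen) printf "%s:%d-%d\n", chr, min, max }
    ' "$1"
}

# Hypothetical usage (requires samtools and a BAM index):
#   gene_gtf_snippet merged.gtf XLOC_000123 > gene.gtf
#   samtools view -b reads.bam "$(gene_locus merged.gtf XLOC_000123)" > gene_reads.bam
```

The sketch assumes all records of the gene lie on one chromosome; a gene ID reused across chromosomes would need extra handling.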

      • #4
        Just came across this. Anecdotally, a colleague here has seen something similar, switching from the old pipeline (using 1.0.3, with a workflow similar to Jeremy's Galaxy exercise - tophat, cufflinks -g, cuffcompare -R, cuffdiff -N) to the new pipeline (using v1.3 from the protocols paper - tophat -G [gtf], cufflinks (no RABT), cuffmerge -g (RABT), cuffdiff -u -b). Unfortunately I don't have example data to send. Just wondering if you guys or others were able to figure out what was happening.
        Last edited by turnersd; 04-04-2012, 04:31 AM. Reason: link to protocols paper

        • #5
          Dear all,
          I am also using the Cufflinks package for our RNA-seq analysis, and I have run into the same problem.

          On examining the cuffdiff results, we noticed a striking proportion of transcripts (38%) and genes (25%) with status FAIL. We find this result very hard to justify. How can we lose 25% of the examined elements?

          To reduce the number of FAILs, we ran several tests under different conditions.

          1) We have three samples, each with three biological replicates. Plot1 (see Plot1 attached) shows the number of FAILs (cuffdiff v1.3.0) obtained with 1, 2 or 3 biological replicates.
          As you can see, the number of FAIL elements (FAIL in the tracking file) increases dramatically with the number of replicates.

          2) As DerSeb showed previously, I also tested cuffdiff's behavior across different versions, using all three replicates.
          The plot (see Plot2 attached) clearly shows very different results. From cuffdiff 1.2.0 onwards, the number of FAILs gets remarkably worse.

          From what I can see, there is no improvement in FAILs with the 1.2.0 version, contrary to what they say. Essentially, the number of FAILs increases with the number of biological replicates and with Cufflinks versions after v1.1.0.

          Has anybody found a solution for this issue? Could anybody explain these results?

          Thanks
          Francesco

          • #6
            Hi francicco,

            We've fixed this issue in the upcoming release of Cufflinks 1.4.0, which is right around the corner. We've been a little swamped with the release of TopHat 2 and other items, but we're working hard to get this out because I know several groups have run into this. The explanation of what was going on is a bit complicated, but we were able to reproduce the issue on one of our test sets, and came up with a nice fix for it. The newest version produces a handful of FAIL genes at most, and when we've looked at those, the genes are ones where Cuffdiff has flagged a genuine structural problem that prevents us from calling gene expression.

            • #7
              Dear Cole,

              Thank you for your rapid answer! Do you know when the new version will be publicly available?

              I would be happy to test the new version on my data. Do you think that would be possible?

              Cheers
              Francesco

              • #8
                Hi,

                I am having a similar issue. I am running the Tophat/Cufflinks pipeline.

                I have two groups of individuals (5 each), test and control, with two tissue samples from each individual.

                Cuffdiff gives only one DE gene and one differentially spliced gene. I do get about 400 with either DESeq or edgeR, and about 800 hits for differential exon usage.

                I am running the latest versions of everything (although the data were mapped with an older Tophat release, I think 1.0.3).

                Can I use the older version (1.0.3) of Cuffdiff with the files produced by cuffmerge and cufflinks 1.3.0?

                Thanks

                • #9
                  Originally posted by gcoppola
                  Can I use the older version (1.0.3) of Cuffdiff with the files produced by cuffmerge and cufflinks 1.3.0?
                  Thanks

                  That is also what I'm doing, can somebody say if that is correct?
                  Cheers
                  F

                  • #10
                    Hi guys,

                    Has anyone tried this with Cufflinks 2.0? Is the problem resolved? I also have approximately 40% of my genes as NOTEST right now with the old cufflinks.

                    • #11
                      I have another issue with the new Cufflinks 2:

                      When I quantify directly against the Ensembl GTF, Cufflinks 2 returns 0 expression for most genes. This only occurs when I use replicates; a single sample per group is fine. It also seems to happen only for transcripts that match a known gene's annotation.


                      #command:
                      cufflinks-2.0.0.Linux_x86_64/cuffdiff -p 8 -L P1,P2 -c 1 -b anFam2.fa -o cuffdiff.P1.P2.ensembl canFam2.67.gtf TOPHAT2.C1.bam,TOPHAT2.C2.bam,TOPHAT2.C3.bam,TOPHAT2.C4.bam TOPHAT2.C5.bam,TOPHAT2.C6.bam,TOPHAT2.C7.bam,TOPHAT2.C8.bam,TOPHAT2.C9.bam

                      Here is the number of genes with an FPKM of 0 in the cuffdiff output:

                      $8 is for treatment P1, $9 is for treatment P2 in output.

                      awk ' $8 ==0 { i++}; END {print i " of " NR " = " i/NR*100 "%"} ' cuffdiff.P1.P2.ensembl/gene_exp.diff

                      cufflinks-2.0.0:

                      P1: 24649 of 24661 = 99.9513%
                      P2: 24645 of 24661 = 99.9351%

                      cufflinks 1.3.0 seems right:

                      P1: 6345 of 24661 = 25.7289%
                      P2: 6564 of 24661 = 26.6169%

                      Now I am using edgeR and DESeq to identify DE genes, and I use the cuffdiff (v1.3.0) results (p-value, FC >= 1.5) as additional evidence in filtering.

                      But it seems edgeR and DESeq only work at the gene level and cannot do isoform-level analysis.
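The awk one-liner above counts the header row in `NR` and has to be edited by hand to switch between `$8` and `$9`. A slightly more defensive variant is sketched below. It assumes the header names the two FPKM columns `value_1` and `value_2`; the function name `zero_fpkm` is hypothetical. Percentages computed this way can differ marginally from the figures quoted above because the header row is excluded from the total.

```shell
# Report, for both conditions, how many genes have an FPKM of exactly 0.
# Columns are found via the header names value_1 and value_2; the header row is not counted.
zero_fpkm() {
    awk -F'\t' '
        NR == 1 {
            for (c = 1; c <= NF; c++) { if ($c == "value_1") a = c; if ($c == "value_2") b = c }
            if (!a || !b) { bad = 1; exit 1 }      # expected columns are missing
            next
        }
        { n++; z1 += ($a + 0 == 0); z2 += ($b + 0 == 0) }
        END {
            if (bad || n == 0) exit 1
            printf "value_1: %d of %d = %.2f%%\n", z1, n, 100 * z1 / n
            printf "value_2: %d of %d = %.2f%%\n", z2, n, 100 * z2 / n
        }
    ' "$1"
}

# Usage with the output directory from the command above:
#   zero_fpkm cuffdiff.P1.P2.ensembl/gene_exp.diff
```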

                      • #12
                        I personally do not trust cufflinks 2 results. For instance, it gives 0 FPKM to transcripts that are clearly expressed.

                        Developers need to do something, sooner or later...

                        • #13
                          I have a similar issue. When I add more replicates the number of sig. genes goes down drastically. Finally after much searching I discovered that the number of FAIL in gene_exp.diff increases with more replicates. I reran everything with tophat2 and cufflinks2 and the results now are 0 sig. genes with all replicates, which it shouldn't be. When I look at the gene_exp.diff file I see that the big majority of status messages was not FAIL this time, but NOTEST.

                          Here are some statistics to back up my statement.

                          2+2 replicates (cufflinks 1.3.0)

                          NOTEST 8130
                          OK 34495
                          FAIL 271

                          3+3 replicates (cufflinks 1.3.0)

                          NOTEST 8271
                          OK 29908
                          FAIL 4887

                          4+4 replicates (cufflinks 1.3.0)

                          NOTEST 8645
                          OK 25996
                          FAIL 8823

                          Notice how the status FAIL increases here with more replicates.

                          Below are the statistics from the cufflinks2 runs, with a very large number of NOTEST resulting in 0 sig. genes.

                          4+4 replicates (cufflinks 2)
                          NOTEST 35560
                          OK 9142
                          FAIL 9

                          7+8 replicates (cufflinks 2)
                          NOTEST 38875
                          OK 6269
                          FAIL 0

                          7+8 replicates (cufflinks 2) but without frag-bias-correct, upper-quartile-norm and multiread-correct in the cuffdiff run
                          NOTEST 17534
                          OK 27558
                          FAIL 52


                          I would very much like to know the reason for this and whether I can correct it somehow.
                          Last edited by glados; 05-30-2012, 06:42 AM.

                          • #14
                            We eventually came to the conclusion that the original problem in Cufflinks 1.3 was being caused by excessive variance between our samples. As more samples were added, the variance was getting bigger - this is why we only saw the problems in datasets with large numbers of samples. This made biological sense for us: our samples were from different patients with each patient given a before and after treatment sample.

                            In cufflinks 2, the large variance no longer caused the model to fall over, but it didn't find any significant genes: presumably because the variances were so large (which can be seen in the confidence limits on the FPKM estimation). We didn't see the large number of NOTESTs though.

                            • #15
                              Originally posted by glados
                              I have a similar issue. When I add more replicates the number of sig. genes goes down drastically. Finally after much searching I discovered that the number of FAIL in gene_exp.diff increases with more replicates. I reran everything with tophat2 and cufflinks2 and the results now are 0 sig. genes with all replicates, which it shouldn't be. When I look at the gene_exp.diff file I see that the big majority of status messages was not FAIL this time, but NOTEST.

                              Can you try re-running this analysis with --min-outlier-p 0 to see if it's the inline model checking that's causing the increase in NOTESTs?
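To judge whether such a re-run changes anything, the per-gene status of the two runs can be cross-tabulated. The sketch below makes no assumptions about the `--min-outlier-p` option beyond what is said above; it only compares two existing `gene_exp.diff` files, assuming both are tab-separated with header columns `test_id` and `status`. The function name and directory names are hypothetical.

```shell
# Cross-tabulate the per-test status between two cuffdiff runs.
# status_transitions <run_a/gene_exp.diff> <run_b/gene_exp.diff>
status_transitions() {
    awk -F'\t' '
        FNR == 1 {                                  # header of each file: locate the columns
            id = col = 0
            for (c = 1; c <= NF; c++) { if ($c == "test_id") id = c; if ($c == "status") col = c }
            next
        }
        FNR == NR      { first[$id] = $col; next }  # first file: remember status per test_id
        ($id in first) { n[first[$id] " -> " $col]++ }  # second file: count transitions
        END { for (t in n) print n[t] "\t" t }
    ' "$1" "$2" | sort -k2
}

# Hypothetical usage:
#   status_transitions run_default/gene_exp.diff run_min_outlier_p_0/gene_exp.diff
```

A large `NOTEST -> OK` count in the output would point at the model checking as the source of the extra NOTESTs.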
