  • Cole Trapnell
    Senior Member
    • Nov 2008
    • 213

    Differential expression, splicing, and promoter use with Cufflinks

    We are happy to announce a major update to Cufflinks that introduces some powerful new features and includes a number of performance improvements and bug fixes. Highlights include:
    • Cufflinks now includes a new tool, "Cuffdiff", which performs testing for differential expression, splicing, promoter use, and coding sequence output on two or more RNA-Seq samples. See the greatly expanded manual for details.
    • Cuffcompare now reports a file containing the "union" of all transfrags in the files you give it as input, greatly simplifying downstream validation of novel transcripts.
    • Cufflinks' assembler has been overhauled and optimized, resulting in a speedup of 4-5 times over version 0.7.0, and a greatly reduced memory footprint. Phasing of splicing events has also been improved.
    • Many bugfixes, including for a number of bugs reported by users in these forums.


    We hope you find the new functionality useful, and continue to report bugs and feature requests.
  • Xi Wang
    Senior Member
    • Oct 2009
    • 317

    #2
    Thanks, Cole.

    I am not sure I fully understood from the manual how to give a GTF annotation to Cuffdiff.
    First, is matching of tss_id and p_id required? If not, how does the program know which TSS corresponds to a transcript?
    Second, if the TSS of a transcript or primary transcript is unknown, the program will skip this transcript and won't test for differences in promoter use, right?
    Moreover, is it possible to infer the TSS from RNA-seq data?

    Many thanks.
    Xi Wang


    • chapmandu2
      Junior Member
      • Apr 2009
      • 4

      #3
      ABI Solid?

      Hi Cole,

      Thanks for the new release. It looks really comprehensive, and I look forward to trying it on my Illumina datasets. Do you have any plans to include ABI Solid support in TopHat and Cufflinks, especially now that bowtie supports colourspace?

      Many thanks.


      • Cole Trapnell
        Senior Member
        • Nov 2008
        • 213

        #4
        Originally posted by Xi Wang
        Without tss_id and p_id attributes, Cufflinks will simply test for differential expression of transcripts and genes. You can attach these attributes to your own GTF file, but for convenience, cuffcompare now outputs a single file containing the "union" of all the assembled transfrags you give it. So the basic workflow we recommend is:

        1) Assemble each sample with cufflinks
        2) Run cuffcompare on the sample transfrags all at the same time, providing a reference annotation if you want to classify your transfrags according to known, novel, etc.
        3) Give the stdout.combined.gtf to cuffdiff, along with your original SAM alignments from the samples. Cuffdiff will re-estimate the abundances of the transfrags in the GTF using the alignments in each sample, and do the differential expression testing at the same time.
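
The three steps above can be sketched as a small Python helper that only builds the command lines and does not run anything. This is a rough sketch under assumptions: `build_workflow` and the file paths are hypothetical, the commands are shown without any options other than the `-r` flag mentioned in this thread, and you should check the manual for the exact usage of your version.

```python
from typing import List


def build_workflow(sample_sams: List[str],
                   sample_gtfs: List[str],
                   reference_gtf: str = "") -> List[List[str]]:
    """Return argument lists for the three steps, without executing them.

    sample_sams:   one SAM alignment file per sample
    sample_gtfs:   the transfrag GTF that cufflinks produced for each sample
                   (hypothetical paths supplied by the caller)
    reference_gtf: optional reference annotation for cuffcompare -r
    """
    if len(sample_sams) != len(sample_gtfs):
        raise ValueError("need exactly one transfrag GTF per SAM file")

    commands: List[List[str]] = []

    # Step 1: assemble each sample separately.
    for sam in sample_sams:
        commands.append(["cufflinks", sam])

    # Step 2: compare all sample assemblies in a single cuffcompare run.
    compare = ["cuffcompare"]
    if reference_gtf:
        compare += ["-r", reference_gtf]
    commands.append(compare + list(sample_gtfs))

    # Step 3: give the combined GTF plus the original alignments to cuffdiff.
    commands.append(["cuffdiff", "stdout.combined.gtf"] + list(sample_sams))
    return commands


if __name__ == "__main__":
    for cmd in build_workflow(["s1.sam", "s2.sam"],
                              ["s1/transcripts.gtf", "s2/transcripts.gtf"],
                              "reference.gtf"):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` once the arguments have been checked against the manual.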

        Optionally, you may wish to clean up the stdout.combined.gtf before running cuffdiff, to remove partial transfrags that resulted from low depth of sequencing coverage in one of the samples. We like to perform differential testing only on transcripts that are either already known to annotation or that we've assembled in two different samples independently.
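
The clean-up rule described above (keep a transfrag if it is known to the annotation or was assembled in at least two samples) could be expressed roughly as follows. This is only a sketch: `keep_transfrag` is a hypothetical helper, and it assumes that a class code of "=" marks a match to a reference transcript and that a sample column of "-" means the transfrag was not assembled in that sample. Verify both against the tracking file your version produces.

```python
from typing import Iterable


def keep_transfrag(class_code: str, sample_columns: Iterable[str]) -> bool:
    """Decide whether a transfrag should be kept for differential testing.

    class_code:     cuffcompare class code for the transfrag
                    (assumption: "=" means it matches a reference transcript)
    sample_columns: one entry per sample
                    (assumption: "-" or empty means absent in that sample)
    """
    samples_present = sum(
        1 for column in sample_columns if column.strip() not in ("", "-")
    )
    is_known = class_code == "="
    return is_known or samples_present >= 2


if __name__ == "__main__":
    print(keep_transfrag("=", ["q1:geneA|tx1|100|5.0|4.0|6.0|3.0", "-"]))
    print(keep_transfrag("u", ["q1:geneB|tx2|100|5.0|4.0|6.0|3.0", "-"]))
```

The IDs that pass this test can then be used to filter the records of stdout.combined.gtf before running cuffdiff.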

        As far as how cuffcompare assigns p_id and tss_id attributes:

        * p_id is assigned just using the CDS records in the reference GTF. If there are no CDS records, there will be no p_ids. Similarly, if you run cuffcompare without a reference annotation along with your sample assemblies, there will be no p_id attributes in stdout.combined.gtf
        * tss_id is assigned based on where the 5' ends of the transfrags are: two transcripts on the same strand that share bases have the same TSS iff their 5' ends start within 100bp of each other. This threshold was chosen based on our observation that depth of sequencing doesn't always reach the end of the true transcript on either end. You can change it with the -d option (which I just realized is not listed in the manual - I will update it).
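
To make the tss_id rule concrete, here is a minimal Python sketch of the grouping as described in the bullet above. It is not cuffcompare's actual code. Assumptions: `Transcript` and `assign_tss_ids` are hypothetical names, "share bases" is approximated by overlap of the transcript spans (introns are ignored), "within 100bp" is taken as a distance of at most 100, and transcripts linked through a chain of pairwise matches are merged into one group.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transcript:
    name: str
    strand: str   # "+" or "-"
    start: int    # leftmost genomic coordinate
    end: int      # rightmost genomic coordinate


def five_prime(t: Transcript) -> int:
    """The 5' end is the left end on '+' and the right end on '-'."""
    return t.start if t.strand == "+" else t.end


def same_tss(a: Transcript, b: Transcript, max_dist: int = 100) -> bool:
    """Pairwise rule: same strand, overlapping spans, 5' ends close together."""
    overlapping = a.start <= b.end and b.start <= a.end
    close = abs(five_prime(a) - five_prime(b)) <= max_dist
    return a.strand == b.strand and overlapping and close


def assign_tss_ids(transcripts: List[Transcript],
                   max_dist: int = 100) -> Dict[str, str]:
    """Group transcripts with union-find and label the groups TSS1, TSS2, ..."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if same_tss(transcripts[i], transcripts[j], max_dist):
                parent[find(j)] = find(i)

    labels: Dict[int, str] = {}
    result: Dict[str, str] = {}
    for i, t in enumerate(transcripts):
        root = find(i)
        if root not in labels:
            labels[root] = f"TSS{len(labels) + 1}"
        result[t.name] = labels[root]
    return result
```

The `max_dist` parameter plays the role of the threshold that the -d option changes.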

        All this is to say that if you're hoping to just use a reference GTF with cuffdiff, you'll need to add those p_id and tss_id attributes yourself. You can do this with cuffcompare too, using a little hack:

        cuffcompare -r reference.gtf reference.gtf reference.gtf

        This will spit out a version of reference.gtf in stdout.combined.gtf that has the p_id and tss_id attributes attached.
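
Before handing the resulting file to cuffdiff, it can be worth checking which transcripts actually received both attributes. The following Python sketch does that under the assumption that the file is standard tab-separated GTF with `key "value";` pairs in the ninth column. `missing_ids` is a hypothetical helper, not part of Cufflinks.

```python
import re
from typing import Dict, Iterable, Set

# Matches GTF attribute pairs such as: transcript_id "TCONS_00000001";
ATTRIBUTE_RE = re.compile(r'(\w+)\s+"([^"]*)"')
WANTED = ("p_id", "tss_id")


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTRIBUTE_RE.findall(field))


def missing_ids(gtf_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each transcript_id to the wanted attributes it never carries.

    Transcripts that have both p_id and tss_id are left out of the result.
    """
    found: Dict[str, Set[str]] = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        attributes = parse_attributes(columns[8])
        transcript_id = attributes.get("transcript_id")
        if transcript_id is None:
            continue
        seen = found.setdefault(transcript_id, set())
        seen.update(key for key in WANTED if key in attributes)
    return {
        tid: set(WANTED) - seen
        for tid, seen in found.items()
        if set(WANTED) - seen
    }
```

A typical use would be `missing_ids(open("stdout.combined.gtf"))`, which returns an empty dict when every transcript has both attributes.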
        Last edited by Cole Trapnell; 02-09-2010, 09:59 PM.


        • Cole Trapnell
          Senior Member
          • Nov 2008
          • 213

          #5
          Originally posted by chapmandu2
          Cufflinks should *in theory* already support Colorspace, since it takes SAM input, and doesn't call expressed SNPs by itself (yet). TopHat will hopefully support Colorspace sometime this spring. I've got a number of other features in TopHat and Cufflinks I need to get to, and I have to finish my thesis and graduate - so I can't give a timeline. However, it's an often requested feature, so I'd like to add support.


          • Kasycas
            Member
            • Sep 2009
            • 22

            #6
            Hi Cole,

            Thanks for the new release! I've been trying to use cuffdiff as described above. It runs for a while and then terminates as follows:

            Importance sampling posterior distribution
            isoform TCONS_00000803 has no p_id, no CDS grouping analysis available here
            Quantitating samples in locus [ chr1:152014391-152019257 ]
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            Calculating intial MLE
            Tossing likely garbage isoforms
            Revising MLE
            Importance sampling posterior distribution
            isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here
            terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
            what(): Error in function boost::math::cdf(const normal_distribution<d>&, d): Random variate x is nan, but must be finite!
            Aborted





            I don't think this is caused by the lack of a p_id, since the same message appears earlier in the run without the program terminating. However, I've tried using cuffdiff on cuffcompare stdout.combined.gtf files that were derived with UCSC annotation AND Ensembl annotation, and both runs terminate after a similar message (isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here).

            Would you know why this is happening?

            Regards,

            Karen


            • Kasycas
              Member
              • Sep 2009
              • 22

              #7
              One more thing, on a slightly separate issue. According to the manual, the cuffcompare output file stdout.tracking should contain:

              Each of the columns after the fifth have the following format:
              qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>


              However, I have four numerical columns after the <FMI>, not three. What does the fourth one relate to?

              Example:
              q1:ENSG00000188076|ENST00000342878|100|12.188023|11.834710|12.541337|11.044084

              Thanks,

              Karen


              • Cole Trapnell
                Senior Member
                • Nov 2008
                • 213

                #8
                Originally posted by Kasycas
                Another user reported this to me a few days ago, and I fixed it yesterday. It's a divide by zero error in the Jensen-Shannon variance calculation. I'll be releasing a fix in a few days. Please sign up for the mailing list if you haven't already - you'll get an email once I make the release.
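
For readers unfamiliar with the Jensen-Shannon calculation, the sketch below shows the divergence between two sets of isoform abundances and one plausible way a zero denominator can arise: normalizing abundances that sum to zero. This is an illustration only. It is not Cuffdiff's code, it computes the divergence rather than its variance, and the actual cause of the bug is not described in this thread. The function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def normalize(abundances: Sequence[float]) -> Optional[List[float]]:
    """Scale abundances to sum to 1, or return None if the total is zero.

    Dividing by a zero total is the kind of step that produces NaN values.
    """
    total = sum(abundances)
    if total <= 0:
        return None
    return [x / total for x in abundances]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits, treating 0 * log(0) as 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(a: Sequence[float], b: Sequence[float]) -> Optional[float]:
    """Jensen-Shannon divergence between two abundance vectors.

    Returns None instead of NaN when either vector has no abundance at all.
    """
    if len(a) != len(b):
        raise ValueError("both samples must list the same isoforms")
    p, q = normalize(a), normalize(b)
    if p is None or q is None:
        return None
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


if __name__ == "__main__":
    print(js_divergence([10.0, 0.0], [0.0, 10.0]))  # complete isoform switch
    print(js_divergence([0.0, 0.0], [3.0, 1.0]))    # unexpressed in sample 1
```

Returning None for an unexpressed locus is a design choice of this sketch; a caller would have to decide whether to skip such loci or report them as untestable.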


                • Cole Trapnell
                  Senior Member
                  • Nov 2008
                  • 213

                  #9
                  Originally posted by Kasycas
                  The last column is the estimated depth of read coverage for that transfrag. Apologies - I will update the manual.
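
With that extra column, one sample entry from stdout.tracking can be split into named fields as in the Python sketch below. It assumes the seven-field layout seen in the example from this thread (the manual's six fields plus coverage). `TrackingEntry` and `parse_tracking_entry` are hypothetical names, not part of Cufflinks.

```python
from typing import NamedTuple


class TrackingEntry(NamedTuple):
    sample: str         # e.g. "q1"
    gene_id: str
    transcript_id: str
    fmi: int
    fpkm: float
    conf_lo: float
    conf_hi: float
    coverage: float     # estimated depth of read coverage


def parse_tracking_entry(entry: str) -> TrackingEntry:
    """Parse one 'qJ:gene|transcript|FMI|FPKM|lo|hi|cov' sample column."""
    sample, separator, rest = entry.partition(":")
    fields = rest.split("|")
    if not separator or len(fields) != 7:
        raise ValueError(f"unexpected tracking entry: {entry!r}")
    gene_id, transcript_id, fmi, fpkm, conf_lo, conf_hi, coverage = fields
    return TrackingEntry(sample, gene_id, transcript_id, int(fmi),
                         float(fpkm), float(conf_lo), float(conf_hi),
                         float(coverage))


if __name__ == "__main__":
    example = ("q1:ENSG00000188076|ENST00000342878|100|"
               "12.188023|11.834710|12.541337|11.044084")
    print(parse_tracking_entry(example))
```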


                  • Lesley
                    Junior Member
                    • Jan 2008
                    • 8

                    #10
                    Originally posted by Cole Trapnell
                    Thanks for the info on the reference GTF. I downloaded both the FASTA and the GTF from Ensembl and ran into the chr problem. However, now when I run cuffcompare on the reference I get tss_ids but no p_ids, even though the original GTF has CDS information.

                    I also got the following warnings when running cuffcompare on the cufflinks output and the fixed GTF file. I guess they have something to do with the cufflinks GTF files, since there are two of them.

                    Warning: found 26695 transcripts with undetermined strand.
                    Warning: found 44851 transcripts with undetermined strand.

                    Cuffcompare then exits.

                    Any help on moving forward with cufflinks will be greatly appreciated.

                    Cheers,
                    Lesley


                    • seqfast
                      Member
                      • Aug 2008
                      • 16

                      #11
                      Error messages

                      already reported ...
                      Last edited by seqfast; 03-03-2010, 07:42 AM.


                      • jebe
                        Junior Member
                        • Nov 2009
                        • 9

                        #12
                        cuffdiff considers only X, Y, and MT loci

                        Hi,

                        I ran tophat using the h_sapiens_37_asm index and converted the chromosome accessions in the accepted_hits.sam file to their corresponding number/letter (1,2,X,Y,MT). I wanted the chromosome notation to match the chromosome notation in the Ensembl GTF file (Homo_sapiens.GRCh37.56.gtf). Next I ran cufflinks on each sample using the converted SAM file output by tophat. Then I ran cuffcompare using the transcripts.gtf files from each sample (output by cufflinks) along with my reference GTF above. Finally, I fed the converted SAM files and the combined.gtf file into cuffdiff. Cuffdiff runs without error; however, it only considers loci on the X, Y and MT chromosomes. Has anyone else experienced this?

                        Thank you in advance for any advice.


                        • Xi Wang
                          Senior Member
                          • Oct 2009
                          • 317

                          #13
                          Originally posted by jebe
                          Did you try converting the chromosome notation in the Ensembl GTF to chr1, chr2, ... chrX, chrY, and chrM? I think converting in this direction is much better.
                          Xi Wang
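
A minimal Python sketch of that conversion is given below. It renames only the numbered chromosomes, X, Y and MT, and leaves any other sequence names (for example unplaced scaffolds) unchanged, since their equivalents are not discussed in this thread. `to_ucsc_name` and `rename_gtf_line` are hypothetical helper names.

```python
def to_ucsc_name(name: str) -> str:
    """Convert 1, 2, ..., X, Y, MT to chr1, chr2, ..., chrX, chrY, chrM."""
    if name == "MT":
        return "chrM"
    if name in ("X", "Y") or name.isdigit():
        return "chr" + name
    return name  # already prefixed, or a name this sketch does not handle


def rename_gtf_line(line: str) -> str:
    """Rewrite the first (sequence name) column of one GTF line."""
    if not line.strip() or line.startswith("#"):
        return line
    chrom, tab, rest = line.partition("\t")
    return to_ucsc_name(chrom) + tab + rest


if __name__ == "__main__":
    import sys
    # Usage: python rename_chroms.py < ensembl.gtf > ucsc_style.gtf
    for gtf_line in sys.stdin:
        sys.stdout.write(rename_gtf_line(gtf_line))
```

The same `to_ucsc_name` function could be applied to the reference name column of a SAM file if you prefer to convert in the other place, but header lines would need separate handling.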


                          • blackgore
                            Member
                            • Sep 2009
                            • 20

                            #14
                            This may be a naive question, as I'm only about to get into using Cufflinks (Bowtie and Tophat seem great though), but I have not been able to find any documentation about differential expression analysis when groups of samples are involved? My question is can you - and therefore how can you - specify that certain samples are replicates, and so be treated as a group when running differential expression analysis?


                            • kmcarr
                              Senior Member
                              • May 2008
                              • 1181

                              #15
                              Originally posted by Cole Trapnell
                              Another user reported this to me a few days ago, and I fixed it yesterday. It's a divide by zero error in the Jensen-Shannon variance calculation. I'll be releasing a fix in a few days. Please sign up for the mailing list if you haven't already - you'll get an email once I make the release.
                              Cole,

                              First, thanks for an excellent software stack.

                              Was the release you are referring to > 0.8.1? I am using 0.8.1 (the latest available on the web site) and am experiencing this problem. It seems that since 0.8.1 was released on 2/13/2010 and you wrote the above on 2/22/2010, the fix would be in a version later than 0.8.1. I hate to be a pest; I have no doubt you are very busy and dealing with (L)users is the last thing you need, but I'm a little stymied by this bug.

                              Thanks again.

                              P.S. Yes, I just subscribed to the mailing list.
