SEQanswers

Old 02-08-2010, 04:59 PM   #1
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default Differential expression, splicing, and promoter use with Cufflinks

We are happy to announce a major update to Cufflinks that introduces some powerful new features and includes a number of performance improvements and bug fixes. Highlights include:
  • Cufflinks now includes a new tool, "Cuffdiff", which performs testing for differential expression, splicing, promoter use, and coding sequence output on two or more RNA-Seq samples. See the greatly expanded manual for details.
  • Cuffcompare now reports a file containing the "union" of all transfrags in the files you give it as input, greatly simplifying downstream validation of novel transcripts.
  • Cufflinks' assembler has been overhauled and optimized, resulting in a speedup of 4-5 times over version 0.7.0, and a greatly reduced memory footprint. Phasing of splicing events has also been improved.
  • Many bugfixes, including for a number of bugs reported by users in these forums.

We hope you find the new functionality useful, and continue to report bugs and feature requests.
Old 02-08-2010, 11:15 PM   #2
Xi Wang
Senior Member
 
Location: MDC, Berlin, Germany

Join Date: Oct 2009
Posts: 317
Default

Thanks, Cole.

I am not sure I fully understood from the manual how to give a GTF annotation to Cuffdiff.
First, is matching of tss_id and p_id required? If not, how does the program know which TSS corresponds to a transcript?
Second, if the TSS of a transcript or primary transcript is unknown, the program will skip this transcript and won't look for differences in promoter use, right?
Moreover, is it possible to infer the TSS from RNA-seq data?

Many thanks.
__________________
Xi Wang
Old 02-09-2010, 03:10 AM   #3
chapmandu2
Junior Member
 
Location: Manchester, UK

Join Date: Apr 2009
Posts: 4
Default ABI Solid?

Hi Cole,

Thanks for the new release. It looks really comprehensive, and I look forward to trying it on my Illumina datasets. Do you have any plans to include ABI Solid support for TopHat and Cufflinks, especially now that bowtie supports colourspace?

Many thanks.
Old 02-09-2010, 09:42 AM   #4
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by Xi Wang View Post
Thanks, Cole.

I am not sure I fully understood from the manual how to give a GTF annotation to Cuffdiff.
First, is matching of tss_id and p_id required? If not, how does the program know which TSS corresponds to a transcript?
Second, if the TSS of a transcript or primary transcript is unknown, the program will skip this transcript and won't look for differences in promoter use, right?
Moreover, is it possible to infer the TSS from RNA-seq data?

Many thanks.
Without tss_id and p_id attributes, Cufflinks will simply test for differential expression of transcripts and genes. You can attach these attributes to your own GTF file, but for convenience, cuffcompare now outputs a single file containing the "union" of all the assembled transfrags you give it. So the basic workflow we recommend is:

1) Assemble each sample with cufflinks
2) Run cuffcompare on the sample transfrags all at the same time, providing a reference annotation if you want to classify your transfrags according to known, novel, etc.
3) Give the stdout.combined.gtf to cuffdiff, along with your original SAM alignments from the samples. Cuffdiff will re-estimate the abundances of the transfrags in the GTF using the alignments in each sample, and do the differential expression testing at the same time.
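
A minimal sketch of the three steps above, written as a plan of command lines rather than something that runs the tools. The sample names, SAM paths, and per-sample directories are hypothetical; the sketch assumes cufflinks writes transcripts.gtf into its working directory, that SAM paths are absolute, and that cuffdiff takes the combined GTF followed by the SAM files in that order. Only the -r flag mentioned in this thread is used.

```python
def plan_workflow(samples, reference_gtf=None):
    """Build (argv, working_directory) pairs for the recommended workflow.

    samples maps a hypothetical sample name to the path of its SAM file.
    Nothing is executed here; the caller decides how to run the plan.
    """
    plan = []
    assemblies = []
    for name, sam in samples.items():
        # Step 1: one cufflinks run per sample, each in its own directory
        # (assumes transcripts.gtf lands in the working directory).
        plan.append((["cufflinks", sam], name))
        assemblies.append(f"{name}/transcripts.gtf")

    # Step 2: cuffcompare on all assemblies at once; the reference is optional.
    compare = ["cuffcompare"]
    if reference_gtf is not None:
        compare += ["-r", reference_gtf]
    plan.append((compare + assemblies, "."))

    # Step 3: cuffdiff gets the combined GTF plus the original alignments.
    plan.append((["cuffdiff", "stdout.combined.gtf", *samples.values()], "."))
    return plan


if __name__ == "__main__":
    demo = {"sampleA": "/data/sampleA.sam", "sampleB": "/data/sampleB.sam"}
    for argv, cwd in plan_workflow(demo, "reference.gtf"):
        print(f"(cd {cwd} && {' '.join(argv)})")
```

Keeping the plan as data makes it easy to inspect the exact commands before running anything.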

Optionally, you may wish to clean up the stdout.combined.gtf before running cuffdiff, to remove partial transfrags that resulted from low depth of sequencing coverage in one of the samples. We like to perform differential testing only on transcripts that are either already known to annotation or that we've assembled in two different samples independently.

As far as how cuffcompare assigns p_id and tss_id attributes:

* p_id is assigned just using the CDS records in the reference GTF. If there are no CDS records, there will be no p_ids. Similarly, if you run cuffcompare without a reference annotation along with your sample assemblies, there will be no p_id attributes in stdout.combined.gtf
* tss_id is assigned based on where the 5' ends of transfrags are: two transcripts on the same strand that share bases have the same TSS iff their 5' ends start within 100bp of each other. This threshold is chosen based on our observation that depth of sequencing doesn't always reach to the end of the true transcript on either end. You can change it with the -d option (which I just realized is not listed in the manual - I will update it).

All this is to say that if you're hoping to just use a reference GTF with cuffdiff, you'll need to add those p_id and tss_id attributes yourself. You can do this with cuffcompare too, using a little hack:

cuffcompare -r reference.gtf reference.gtf reference.gtf

This will spit out a version of reference.gtf in stdout.combined.gtf that has the p_id and tss_id attributes attached.
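
The tss_id rule described above can be sketched as follows. This is an illustration of the rule as stated in this post, not cuffcompare's code: the Transfrag type is hypothetical, "share bases" is approximated by overlap of the genomic spans, and "within 100bp" is taken to include a distance of exactly 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfrag:
    name: str
    strand: str  # "+" or "-"
    start: int   # leftmost coordinate, inclusive
    end: int     # rightmost coordinate, inclusive

    @property
    def five_prime(self):
        # The 5' end is the leftmost base on "+" and the rightmost on "-".
        return self.start if self.strand == "+" else self.end


def share_tss(a, b, max_dist=100):
    """True if a and b would get the same TSS under the stated rule."""
    same_strand = a.strand == b.strand
    # Assumption: span overlap stands in for "share bases".
    overlap = a.start <= b.end and b.start <= a.end
    close = abs(a.five_prime - b.five_prime) <= max_dist
    return same_strand and overlap and close
```

The max_dist parameter plays the role of the threshold that the -d option changes.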

Last edited by Cole Trapnell; 02-09-2010 at 09:59 PM.
Old 02-09-2010, 10:04 PM   #5
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by chapmandu2 View Post
Hi Cole,

Thanks for the new release. It looks really comprehensive, and I look forward to trying it on my Illumina datasets. Do you have any plans to include ABI Solid support for TopHat and Cufflinks, especially now that bowtie supports colourspace?

Many thanks.
Cufflinks should *in theory* already support Colorspace, since it takes SAM input, and doesn't call expressed SNPs by itself (yet). TopHat will hopefully support Colorspace sometime this spring. I've got a number of other features in TopHat and Cufflinks I need to get to, and I have to finish my thesis and graduate - so I can't give a timeline. However, it's an often requested feature, so I'd like to add support.
Old 02-22-2010, 03:48 AM   #6
Kasycas
Member
 
Location: Dublin, Ireland

Join Date: Sep 2009
Posts: 22
Default

Hi Cole,

Thanks for the new release! I've been trying to use cuffdiff as described above. It runs for a while and then terminates as follows:

Importance sampling posterior distribution
isoform TCONS_00000803 has no p_id, no CDS grouping analysis available here
Quantitating samples in locus [ chr1:152014391-152019257 ]
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
what(): Error in function boost::math::cdf(const normal_distribution<d>&, d): Random variate x is nan, but must be finite!
Aborted





I don't expect this is caused by the missing p_id, since the same message appears earlier in the run without terminating the program. However, I've tried using cuffdiff on cuffcompare stdout.combined.gtf files that were derived with UCSC annotation and with Ensembl annotation, and both runs terminate after a similar message (isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here).

Would you know why this is happening?

Regards,

Karen
Old 02-22-2010, 09:33 AM   #7
Kasycas
Member
 
Location: Dublin, Ireland

Join Date: Sep 2009
Posts: 22
Default

One more thing, on a slightly separate issue. According to the manual, the cuffcompare stdout.tracking output should contain:

Each of the columns after the fifth have the following format:
qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>


However, I have four numerical columns after the <FMI>, not three. What does the fourth one relate to?

Example:
q1:ENSG00000188076|ENST00000342878|100|12.188023|11.834710|12.541337|11.044084

Thanks,

Karen
Old 02-22-2010, 10:09 AM   #8
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by Kasycas View Post
Hi Cole,

Thanks for the new release! I've been trying to use cuffdiff as described above. It runs for a while and then terminates as follows:

Importance sampling posterior distribution
isoform TCONS_00000803 has no p_id, no CDS grouping analysis available here
Quantitating samples in locus [ chr1:152014391-152019257 ]
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
Calculating intial MLE
Tossing likely garbage isoforms
Revising MLE
Importance sampling posterior distribution
isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
what(): Error in function boost::math::cdf(const normal_distribution<d>&, d): Random variate x is nan, but must be finite!
Aborted





I don't expect this is caused by the missing p_id, since the same message appears earlier in the run without terminating the program. However, I've tried using cuffdiff on cuffcompare stdout.combined.gtf files that were derived with UCSC annotation and with Ensembl annotation, and both runs terminate after a similar message (isoform TCONS_00002699 has no p_id, no CDS grouping analysis available here).

Would you know why this is happening?

Regards,

Karen
Another user reported this to me a few days ago, and I fixed it yesterday. It's a divide by zero error in the Jensen-Shannon variance calculation. I'll be releasing a fix in a few days. Please sign up for the mailing list if you haven't already - you'll get an email once I make the release.
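
For context, a minimal sketch of a Jensen-Shannon divergence between two sets of isoform abundances, with a guard against an all-zero input. This is not Cufflinks' code, and where exactly the division by zero occurred in cuffdiff is not stated in this thread; the sketch only assumes that normalizing abundances by a total of zero is one way such a calculation can go wrong. In compiled floating-point code, 0.0/0.0 silently yields NaN, which is consistent with the "x is nan" message above.

```python
import math


def normalize(abundances):
    """Scale abundances to sum to 1, or return None if they sum to zero."""
    total = sum(abundances)
    if total == 0:
        # Guard: without this check the division below would be 0/0.
        return None
    return [x / total for x in abundances]


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p_abund, q_abund):
    """Jensen-Shannon divergence, or None when either input is all zeros."""
    p = normalize(p_abund)
    q = normalize(q_abund)
    if p is None or q is None:
        return None
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2
```

Returning None for an unexpressed locus lets the caller skip the test instead of passing NaN into a downstream distribution function.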
Old 02-22-2010, 10:11 AM   #9
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by Kasycas View Post
One more thing, on a slightly separate issue. According to the manual, the cuffcompare stdout.tracking output should contain:

Each of the columns after the fifth have the following format:
qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>


However, I have four numerical columns after the <FMI>, not three. What does the fourth one relate to?

Example:
q1:ENSG00000188076|ENST00000342878|100|12.188023|11.834710|12.541337|11.044084

Thanks,

Karen
The last column is the estimated depth of read coverage for that transfrag. Apologies - I will update the manual.
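
A minimal sketch of parsing one such per-sample entry, assuming the seven-field layout discussed here (the six fields from the manual plus the coverage column). The TrackingEntry type and its field names are hypothetical.

```python
from typing import NamedTuple


class TrackingEntry(NamedTuple):
    sample: str
    gene_id: str
    transcript_id: str
    fmi: int
    fpkm: float
    conf_lo: float
    conf_hi: float
    coverage: float  # estimated depth of read coverage for the transfrag


def parse_tracking_entry(field):
    """Parse 'qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<lo>|<hi>|<cov>'."""
    sample, _, rest = field.partition(":")
    gene_id, transcript_id, fmi, fpkm, lo, hi, cov = rest.split("|")
    return TrackingEntry(sample, gene_id, transcript_id, int(fmi),
                         float(fpkm), float(lo), float(hi), float(cov))


if __name__ == "__main__":
    example = ("q1:ENSG00000188076|ENST00000342878|100|"
               "12.188023|11.834710|12.541337|11.044084")
    print(parse_tracking_entry(example))
```

The unpacking raises ValueError if an entry does not have exactly seven fields, which makes a format change easy to notice.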
Old 02-22-2010, 03:47 PM   #10
Lesley
Junior Member
 
Location: New Zealand

Join Date: Jan 2008
Posts: 8
Default

Quote:
Originally Posted by Cole Trapnell View Post
Without tss_id and p_id attributes, Cufflinks will simply test for differential expression of transcripts and genes. You can attach these attributes to your own GTF file, but for convenience, cuffcompare now outputs a single file containing the "union" of all the assembled transfrags you give it. So the basic workflow we recommend is:

1) Assemble each sample with cufflinks
2) Run cuffcompare on the sample transfrags all at the same time, providing a reference annotation if you want to classify your transfrags according to known, novel, etc.
3) Give the stdout.combined.gtf to cuffdiff, along with your original SAM alignments from the samples. Cuffdiff will re-estimate the abundances of the transfrags in the GTF using the alignments in each sample, and do the differential expression testing at the same time.

Optionally, you may wish to clean up the stdout.combined.gtf before running cuffdiff, to remove partial transfrags that resulted from low depth of sequencing coverage in one of the samples. We like to perform differential testing only on transcripts that are either already known to annotation or that we've assembled in two different samples independently.

As far as how cuffcompare assigns p_id and tss_id attributes:

* p_id is assigned just using the CDS records in the reference GTF. If there are no CDS records, there will be no p_ids. Similarly, if you run cuffcompare without a reference annotation along with your sample assemblies, there will be no p_id attributes in stdout.combined.gtf
* tss_id is assigned based on where the 5' ends of transfrags are: two transcripts on the same strand that share bases have the same TSS iff their 5' ends start within 100bp of each other. This threshold is chosen based on our observation that depth of sequencing doesn't always reach to the end of the true transcript on either end. You can change it with the -d option (which I just realized is not listed in the manual - I will update it).

All this is to say that if you're hoping to just use a reference GTF with cuffdiff, you'll need to add those p_id and tss_id attributes yourself. You can do this with cuffcompare too, using a little hack:

cuffcompare -r reference.gtf reference.gtf reference.gtf

This will spit out a version of reference.gtf in stdout.combined.gtf that has the p_id and tss_id attributes attached.
Thanks for the info on the reference GTF. I downloaded both the FASTA and the GTF from Ensembl and ran into the chr naming problem. Now, however, when I run cuffcompare on the reference annotation I get tss_ids but no p_ids, even though the original GTF has CDS information.

I also got the following warnings when running cuffcompare on the cufflinks output and the fixed GTF file. I guess they have something to do with the cufflinks GTF files, since there are two of them.

Warning: found 26695 transcripts with undetermined strand.
Warning: found 44851 transcripts with undetermined strand.

Cuffcompare then exits.

Any help on moving forward with cufflinks will be greatly appreciated.

Cheers,
Lesley
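
One way to check a combined GTF for the situation described above is to count how many transcripts carry tss_id and p_id attributes. A minimal sketch, assuming standard nine-column, tab-separated GTF lines with quoted attribute values; the function names are hypothetical and unquoted values are ignored.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'([A-Za-z_][\w.]*)\s+"([^"]*)"')


def parse_attributes(column9):
    """Return the quoted key/value pairs from a GTF attribute column."""
    return dict(ATTR_RE.findall(column9))


def count_missing(gtf_lines, wanted=("tss_id", "p_id")):
    """Return (number of transcripts, {attribute: transcripts lacking it})."""
    seen = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        # A transcript counts as having an attribute if any record has it.
        seen.setdefault(tid, set()).update(k for k in wanted if k in attrs)
    missing = Counter()
    for present in seen.values():
        for key in wanted:
            if key not in present:
                missing[key] += 1
    return len(seen), dict(missing)
```

Called on an open file handle, for example `count_missing(open("stdout.combined.gtf"))`, it reports how many transcripts lack each attribute.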
Old 03-03-2010, 06:58 AM   #11
seqfast
Member
 
Location: SF Bay Area

Join Date: Aug 2008
Posts: 16
Default Error messages

already reported ...

Last edited by seqfast; 03-03-2010 at 07:42 AM.
Old 03-03-2010, 11:32 AM   #12
jebe
Junior Member
 
Location: New Mexico

Join Date: Nov 2009
Posts: 9
Default cuffdiff considers only X, Y, and MT loci

Hi,

I ran tophat using the h_sapiens_37_asm index and converted the chromosome accessions in the accepted_hits.sam file to their corresponding number/letter (1,2,X,Y,MT). I wanted the chromosome notation to match the chromosome notation in the ensembl gtf file (Homo_sapiens.GRCh37.56.gtf). Next I ran cufflinks on each sample using the converted sam file output by tophat. Then I ran cuffcompare using the transcripts.gtf files from each sample (output by cufflinks) along with my reference gtf above. Finally, I fed the converted sam files and combined.gtf file into cuffdiff. Cuffdiff runs without error; however, it only considers loci on the X, Y and MT chromosomes. Has anyone else experienced this?

Thank you in advance for any advice.
Old 03-03-2010, 05:25 PM   #13
Xi Wang
Senior Member
 
Location: MDC, Berlin, Germany

Join Date: Oct 2009
Posts: 317
Default

Quote:
Originally Posted by jebe View Post
Hi,

I ran tophat using the h_sapiens_37_asm index and converted the chromosome accessions in the accepted_hits.sam file to their corresponding number/letter (1,2,X,Y,MT). I wanted the chromosome notation to match the chromosome notation in the ensembl gtf file (Homo_sapiens.GRCh37.56.gtf). Next I ran cufflinks on each sample using the converted sam file output by tophat. Then I ran cuffcompare using the transcripts.gtf files from each sample (output by cufflinks) along with my reference gtf above. Finally, I fed the converted sam files and combined.gtf file into cuffdiff. Cuffdiff runs without error; however, it only considers loci on the X, Y and MT chromosomes. Has anyone else experienced this?

Thank you in advance for any advice.
Did you try converting the chromosome notation in the Ensembl GTF to chr1, chr2, ..., chrX, chrY, and chrM? I think converting in this direction is much better.
__________________
Xi Wang
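
A minimal sketch of the conversion suggested above, plus a helper for comparing the sequence names used in a GTF and a SAM file. The function names are hypothetical, and the renaming uses a simple prefix rule that only covers names like 1, 2, X, Y, and MT; unplaced scaffolds would need a proper mapping table.

```python
def to_ucsc_name(name):
    """Map an Ensembl-style name (1, X, MT) to chr1, chrX, chrM."""
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"
    # Assumption: a plain "chr" prefix is enough for the main chromosomes.
    return "chr" + name


def rename_gtf_line(line):
    """Rewrite the first (seqname) column of one GTF line."""
    if line.startswith("#") or not line.strip():
        return line
    seqname, sep, rest = line.partition("\t")
    return to_ucsc_name(seqname) + sep + rest


def names_in_column(lines, index, comment_prefix):
    """Collect the distinct values of one tab-separated column."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith(comment_prefix):
            continue
        names.add(line.rstrip("\n").split("\t")[index])
    return names
```

For a GTF the sequence name is column index 0 with "#" comments; for SAM it is column index 2 with "@" header lines. Comparing the two sets before running cuffdiff shows any names present in one file but not the other.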
Old 03-04-2010, 07:28 AM   #14
blackgore
Member
 
Location: UK

Join Date: Sep 2009
Posts: 20
Default

This may be a naive question, as I'm only about to start using Cufflinks (Bowtie and TopHat seem great, though), but I have not been able to find any documentation on differential expression analysis when groups of samples are involved. Can you specify that certain samples are replicates, to be treated as a group when running differential expression analysis, and if so, how?
Old 03-13-2010, 03:11 PM   #15
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by Cole Trapnell View Post
Another user reported this to me a few days ago, and I fixed it yesterday. It's a divide by zero error in the Jensen-Shannon variance calculation. I'll be releasing a fix in a few days. Please sign up for the mailing list if you haven't already - you'll get an email once I make the release.
Cole,

First, thanks for an excellent software stack.

Was the release you are referring to > 0.8.1? I am using 0.8.1 (the latest available on the web site) and am experiencing this problem. It seems that since 0.8.1 was released on 2/13/2010 and you wrote the above on 2/22/2010, the fix would be in a version later than 0.8.1. I hate to be a pest; I have no doubt you are very busy and dealing with (L)users is the last thing you need, but I'm a little stymied by this bug.

Thanks again.

P.S. Yes, I just subscribed to the mailing list.
Old 03-15-2010, 11:48 AM   #16
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by kmcarr View Post
Cole,

First, thanks for an excellent software stack.

Was the release you are referring to > 0.8.1? I am using 0.8.1 (the latest available on the web site) and am experiencing this problem. It seems that since 0.8.1 was released on 2/13/2010 and you wrote the above on 2/22/2010, the fix would be in a version later than 0.8.1. I hate to be a pest; I have no doubt you are very busy and dealing with (L)users is the last thing you need, but I'm a little stymied by this bug.

Thanks again.

P.S. Yes, I just subscribed to the mailing list.
Sorry for being unclear - I meant that I fixed the bug in version control, not that I fixed the bug and released a new version. The fix will be released in the upcoming 0.8.2. I had hoped to have it out early this week, but a number of things came up (including me getting pretty sick), and I'm still in the middle of some changes, which then need to be tested. Thanks for your patience.
Old 03-15-2010, 12:32 PM   #17
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by Cole Trapnell View Post
Sorry for being unclear - I meant that I fixed the bug in version control, not that I fixed the bug and released a new version. The fix will be released in the upcoming 0.8.2. I had hoped to have it out early this week, but a number of things came up (including me getting pretty sick), and I'm still in the middle of some changes, which then need to be tested. Thanks for your patience.
Thanks for the update Cole; I'll keep an eye out for the new version.

(Hope you're feeling better.)
Old 03-18-2010, 12:42 PM   #18
jebe
Junior Member
 
Location: New Mexico

Join Date: Nov 2009
Posts: 9
Default cuffcompare tracking file

The output from cuffcompare stdout.tracking, according to the manual and replies on this website is as follows:

qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<read coverage>

In my output I have the following:
q1:s5.202|s5.202.0|100|0.000000|0.409935|4.057937|4.537815

I'm just wondering how the conf_lo and conf_hi can be non-zero values, but the fpkm is zero. Has anyone else seen this result?

Thank you.
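
A small sketch for finding other entries like this one, assuming the seven-field layout shown above. It does not explain why the values disagree; it only flags entries whose FPKM falls outside the reported confidence interval. The function name is hypothetical.

```python
def fpkm_outside_interval(entry):
    """True if the FPKM is not within [conf_lo, conf_hi] for one entry."""
    # Drop the 'qJ:' sample prefix, then split the pipe-separated fields.
    values = entry.split(":", 1)[1].split("|")
    # Field order: gene_id, transcript_id, FMI, FPKM, conf_lo, conf_hi, coverage
    fpkm, conf_lo, conf_hi = (float(v) for v in values[3:6])
    return not (conf_lo <= fpkm <= conf_hi)


if __name__ == "__main__":
    print(fpkm_outside_interval(
        "q1:s5.202|s5.202.0|100|0.000000|0.409935|4.057937|4.537815"))
```

Running this over every per-sample column of stdout.tracking would show whether the inconsistency is isolated or widespread.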
Old 03-24-2010, 10:14 PM   #19
xiaomimeow
Junior Member
 
Location: Palo Alto

Join Date: Mar 2010
Posts: 4
Default

I was able to get cuffdiff to work by going to the source code and writing some lines to handle the divisions by zero. Installing boost and then compiling thereafter took a great deal of trial and error, though.

I am a bit puzzled by the results. There are some genes for which cufflinks/cuffcompare/cuffdiff asserts multiple isoforms (I used the Ensembl gtf). However, plots of the tophat mappings do not support such an interpretation. In other words, there is nothing supporting some splice variants, yet they are deemed to be present. Does anyone else feel that the isoform calls are too permissive?
Old 03-26-2010, 04:08 PM   #20
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

I just wanted to announce that v0.8.2 of Cufflinks addresses the divide by zero, along with a number of other issues.
All times are GMT -8.