SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Raw read counts for RNAseq biofreak Introductions 13 01-16-2013 05:28 AM
Raw read counts for RNAseq biofreak RNA Sequencing 2 06-15-2011 05:56 AM
DESeq: Read counts vs. BP counts burkard Bioinformatics 0 08-05-2010 11:52 PM
Cufflinks convert FPKM to Read Counts zee Bioinformatics 0 03-08-2010 06:35 PM
How to convert cufflinks output to raw counts jebe RNA Sequencing 0 01-26-2010 11:29 AM

Reply
 
Thread Tools
Old 07-01-2010, 10:30 AM   #1
jebe
Junior Member
 
Location: New Mexico

Join Date: Nov 2009
Posts: 9
Default Converting FPKM from Cufflinks to raw counts for DESeq

How to I convert the FPKM values output by Cufflinks in genes.expr to raw counts to import into DESeq?
jebe is offline   Reply With Quote
Old 07-26-2010, 06:17 AM   #2
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default counting reads

hi,

i stuggled to figure this out myself for a while.

I've wound up using Simon Anders' HT-Seq

http://http://www-huber.embl.de/user...unt.html#count

The starting point for this, however, is the sam file, not FPKM.

Chris
chrisbala is offline   Reply With Quote
Old 07-27-2010, 09:32 AM   #3
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

@Chrisbala:

We've discussed this issue on a number of other threads before. It doesn't make sense to convert FPKM values to read count estimates for the purposes of running DESeq.

If you want to test gene level differential expression you can just use HT-Seq as you noted. However you should know that differential expression by count data will not allow you to compare expression values of different transcripts (or genes) within a single experiment nor will you be able to test for isoform level differential expression. Unfortunately the DESeq approach is simply not suited to that. Furthermore, even for gene level differential expression, DESeq can be inaccurate when genes overlap (or when reads cannot be assigned uniquely to a gene due to duplicates, etc.)
lpachter is offline   Reply With Quote
Old 08-13-2010, 07:07 PM   #4
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default

Yes this issue was discussed a number of times. One way to convert FPKM values is to multiply the FPKM values with transcript length and the number of reads mapped in million. Trascript length can be obtained using HTSeq.
I can't understand why it is not valid to convert FPKM values into counts and use edgeR or DESeq to test for differential expression.Especially when you want to use biological replicates for testing differential expression. Get the FPKM values by comparing expression between different samples by cufflinks. Then convert FPKM values into read counts and use any of the 'R' programs to test for differential expression.
Balat is offline   Reply With Quote
Old 08-15-2010, 08:53 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by lpachter View Post
Furthermore, even for gene level differential expression, DESeq can be inaccurate when genes overlap (or when reads cannot be assigned uniquely to a gene due to duplicates, etc.)
This is incorrect. If I have several samples, the probability that a read gets discarded due to it falling on overlapping genes is the same in all samples. Hence, under the null hypothesis of no differential expression, the count in each sample gets reduced by the same proportion, so that the effect cancels out. One might lose power because of the overlap, but the result of the test is correct.

This requires, of course, that one properly discard reads mapping to overlaps. This is what htseq-count ensures. Not that htseq-count would be more than a simple script but it is important to note that it does this. On the Bioconductor website, you will find (in the workshop materials section) several explanations on how to do the counting in R, and unfortunately, all these do not properly take care of overlaps.

Last edited by Simon Anders; 08-15-2010 at 09:20 AM.
Simon Anders is offline   Reply With Quote
Old 08-15-2010, 09:16 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by Balat View Post
I can't understand why it is not valid to convert FPKM values into counts and use edgeR or DESeq to test for differential expression.
If it were that each read maps to one transcript, you could multiply the FPKM values with the transcript length to get raw counts again.

However, the whole point of cufflinks is to deal with the fact that most reads will map to several transcripts, and each read can hence influence the FPKM values of all these transcripts, and it will definitely not augment each count by one.

A crucial aspect of DESeq and edgeR is that they assess the shot noise by assuming that each counting unit is evidence of one sequencing read, and hence the counting noise follows a Poisson distribution. In cufflinks' output this is not the case. Instead, cufflinks calculates the uncertainty for you using more involved math.

DESeq and edgeR use a simple way to include counting noise and go some lengths assessing biological noise. cufflinks offers you a more sophisticated estimate for the counting noise that can deal with reads mapping to multiple transcripts, but, at least so far, no means of assessing biological noise.

There are two fundamentally different tasks that you should not mix up, and that are served by different tools:

1. If you want to compare the abundance of two different transcripts in the same sample, cufflinks will allow you to do this even if these transcripts overlap with other transcripts that should stay out of the comparison.

2. If you have two different experimental conditions (or tissues or genotypes) and you want to know whether a given gene changes its expression strength due to the condition, you need to assess biological noise between replicates to know whether the observed difference is significantly stronger than the difference between replicates, i.e., whether it is really due to the change in the experimental condition and not just due to biological variability.

This is what DESeq and edgeR do. Of course they cannot see alternative splicing because they require you to lump all transcripts of a gene together.

The tool that sits awkwardly in between the to use cases is cuffdiff. It tests whether a transcript has different concentration in two samples. The problem with this is that many of its users forget that there is a lot of difference between asking (i) "Is the concentration of transcript X in the two samples different?" and (ii) "Is the difference in concentration sufficiently large to make it unlikely that it is only due to biological variability?" Only if you may say yes to (ii), you can attribute your observation to the fact that your samples had different experimental treatment.

This is why I stick to my claim that, as of now, there are no tools suitable to reliably associate changes in splicing isoforms with changes in experimental condition. Of course, this gap will be filled very soon.
Simon Anders is offline   Reply With Quote
Old 08-15-2010, 05:38 PM   #7
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default

Thanks Simon for the explanation. I am looking at the effect of a treatment on gene expression between samples with 3 biological replicates. I can test for the differential expression of genes under different treatments with the available tools but as you suggested there are no tools at this stage for measuring differential expression of isoforms under different treatments.
Balat is offline   Reply With Quote
Old 08-15-2010, 08:27 PM   #8
frozenlyse
Senior Member
 
Location: Australia

Join Date: Sep 2008
Posts: 136
Default

Quote:
Originally Posted by Simon Anders View Post
Of course they cannot see alternative splicing because they require you to lump all transcripts of a gene together.
Could you just use exon level counts? I guess this depends on the depth of sequencing and expression level of that exon, but has anyone tested it out?
frozenlyse is offline   Reply With Quote
Old 10-27-2010, 12:21 PM   #9
mmuratet
Member
 
Location: Huntsville AL

Join Date: Jul 2008
Posts: 13
Default

I have had success using the R limma package on fpkm values.
mmuratet is offline   Reply With Quote
Old 03-15-2011, 11:19 AM   #10
jdsv
Junior Member
 
Location: Wisconsin, USA

Join Date: Mar 2011
Posts: 3
Default

Simon: I have read your posts in a number of threads and message boards / mailing lists, and they have been helpful in clarifying some points I was questioning. I'm doing preliminary research for a possible RNA-Seq project for differential expression based on multiple experimental treatments later this year, and am trying to outline possible workflows. I have eventually settled on two general possibilities: RNA-Seq -> tophat -> htseq-count -> DESeq or RNA-Seq -> tophat -> cufflinks -> cuffdiff. Your comments here and in other places to the effect that cuffdiff is inappropriate for differentiating biological from treatment-induced differences in expression seem logical (at least with only a minimal understanding of the differences in statistical methods each employs), so it seems the former would be appropriate.

Quote:
Originally Posted by Simon Anders View Post
This is why I stick to my claim that, as of now, there are no tools suitable to reliably associate changes in splicing isoforms with changes in experimental condition. Of course, this gap will be filled very soon.
Would you still consider this to be true? I haven't been able to find anything in my search indicating that cuffdiff has changed in its method of handling biological vs. treatment variation, but researchers seem to be using it for this purpose (e.g. http://dx.doi.org/10.1371/journal.pone.0016266).

Thanks,
Jeremy
jdsv is offline   Reply With Quote
Old 03-15-2011, 02:40 PM   #11
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Jeremy

Quote:
Originally Posted by jdsv View Post
Would you still consider this to be true? I haven't been able to find anything in my search indicating that cuffdiff has changed in its method of handling biological vs. treatment variation, but researchers seem to be using it for this purpose (e.g. http://dx.doi.org/10.1371/journal.pone.0016266).
A while ago, the cufflinks authors announced that their new version of cuffdiff now handles biological replicates as well. This is stated on the cufflinks web page and was also said by the cufflinks authors here on SeqAnswers.

However, to my knowledge, they have not yet offered any more specific explanations. Does this mean that cuffdiff now estimates biological variation and accordingly tests in a more stringent way? If so, how does it do that?

I hope that the cuffdiff authors will soon publish some methods paper elaborating on this, but until then, there is no way to judge whether it is sound.

Should you wait for that? Personally, I consider it improper to use a tool before its method has been published; after all, you just trust that the tool's authors did a good job without anybody having double-checked it yet. On the other hand, I understand that a practitioner without sufficient expertise in statistics could not judge the soundness of a method anyway, so a publication doesn't help too much (unless you have great trust in peer review).

Concerning your problem: We are working on a method to test for alternative isoform regulation, and I am aware of at least two other groups who work on similar projects. We hope to release our tool soon, and I guess our competitors are not that far behind. So, you will soon have several methods to chose from, and if somebody can come up with an idea how to construct a suitable gold-standard test data set, we could even resolve by testing the issue of which methods are sound.

Simon
Simon Anders is offline   Reply With Quote
Old 03-18-2011, 12:09 PM   #12
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

Simon,
I agree that its fantastic that the DESeq paper is already published:
http://genomebiology.com/2010/11/10/R106
However I believe it was used by many people long before it appeared in print on the 20th of October 2010. If I recall, it was already being discussed long before that.

Now I know you submitted it on the 20th of April 2010, and it took a long time for it to get accepted (perhaps even longer than a standard paper?) but I still think its good that the method was available for people to use before it appeared in print.

With Cufflinks, we have been completely open and clear about exactly what our methods are doing. In fact, we hold ourselves to the highest standard of openness: namely open source. Any user can look at our software and now exactly what it is doing, and in fact many people have. We do not hide anything, contrary to your suggestions that we do.
Furthermore, we publish our methods in our code before they appear in print, always. This allows users to benefit from our methods before the peer review, and the open source means they can see what we are doing if they are interested in details. For example, we have been distributing the code to do bias correction even before the paper was published. It just appeared yesterday:
http://genomebiology.com/2011/12/3/R22/abstract

Cufflinks has performed isoform specific expression estimates since the first version was released, well over a year ago. Already the original version took into account variability in isoform estimates due to uncertainty because of ambiguous read counts when performing differential expression. It is true, that at the time, our underlying model was Poisson. This is in fact a very good model for a wide range of expression values, even when used for biological replicates. It is even better when coupled with bias correction which we now do.

In our more recent versions of Cufflinks, we have been working on improving our approach to differential expression even beyond the methods in our two papers so far. And to my knowledge, currently our software provides a solution to this (which many people are using), while your software doesn't. Until it does, and until there is code to do it, or a paper, I don't think your comments are of much use to either biologists or statisticians.
lpachter is offline   Reply With Quote
Old 03-18-2011, 12:49 PM   #13
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

lpachter: Why don't you update the cuffdiff documentation to describe how it works and how biological replicates are handled? While it is very nice that the code is open source, it would be much easier to understand what the code is doing if you could give some hints on the underlying model.
kopi-o is offline   Reply With Quote
Old 03-18-2011, 01:00 PM   #14
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

kopi-o: The preliminary replicate support introduced in Cufflinks 0.9.0 is currently being overhauled and expanded to address many of the concerns that Simon and others have raised. As part of Cufflinks 1.0, which will be released soon, we will include a complete description of what the software is doing on the website. We haven't decided whether to write a standalone paper about these enhancements.
Cole Trapnell is offline   Reply With Quote
Old 03-18-2011, 01:02 PM   #15
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Thanks, that's good to know.
kopi-o is offline   Reply With Quote
Old 03-19-2011, 04:31 AM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Lior,

reading again through my post #11 of Tuesday, I realize that it sounds as if I criticize you for not being open in exposing the methods underlying your software. You are perfectly right that it is not only standard practice but also desirable to make software available to the community as soon as possible even if the methods paper with the full explanation is still in preparation. Hence, please accept my apologies for sounding impatient; this was inappropriate.


I still stand by my claim that any test based on the Poisson distribution is unsuitable to assess whether differential expression may be attributed to the experimental treatment of interest, for the reasons I gave, e.g., in post #6 in this thread. In our DESeq paper we follow Robinson and Smyth in arguing that a proper estimation of overdispersion is essential for this. However, from your previous statements in our discussion on SeqAnswers half a year ago, I understood that the new capabilities of cufflinks are not about dispersion estimation but about bias reduction, and that this is because you do not see the need for the former independent of the latter.

Hence, my proper answer to Jeremy's question should have been that, yes, nothing has changed in the state of matters since our previous discussion. I still think that cuffdiff leaves an important issue unadressed, and, as you just emphasized again, you still disagree and think that my concern was neither valid for the initial nor the current version of cuffdiff. Judging from his posts, Cole seems to have a somewhat different stance. Your and Cole's description of the new biological replicates functionality seem to slightly differ in precisely the point I consider crucial. I hope I'll now find my answer by reading your brand-new paper.


As a side note: When we made DESeq available, end of 2009, we uploaded a draft of our paper to the Nature Precedings server, allowing users to read about the details of our method already then. I agree that this is not standard practice and that there are good reasons for authors to refrain from distributing preprints as liberally. However, at least for us, it turned out to be very beneficial, because it stimulated discussion about and use of DESeq, so that the rather long review process of our paper did not hurt us too much.

Simon
Simon Anders is offline   Reply With Quote
Old 03-30-2011, 01:38 PM   #17
townway
Member
 
Location: Rockville

Join Date: May 2009
Posts: 40
Default

Hi Simon,

I am working on RNA-seq data and try to use you scripts to go over it. But my data is from cell lines which treated with drugs differing in time. I wonder whether DESseq can handle time series data, if so, which part I should change?

Thank you



Quote:
Originally Posted by Simon Anders View Post
Lior,

As a side note: When we made DESeq available, end of 2009, we uploaded a draft of our paper to the Nature Precedings server, allowing users to read about the details of our method already then. I agree that this is not standard practice and that there are good reasons for authors to refrain from distributing preprints as liberally. However, at least for us, it turned out to be very beneficial, because it stimulated discussion about and use of DESeq, so that the rather long review process of our paper did not hurt us too much.

Simon
townway is offline   Reply With Quote
Old 03-30-2011, 11:29 PM   #18
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by townway View Post
I am working on RNA-seq data and try to use you scripts to go over it. But my data is from cell lines which treated with drugs differing in time. I wonder whether DESseq can handle time series data, if so, which part I should change?
DESeq allows you to perform pairwise comparisons, and, to my knowledge, the same is true for all other tools out there. So, you can pick pairs of time points and compare these. Using GLMs, you can also compare differences between pairs of time points for one drug with differences for another drug or for the untreated controls.

But which comparisons (contrasts) are useful to analyze data? Figuring this out is, in my opinion, the main challenge of time course data.

Of course, all these pairwise comparisons are a bit pedestrian, if you have more than a three or four time points. You might be more interested in curve fits, and this is a very different statistical task, with which I have little experience. I haven't seen yet any such analysis published for RNA-Seq data, but there is lots of paper on microarray time courses. Maybe the article by Hafemeister et al in the current issue of Bioinformatics is a good starting point. Translating such methods to the RNA-Seq settings is certainly something that needs to be done now.

S
Simon Anders is offline   Reply With Quote
Old 05-03-2011, 08:25 AM   #19
ecofriendly
Junior Member
 
Location: University of Wisconsin, Madison

Join Date: Nov 2010
Posts: 9
Default

I have a question for Simon and others regarding how to use shot noise when interpreting the DESeq output, i.e. differentially expressed genes within a given p-value cutoff.

It seems that Figure 1 of the DESeq vignette is important because it tells you at what expression level the result becomes uninterpretable - i.e. read counts are low and shot noise is high. In my dataset, which compares control and treatment across two biological replicates, shot noise is bigger than biological noise up until mean ~100 counts. Then in Figure 2, I noticed that the red line of estimated variance seems to fit the data, except at the far left of the figure, where the gene counts are lowest. These data say to me that I shouldn't trust gene counts that are lower than about 100, because there is a lot of unexplained variation for some genes at this level.

Yet in the output, I'm finding genes that nbinom test calls significant, even at very low expression levels. For example, there is one gene in my 5% FDR cutoff DGE list that changes from 5 counts to 22 counts. Yet I don't trust that this gene change is significant, given my observations about shot noise.

So my question is, why does DESeq allow gene changes with large shot noise associated with them to be called significant? And should I exclude gene changes at these low counts/ high shot noise from my downstream analysis?

If this question was answered elsewhere please direct me to that response.

Thanks very much,
Elena
ecofriendly is offline   Reply With Quote
Old 05-04-2011, 03:44 AM   #20
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Hi Elena

first of all: the whole point of DESeq is to take both shot noise and biological noise into account when testing for differential expression. So, there is no need to do any cut-offs on expression strengths.

For weakly expressed genes, shot noise is stronger, and hence, DESeq wants to see sttronger fold changes before it calls a change significant. This is why in Fig. 3 of our paper (which is the same as Fig. 4 of the vignette but using data that is a better example), the boundary between significant (red) and non-significant (black) changes rises sharply to the left.

In your case, going from 5 to 22 is such a strong change (a more than 4-fold increase) that it is significant despite the low counts.

I don't understand, by the way, your sentence "because there is a lot of unexplained variation for some genes at this level." What do you mean by "unexplained variance"?

Simon
Simon Anders is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:11 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO