SEQanswers

Old 05-29-2012, 08:05 AM   #1
mrfox
Senior Member
 
Location: USA

Join Date: Aug 2010
Posts: 103
Differential gene expression without replicates: edgeR, DESeq?

Hi all,

I am a Cufflinks user and I am trying out other popular gene expression analysis tools such as edgeR and DESeq. In most of my projects we only have one Normal and one Tumor sample. Though there has been a lot of discussion, it is still unclear to me whether edgeR or DESeq is "better" than cuffdiff when there are no biological replicates.

Any advice will be appreciated.
Old 05-30-2012, 04:50 AM   #2
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

In the complete absence of replicates, I don't think any statistical tool is going to be worth a dang for differential gene expression. All you can do is look at simple differences in counts, with no means at all of assessing the significance of those differences. The statistics cannot compensate for a complete lack of adequate data for the analysis in question, and without some minimal number of replicates (3 is really the minimum, 4 or more would be far better), there is no way to assign statistical significance.

I know the vignettes for tools like edgeR talk about good performance "...even for experiments with minimal levels of biological replication" (quoting from the edgeR manual), but note the use of the word "minimal". A complete absence of replication is not minimal replication, and in the complete absence of replication, you cannot perform statistical tests of significance for differences.

And since you have no statistical power at all, comparing different analytical tools seems pointless to me.
Old 05-30-2012, 04:58 AM   #3
lexa
Member
 
Location: MPI

Join Date: Jun 2010
Posts: 17

I have to agree with mbblack. You should try to gain more statistical power by getting at least 3 replicates per treatment. Otherwise your comparison is not really meaningful.
Old 05-30-2012, 06:21 AM   #4
mrfox
Senior Member
 
Location: USA

Join Date: Aug 2010
Posts: 103

Many thanks, mbblack and lexa.

Lack of replicates is indeed an issue for some of my projects. Unfortunately, these collaborators will not proceed to sequence replicates until they find something interesting in the current data.
They even wish to have a short, "reliable" list of DE or differentially spliced genes that makes sense, while we are not able to achieve this without replicates. It is really a dilemma.

It is important for biologists to discuss with bioinformaticians before they submit the samples for sequencing.
Old 05-30-2012, 06:25 AM   #5
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197

What you could do is run both tools and show them the resulting gene list from each, along with the intersection (Venn diagram?).
Old 05-30-2012, 06:27 AM   #6
lexa
Member
 
Location: MPI

Join Date: Jun 2010
Posts: 17

That's hard. Anyway, you could try to get a 'reliable' gene set by running different methods and taking the overlap between them. Maybe you should only keep genes reported by at least 2 different methods. Then do a literature search for the genes you found. Maybe some of them are already described.
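The overlap idea can be sketched in a few lines of Python. This is only an illustration: the gene names and the three candidate lists are made up, and the helper `genes_supported_by` is a hypothetical name, not part of any of the tools. Agreement between methods narrows the list, but it does not supply the significance that replicates would.

```python
from collections import Counter

def genes_supported_by(calls_by_method, min_methods=2):
    """Return genes reported by at least `min_methods` of the given methods."""
    support = Counter()
    for genes in calls_by_method.values():
        support.update(set(genes))  # count each gene once per method
    return sorted(g for g, n in support.items() if n >= min_methods)

# Hypothetical candidate lists; real ones would come from each tool's output.
calls = {
    "cuffdiff": ["GENE_A", "GENE_B", "GENE_C"],
    "edgeR": ["GENE_B", "GENE_C", "GENE_D"],
    "DESeq": ["GENE_C", "GENE_E"],
}

shared_by_two = genes_supported_by(calls, min_methods=2)
shared_by_all = genes_supported_by(calls, min_methods=len(calls))
print(shared_by_two)
print(shared_by_all)
```

Raising `min_methods` to the number of methods gives the strict intersection; 2 matches the "at least 2 different methods" suggestion.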
Old 05-30-2012, 07:07 AM   #7
Tom Bair
Member
 
Location: Iowa

Join Date: Oct 2008
Posts: 28

edgeR does mention a method for dealing with a lack of replication by assigning a dispersion value:
Quote:
simply pick a reasonable dispersion value, based on your experience with similar data, and use that. Although subjective, this is still more defensible than assuming Poisson variation. Typical values are dispersion=0.4 for human data, dispersion=0.1 for data on genetically identical model organisms or dispersion=0.01 for technical replicates.
More detail is in the User Guide. It is an option anyway, though replication is always better.
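To make the role of the dispersion concrete, here is a small Python sketch of the usual negative binomial variance, mean + dispersion * mean^2, which reduces to the Poisson case when the dispersion is 0. The mean count of 100 is a made-up value, and the function names are hypothetical; this is not edgeR code, just the arithmetic behind the quoted values.

```python
import math

def nb_variance(mean, dispersion):
    """Negative binomial variance: Poisson part (mean) plus dispersion * mean**2."""
    return mean + dispersion * mean ** 2

def cv_of_count(mean, dispersion):
    """Coefficient of variation (sd / mean) of a count with this mean."""
    return math.sqrt(nb_variance(mean, dispersion)) / mean

mean_count = 100.0  # hypothetical expected read count for one gene
for phi in (0.0, 0.01, 0.1, 0.4):  # 0.0 is the Poisson case
    print(phi, nb_variance(mean_count, phi), round(cv_of_count(mean_count, phi), 3))
```

For anything but very low counts the dispersion term dominates, so the chosen value largely decides how much spread the model expects between samples.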
Old 05-30-2012, 07:10 AM   #8
mrfox
Senior Member
 
Location: USA

Join Date: Aug 2010
Posts: 103

As I recall, I tried that a long time ago. I found that the results are sensitive to the chosen dispersion value.
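That sensitivity is easy to reproduce with a rough calculation. The Python sketch below is an assumption-laden illustration, not edgeR's exact test: it uses a Wald-type normal approximation in which the variance of a log count is taken as 1/count + dispersion, and the counts 100 and 300 are invented.

```python
import math

def approx_two_sided_p(count_a, count_b, dispersion):
    """Rough Wald-type p-value for a log fold change between two single counts.

    Uses the delta-method approximation var(log count) ~ 1/count + dispersion.
    This is NOT the exact test that edgeR performs; it is only an illustration.
    """
    log_fc = math.log(count_b / count_a)
    se = math.sqrt(1.0 / count_a + 1.0 / count_b + 2.0 * dispersion)
    z = abs(log_fc) / se
    return math.erfc(z / math.sqrt(2.0))  # two-sided normal tail probability

normal_count, tumor_count = 100, 300  # hypothetical counts, equal library sizes
dispersions = (0.0, 0.01, 0.1, 0.4)
p_values = [approx_two_sided_p(normal_count, tumor_count, phi) for phi in dispersions]
for phi, p in zip(dispersions, p_values):
    print(phi, p)
```

Under these assumptions the same 3-fold difference moves from overwhelmingly "significant" at dispersion 0 to unremarkable at 0.4, so the gene list is driven by a number that was chosen rather than estimated.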
Old 05-30-2012, 08:01 AM   #9
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

You need to discuss this with them. Without replicates, there is no way to actually give them the answers they seek. A "reliable" list of DE genes? That cannot possibly be derived without some statistical significance assigned to the results, and you cannot have any statistically significant results without replicates. At best, all you could give them would be a ranked list of simple differences in gene counts or RPKM for mapped genes, with no hint of how much variance there may be around those differences.

They really need to do a proper pilot study, with 3-5 replicates to see just what they have to work with. Otherwise, all you can tell them is what is different, but with no statistical ranking of significance nor any idea of how variable those differences may be.

It is not that you have minimal statistical power without replicates, you have none. All you have is simple numeric differences of some count or normalized values, and nothing more. And you have no idea at all if those differences are real biological differences, or random experimental noise.

And there is nothing unique to RNAseq data about that - you cannot compute statistics on a simple difference between two single numbers.
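The point can be shown with nothing more than the Python standard library: the sample variance divides by n - 1, so with a single observation per group there is nothing to estimate. The counts below are invented, and `sample_variance_or_none` is a hypothetical helper.

```python
import statistics

def sample_variance_or_none(values):
    """Return the sample variance, or None when fewer than two values exist."""
    try:
        return statistics.variance(values)
    except statistics.StatisticsError:
        return None  # n - 1 = 0 degrees of freedom: variance is undefined

single_normal = [120]               # hypothetical count, one library
replicated_normal = [120, 95, 140]  # hypothetical counts, three libraries

print(sample_variance_or_none(single_normal))
print(sample_variance_or_none(replicated_normal))
```

With one library the function has no variance to return, while three libraries give a usable estimate, which is why any test built on a single pair has to borrow its variance from an assumption.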
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
Old 05-30-2012, 08:12 AM   #10
mrfox
Senior Member
 
Location: USA

Join Date: Aug 2010
Posts: 103

I could not agree more. Inferring a short list of DE genes from an expensive (compared to array data) RNA-Seq run for even a single pair of samples is some collaborators' dream. Some even prefer to spend money on sequencing more cell line types rather than replicates. I find it hard to persuade them.

Without replicates, all we can provide is a list of DE genes based on statistical models such as the Poisson, but this will never reflect the truth without sufficient replicates.


Old 05-23-2013, 05:12 AM   #11
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
edgeR without replicates

Knowing that it is unwise to do experiments without replication, I find myself in exactly that situation (pooled samples).

I've analysed these data with older versions of DESeq, but now would also like to try edgeR. I can't seem to decipher from the vignette exactly how one does this analysis without replicates. Is anyone able to help me out or share a script?

It's pretty clear that both the DESeq and edgeR camps are now strongly discouraging such efforts (does DESeq2 even still incorporate such analyses?), but I still need to give it a go in this case.

Thanks!
Old 05-23-2013, 06:08 AM   #12
mbblack
Senior Member
 
Location: Research Triangle Park, NC

Join Date: Aug 2009
Posts: 245

To be honest, my opinion is that the first option mentioned in the edgeR vignette is really the only valid approach to follow in that situation. To quote from page 18:

"1. Be satisfied with a descriptive analysis, that might include an MDS plot and an analysis of fold changes. Do not attempt a significance analysis. This may be the best advice."

In other words, make your argument for significantly differentially expressed genes based solely on the magnitude of measured differences between samples and accept that you cannot perform any reliable or valid statistical significance testing. I just think it is pointless to spend a lot of time running algorithms or code on a data set that fundamentally cannot be analyzed statistically.
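A descriptive fold-change ranking of that kind might look like the Python sketch below. Everything specific in it is an assumption for illustration: the gene names and counts are invented, library-size scaling to counts per million stands in for whatever normalization is actually used, and the pseudocount of 1 is only there to avoid dividing by zero.

```python
import math

def counts_per_million(counts):
    """Scale raw counts by library size so two libraries are comparable."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}

def rank_by_log2_fold_change(normal, tumor, pseudocount=1.0):
    """Rank genes by |log2(tumor CPM / normal CPM)|, largest first."""
    cpm_n, cpm_t = counts_per_million(normal), counts_per_million(tumor)
    log2_fc = {
        g: math.log2((cpm_t[g] + pseudocount) / (cpm_n[g] + pseudocount))
        for g in normal  # assumes both libraries report the same genes
    }
    return sorted(log2_fc.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical raw counts for one Normal and one Tumor library.
normal = {"GENE_A": 500, "GENE_B": 40, "GENE_C": 460}
tumor = {"GENE_A": 480, "GENE_B": 400, "GENE_C": 120}

ranking = rank_by_log2_fold_change(normal, tumor)
for gene, lfc in ranking:
    print(gene, round(lfc, 2))
```

The output is a ranking by effect size only; it carries no p-values, which is exactly the limitation of a descriptive analysis.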

Basically, what is the point of the effort if the stats are meaningless or open to vigorous negative criticism?
__________________
Michael Black, Ph.D.
ScitoVation LLC. RTP, N.C.
Old 05-23-2013, 07:20 AM   #13
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82

Thanks, option 1 is basically what we are doing, but we are also trying to scrutinize the data in as many ways as possible. We pooled 10 individuals per library, and our results seem not hopeless in that we can see some of the things we know we should see, and these do hold up to DESeq stats ("working without replicates"). But it's the novel stuff that is more problematic. We'll be finding out via qPCR and in situs, I suppose, how well these stats hold up. But yes, not so optimistic. I should also say that we have 3 groups, not 2, so we at least have a bit more information on variability.

Last edited by chrisbala; 05-23-2013 at 07:23 AM.