SEQanswers

10-19-2009, 04:48 PM   #1
anar
Junior Member
Location: New Zealand
Join Date: Aug 2008
Posts: 6
Power calculations for expt design

Hi there
I gather that most people aren't bothering with replication for quantitative RNA-Seq experiments, that is, sequencing multiple biological replicate samples for each treatment under investigation. Of course it makes the expt ridiculously expensive! But I think it's really important. A very patient statistician is helping me with the design of a digital gene expression profiling experiment (RNA-Seq, either SOLiD or Illumina, haven't decided yet). The design includes 2 treatments and a number of biological reps for each treatment, and the aim is to detect differentially expressed genes between the 2 treatments.

I'd like to do some power calculations to determine the minimum number of reps for each treatment I can get away with, with and without use of sample multiplexing (i.e. multiplexing replicate samples in the same lane). For these calculations, I need an estimate of the between-sample variability of the final data, which I could get from an existing data set which uses this design. I'm having trouble finding one...

Can anyone help, either by providing a data set which uses biological replication, or providing a between-lane standard deviation (from normalised data) from such an expt, or simply by shedding light on variability between reps which one might normally expect to see in Illumina or SOLiD RNA-Seq data? I know it depends on the biological variability between samples, but I figure any information is better than none.

Thanks
Anar
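
A rough sketch of the kind of power calculation described above, for a single gene. It assumes log2 expression is roughly normal with the same between-sample standard deviation in both treatments, and uses the textbook two-sample normal approximation. The function name, the default alpha and power, and the example SD values are all illustrative assumptions, not values from any data set; there is no multiple-testing correction, and for very small n the normal approximation will be optimistic.

```python
import math
from statistics import NormalDist


def reps_per_group(sd_log2, log2_fold_change, alpha=0.05, power=0.8):
    """Replicates per treatment for a two-sided, two-sample z-test on log2 expression.

    sd_log2: assumed between-sample SD of log2 expression (same in both groups).
    log2_fold_change: smallest log2 difference between treatments worth detecting.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the target power
    n = 2 * ((z_alpha + z_beta) * sd_log2 / log2_fold_change) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical SDs: this is the number the post is asking other users for.
    for sd in (0.25, 0.5, 1.0):
        print(f"SD={sd}: {reps_per_group(sd, log2_fold_change=1.0)} reps per treatment")
```

The required number of reps grows with the square of the SD, which is why a realistic estimate of between-sample variability matters so much here.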
12-17-2009, 08:22 AM   #2
sjm
Member
Location: St Louis, MO
Join Date: Nov 2009
Posts: 27

Well, maybe everyone isn't using biological replicates, but I certainly am in experiments involving RNASeq of nontransgenic and transgenic mouse tissues...!
I could send you RPKM expression values (calculated via Tophat/Cufflinks) for n=4 replicates, 2 treatment groups, for a subset of detected transcripts or for everything we found. Post back if you're interested and we can figure out a way to send data (I don't have a good ftp system here, so probably e-mail and compressed files will be the go).
12-17-2009, 08:23 AM   #3
sjm
Member
Location: St Louis, MO
Join Date: Nov 2009
Posts: 27

By the way, you may also be interested to know that we've multiplexed 4 samples per Illumina GAII lane (i.e. barcoding system), but haven't tried examining single samples per lane.
12-17-2009, 07:00 PM   #4
anar
Junior Member
Location: New Zealand
Join Date: Aug 2008
Posts: 6

Hi sjm,

Wow that would be super, if you wouldn't mind sharing the data I would appreciate it very much!

And even better that you've multiplexed 4 samples/lane, as that removes any lane effects.

I think I would like to obtain RPKM values for all genes, if you are open to that. I would like to plot pooled RPKM vs pooled standard deviation for all genes, to see how variability changes for lowly expressed genes compared with highly expressed genes.

Look forward to hearing from you. Thanks!
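
For reference, a minimal sketch of the per-gene summary described above. It assumes "pooled RPKM" means the mean over all samples of a gene and "pooled standard deviation" means the usual within-group pooled SD across the two treatment groups; both readings are assumptions, and the gene names and RPKM values are made up.

```python
import math
from statistics import variance


def pooled_mean_sd(group_a, group_b):
    """Return (pooled mean, pooled SD) for one gene from two lists of RPKM values."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_mean = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    # Pooled variance: within-group sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return pooled_mean, math.sqrt(pooled_var)


if __name__ == "__main__":
    # Made-up values: gene -> (RPKMs in treatment 1, RPKMs in treatment 2), n=4 each.
    rpkm = {
        "gene_low": ([1.0, 2.0, 3.0, 2.0], [2.0, 3.0, 4.0, 3.0]),
        "gene_high": ([100.0, 120.0, 110.0, 90.0], [200.0, 220.0, 210.0, 190.0]),
    }
    for gene, (a, b) in rpkm.items():
        m, s = pooled_mean_sd(a, b)
        print(gene, round(m, 2), round(s, 2))
```

Plotting the second value against the first for every gene (ideally on log axes) gives the mean versus SD picture. Because the SD is computed within groups, a real treatment effect does not inflate it.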
12-18-2009, 01:12 PM   #5
sjm
Member
Location: St Louis, MO
Join Date: Nov 2009
Posts: 27

Great - let's work on getting you some data to play with. These are data that I am working up for publication, so if it doesn't mess up your calculations and you're OK with my data being 'anonymous', I would prefer to not send real gene names/symbols with the RPKMs. That way it won't be obvious which species, transgenes or tissues were used for this experiment. (A little paranoid, I know, but my PI would be horrified if these data were to 'leak' in an understandable form, albeit by some really remote chance...) You'll still be able to monitor variability on lowly vs highly-expressed genes.

Write back to me at s.m.a.t.k.o.v.i.AT.d=o=m=DOT=w=u=s=t=l=DOT=e=d=u and we can go from there.
02-12-2010, 11:13 AM   #6
lifeng.tian
Member
Location: Philadelphia
Join Date: Jul 2009
Posts: 16
Technical variation with RPKM calculated via TopHat/Cufflinks

Hi, sjm,


When I compare my technical replicates on an M-A plot, the TopHat/Cufflinks values show quite large variation. I've attached the M-A plot.

Do you have technical replicates in your experiment? Do you also see relatively large variation on the M-A plot with TopHat? I ask because with our own RPKM scripts we see very small variation. I would appreciate your comments/experience on this.

Thanks!

Lifeng
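
For readers unfamiliar with the plot: a minimal sketch of how M and A values are usually computed for two replicates, with M the log2 ratio and A the average log2 expression. The function name and the pseudocount (added so zero values don't break the logarithm) are illustrative assumptions, not part of TopHat/Cufflinks or of the scripts mentioned above.

```python
import math


def ma_values(x, y, pseudocount=0.5):
    """Return a list of (M, A) pairs for paired expression values x and y.

    M = log2(x) - log2(y)       (difference between the two replicates)
    A = (log2(x) + log2(y)) / 2 (average expression level)
    """
    points = []
    for a, b in zip(x, y):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        points.append((log_a - log_b, (log_a + log_b) / 2))
    return points


if __name__ == "__main__":
    # Made-up expression values for three genes in two technical replicates.
    rep1 = [0.0, 15.0, 900.0]
    rep2 = [1.0, 12.0, 950.0]
    for m, a in ma_values(rep1, rep2):
        print(round(m, 3), round(a, 3))
```

For good technical replicates the points should scatter around M = 0, with the spread narrowing as A increases.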



Attached image: MA_tophat.jpg (11.7 KB)
03-16-2010, 03:08 AM   #7
blackgore
Member
Location: UK
Join Date: Sep 2009
Posts: 20

I also have replicates for some RNA-Seq data that I'd like to group together for the purposes of a differential expression test. However, in the Cufflinks manual I've only been able to find information on running "Lane vs Lane" type comparisons rather than "Group vs Group".

Can you please describe how to use TopHat and Cufflinks when replicates are involved?
03-16-2010, 04:04 AM   #8
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

Hi

Typically, the noise between technical replicates is barely above the shot noise level (i.e., the noise predicted by the Poisson distribution), while the noise between biological replicates is much larger. This is what Nagalakshmi et al. already showed in their 2008 Science paper. Mortazavi et al. (Nature Methods, 2008) also observed only shot noise between technical replicates, so I suppose it is safe to assume that any noise between technical replicates that significantly exceeds shot noise points to a problem in library preparation.

However, you won't be able to see this from a Cufflinks-derived MA plot like the one Lifeng Tian has shown, because (I assume) the A axis is FPKM-scaled. To compare with the shot noise level, you should look at raw counts.

Our "DESeq" package allows you to estimate the variance from raw counts and compare it with shot noise levels: http://www-huber.embl.de/users/anders/DESeq/

For more on the maths behind this, see our paper, which I've now made available as a preprint: http://precedings.nature.com/documents/4282/version/1

Cheers
Simon
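
To make the comparison with shot noise concrete, here is a minimal per-gene sketch. It is not DESeq's method, which pools information across genes; it only illustrates the idea that Poisson noise has variance equal to the mean. It assumes raw counts from replicates sequenced to the same depth (otherwise the counts need size-factor scaling first). The function name and the counts are made up.

```python
from statistics import mean, variance


def excess_over_shot_noise(counts):
    """Compare the variance of replicate raw counts for one gene with the Poisson expectation.

    Returns (ratio, extra_cv2):
      ratio     = variance / mean; about 1 if the replicates show shot noise only
      extra_cv2 = (variance - mean) / mean**2; squared coefficient of variation
                  beyond shot noise, floored at 0
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 in the denominator)
    return v / m, max(0.0, (v - m) / m ** 2)


if __name__ == "__main__":
    # Made-up raw counts for one gene in four replicates.
    print(excess_over_shot_noise([95, 103, 110, 92]))
    print(excess_over_shot_noise([60, 140, 95, 105]))
```

With only a handful of replicates the per-gene estimate is very noisy, so in practice one looks at the trend over many genes as a function of the mean count.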
03-16-2010, 04:20 AM   #9
Simon Anders
Senior Member
Location: Heidelberg, Germany
Join Date: Feb 2010
Posts: 994

Another point concerning replicates. As they are expensive I recommend you keep the following points in mind:

Given that technical replicates vary at shot noise level, making two technical replicates is the same as sequencing a single sample twice as deep. Additional biological replicates, in contrast, give you not only more counts but also inform you about the variability between samples.

You need at least one pair of biological replicates to get any idea of how strongly your data vary from one sample to the next. Otherwise, you have no way of knowing whether the observed difference between your samples of different conditions is due to the change in experimental condition, or whether a difference of the same magnitude would have been observed as well between two different samples under the same condition. This is the very reason why one needs replicates at all, and why it is flawed to just assume the variance to be as predicted by the Poisson distribution rather than to estimate it from biological replicates. (DEGSeq, for example, falls for this flaw.)

If you now compare biological replicates, you may or may not find that the variance is above shot noise level. (See e.g. Figure 8 in our preprint that I referred to above, which illustrates this for the Nagalakshmi data.) If the biological variance is above shot noise level, sequencing deeper won't help, as it only reduces shot noise while you remain limited by biological variance. On the other hand, if the variance between biological replicates does not exceed the shot noise level significantly, you are limited by shot noise, i.e., further biological replicates will not help any more than sequencing the existing samples deeper (i.e., filling more lanes).

Hence, the comparison with shot noise is vital to answer the question how many replicates are needed.

A question orthogonal to this is whether you have enough replicates to average away the effects of covariates that you cannot control. (See this thread for a discussion of this issue.)

Cheers
Simon
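
The trade-off between depth and replicates can be sketched numerically. The sketch assumes a negative-binomial-style noise model in which the squared coefficient of variation of one sample's count is 1/mean (shot noise) plus a biological term, and that averaging over n replicates divides this by n. The function name and all numbers are illustrative assumptions.

```python
def cv2_of_group_mean(mean_count, bio_cv2, n_reps, depth_factor=1.0):
    """Approximate squared CV of a condition's mean expression estimate.

    mean_count:   expected raw count for the gene at the current depth
    bio_cv2:      squared coefficient of biological variation (0 = shot noise only)
    n_reps:       number of biological replicates
    depth_factor: multiplier on sequencing depth per sample
    """
    shot_cv2 = 1.0 / (mean_count * depth_factor)  # Poisson term, shrinks with depth
    return (shot_cv2 + bio_cv2) / n_reps          # both terms shrink with replicates


if __name__ == "__main__":
    scenarios = {
        "shot noise only": (10, 0.0),
        "biological noise dominates": (1000, 0.04),
    }
    for label, (mean_count, bio_cv2) in scenarios.items():
        base = cv2_of_group_mean(mean_count, bio_cv2, n_reps=2)
        deeper = cv2_of_group_mean(mean_count, bio_cv2, n_reps=2, depth_factor=2.0)
        more_reps = cv2_of_group_mean(mean_count, bio_cv2, n_reps=4)
        print(label, base, deeper, more_reps)
```

In the shot-noise-only case, doubling the depth and doubling the replicates give the same precision. Once the biological term dominates, extra depth barely changes anything and only more replicates help.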
05-13-2010, 07:52 AM   #10
blackgore
Member
Location: UK
Join Date: Sep 2009
Posts: 20

I have a situation where I initially have two main groups (four replicate organisms in each), so that is pretty straightforward. However I would also like to do some within-group comparisons too - different tissue types, males vs females, etc.

Even with a minimum of two replicates for each comparison... that's a lot of sequencing to do!
06-08-2010, 04:01 PM   #11
sjm
Member
Location: St Louis, MO
Join Date: Nov 2009
Posts: 27
Analysis of biological replicates (groups) via TopHat/Cufflinks

Hi,

Sorry that I haven't posted for a while. blackgore, for analysis of replicates, I did not use Tophat/Cufflinks for this part of the operation. Having produced a list of genes/transcripts and RPKM values for each sample, I imported these into MS Access (openoffice.org Base works too) and did a crosstab query to get a spreadsheet of RPKMs with genes in rows, samples in columns.

From there, calling differences between groups is up to you and your favorite stats package.
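
If you'd rather script the crosstab step than use a database, something along these lines should do the same job. It is only a sketch: it assumes the per-sample results have already been read into (gene, sample, RPKM) records, and the function name and example values are made up. Genes missing from a sample are left as None rather than guessed.

```python
from collections import defaultdict


def crosstab(records):
    """Pivot (gene, sample, rpkm) records into a table with genes in rows, samples in columns."""
    samples = sorted({sample for _, sample, _ in records})
    table = defaultdict(dict)
    for gene, sample, rpkm in records:
        table[gene][sample] = rpkm
    header = ["gene"] + samples
    # None marks a gene that was not reported for a given sample.
    rows = [[gene] + [table[gene].get(s) for s in samples] for gene in sorted(table)]
    return header, rows


if __name__ == "__main__":
    # Made-up records, one per gene per sample.
    records = [
        ("geneA", "ctrl_1", 12.1), ("geneA", "ctrl_2", 10.8),
        ("geneA", "tg_1", 25.3), ("geneA", "tg_2", 27.9),
        ("geneB", "ctrl_1", 0.4), ("geneB", "tg_1", 0.6),
    ]
    header, rows = crosstab(records)
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(value) for value in row))
```

The tab-separated output opens directly in a spreadsheet or a stats package.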

Does that help?

07-13-2012, 03:49 PM   #12
A.Presson
Junior Member
Location: UT
Join Date: Jul 2012
Posts: 1

Hi Simon,
I'm wondering why you haven't created an R function for calculating power/sample size for RNA-Seq experiments based on your negative binomial model? Seems like it would be quite popular...
All times are GMT -8.

