  • Power calculations for expt design

    Hi there
    I gather that most people aren't bothering with replication for quantitative RNA-Seq experiments, that is sequencing multiple biological replicate samples for each treatment under investigation. Of course it makes the expt ridiculously expensive! But I think it's really important. A very patient statistician is helping me with design of a digital gene expression profiling experiment (RNA-Seq - either SOLiD or Illumina, haven't decided yet). The design includes 2 treatments, a number of biological reps for each treatment, and the aim is to detect differentially expressed genes between the 2 treatments.

    I'd like to do some power calculations to determine the minimum number of reps for each treatment I can get away with, with and without use of sample multiplexing (i.e. multiplexing replicate samples in the same lane). For these calculations, I need an estimate of the between-sample variability of the final data, which I could get from an existing data set which uses this design. I'm having trouble finding one...
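
    As a rough illustration of the kind of calculation meant here, the sketch below uses the textbook normal-approximation formula for a two-group comparison on the log2 scale, n = 2 (z_(1-alpha/2) + z_power)^2 sigma^2 / delta^2. The function name, the between-sample SD `sigma`, the target log2 fold change `delta` and the defaults are all placeholder assumptions, not values from any real data set. It treats a single gene and ignores multiple testing and count-based modelling, so it is a starting point rather than a full design tool.

```python
import math
from statistics import NormalDist


def reps_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Replicates per treatment for a two-sided, two-group comparison.

    sigma: assumed between-sample SD of log2 expression (same in both groups)
    delta: log2 fold change you want to be able to detect
    """
    z = NormalDist()                      # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)    # two-sided significance threshold
    z_beta = z.inv_cdf(power)             # quantile for the desired power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical SDs, to see how the answer scales with variability
    for sigma in (0.25, 0.5, 1.0):
        print(sigma, reps_per_group(sigma, delta=1.0))
```

    The required n grows with the square of sigma, which is why an estimate of between-sample variability matters so much for the design.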

    Can anyone help, either by providing a data set which uses biological replication, or providing a between-lane standard deviation (from normalised data) from such an expt, or simply by shedding light on variability between reps which one might normally expect to see in Illumina or SOLiD RNA-Seq data? I know it depends on the biological variability between samples, but I figure any information is better than none.

    Thanks
    Anar

  • #2
    Well, maybe everyone isn't using biological replicates, but I certainly am in experiments involving RNASeq of nontransgenic and transgenic mouse tissues...!
    I could send you RPKM expression values (calculated via Tophat/Cufflinks) for n=4 replicates, 2 treatment groups, for a subset of detected transcripts or for everything we found. Post back if you're interested and we can figure out a way to send data (I don't have a good ftp system here, so probably e-mail and compressed files will be the go).


    • #3
      By the way, you may also be interested to know that we've multiplexed 4 samples per Illumina GAII lane (i.e. barcoding system), but haven't tried examining single samples per lane.


      • #4
        Hi sjm,

        Wow that would be super, if you wouldn't mind sharing the data I would appreciate it very much!

        And even better that you've multiplexed 4 samples/lane, as that removes any lane effects.

        I think I would like to obtain RPKM values for all genes, if you are open to that. I would like to plot pooled RPKM vs pooled standard deviation for all genes, to see how variability changes for lowly expressed genes compared with highly expressed genes.
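
        A minimal sketch of the per-gene summary described above, assuming the RPKM values are already grouped by treatment. The gene names and numbers are made up for illustration; the resulting (mean, SD) pairs are what one would put on the two axes of the plot.

```python
import math
from statistics import mean, variance


def pooled_mean_sd(group_a, group_b):
    """Return (overall mean RPKM, pooled within-group SD) for one gene."""
    n_a, n_b = len(group_a), len(group_b)
    overall_mean = mean(list(group_a) + list(group_b))
    # Pool the two within-group sample variances, weighted by degrees of freedom
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return overall_mean, math.sqrt(pooled_var)


# Hypothetical RPKM values: gene -> (control replicates, treated replicates)
rpkm = {
    "gene_low": ([1.0, 2.0, 3.0, 2.0], [2.0, 1.0, 3.0, 2.0]),
    "gene_high": ([100.0, 120.0, 110.0, 90.0], [200.0, 220.0, 180.0, 200.0]),
}

for gene, (ctrl, trt) in rpkm.items():
    m, sd = pooled_mean_sd(ctrl, trt)
    print(f"{gene}\tmean={m:.2f}\tsd={sd:.2f}\tcv={sd / m:.2f}")
```

        Pooling within groups keeps a genuine treatment effect from inflating the SD estimate.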

        Look forward to hearing from you. Thanks!


        • #5
          Great - let's work on getting you some data to play with. These are data that I am working up for publication, so if it doesn't mess up your calculations and you're OK with my data being 'anonymous', I would prefer to not send real gene names/symbols with the RPKMs. That way it won't be obvious which species, transgenes or tissues were used for this experiment. (A little paranoid, I know, but my PI would be horrified if these data were to 'leak' in an understandable form, albeit by some really remote chance...) You'll still be able to monitor variability on lowly vs highly-expressed genes.

          Write back to me at s.m.a.t.k.o.v.i.AT.d=o=m=DOT=w=u=s=t=l=DOT=e=d=u and we can go from there.


          • #6
            Technical variation with RPKM calculated via TopHat/Cufflinks

            Hi, sjm,


            When I compare my technical replicate data on an M-A plot, TopHat/Cufflinks yields quite large variation. I've attached the M-A plot.

            Do you have technical replicates in your experiment? Is there relatively large variation on the M-A plot with TopHat? I ask because with our own RPKM scripts we see very small variation. I would appreciate your comments/experience on this.

            Thanks!

            Lifeng





            • #7
              I also have replicates for some RNA-Seq data that I'd like to group together, for the purposes of a differential expression test. However, in the Cufflinks manual I've only been able to find information on running "Lane vs Lane" type comparisons rather than "Group vs Group".

              Can you please describe how to use TopHat and Cufflinks when replicates are involved?


              • #8
                Hi

                Typically, the noise between technical replicates is barely above the shot noise level (i.e., the noise predicted by the Poisson distribution), while the noise between biological replicates is much larger. This is what Nagalakshmi et al. showed in their 2008 Science paper. Mortazavi et al. (Nature Methods, 2008) also observed only shot noise between technical replicates, so I suppose it is safe to assume that any noise between technical replicates that significantly exceeds shot noise points to a problem in library preparation.

                However, you won't be able to see this from a Cufflinks-derived MA-plot like the one Lifeng Tian has shown because (I assume) the A axis is FPKM-scaled. To compare with the shot noise level, you should look at raw counts.
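
                To make the comparison with shot noise concrete: for Poisson-distributed counts the variance equals the mean, so the variance-to-mean ratio of a gene's raw counts should sit near 1. The sketch below is a simplified illustration, not how DESeq does it. It assumes all libraries were sequenced to the same depth (otherwise the counts need scaling first), and the counts are invented. With only a handful of replicates the ratio for a single gene is very noisy, so in practice one would look at the trend across many genes.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by mean for one gene's raw counts.

    A value around 1 is what Poisson shot noise predicts; values much
    larger than 1 indicate extra variability on top of shot noise.
    """
    m = mean(counts)
    if m == 0:
        return float("nan")  # undefined for a gene with no reads
    return variance(counts) / m


# Hypothetical raw counts for one gene across four equally deep libraries
technical = [98, 105, 92, 110]     # same library re-sequenced
biological = [60, 150, 95, 210]    # different animals

print("technical:", round(dispersion_index(technical), 2))
print("biological:", round(dispersion_index(biological), 2))
```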

                Our "DESeq" package lets you estimate the variance from raw counts and compare it with shot noise levels: http://www-huber.embl.de/users/anders/DESeq/

                For more on the maths behind this, see our paper, which I've now made available as a preprint: http://precedings.nature.com/documents/4282/version/1

                Cheers
                Simon


                • #9
                  Another point concerning replicates. As they are expensive I recommend you keep the following points in mind:

                  Given that technical replicates vary at shot noise level, making two technical replicates is the same as sequencing a single sample twice as deep. Additional biological replicates, by contrast, give you not only more counts but also information on the variability between samples.

                  You need at least one pair of biological replicates to get any idea of how strongly your data vary from one sample to the next. Otherwise, you have no way of knowing whether the observed difference between your samples of different conditions is due to the change in experimental condition, or whether a difference of the same magnitude would also have been observed between two different samples under the same condition. This is the very reason why one needs replicates at all, and why it is flawed to just assume the variance to be as predicted by the Poisson distribution rather than to estimate it from biological replicates. (DEGSeq, for example, suffers from this flaw.)

                  If you now compare biological replicates, you may or may not find that the variance is above shot noise level. (See e.g. Figure 8 in our preprint that I referred to above, which illustrates this for the Nagalakshmi data.) If the biological variance is above shot noise level, sequencing deeper won't help, because it only reduces shot noise while you are limited by biological variance. On the other hand, if the variance between biological replicates does not exceed the shot noise level significantly, you are limited by shot noise, i.e., further biological replicates will not help any more than sequencing the existing samples deeper (i.e., filling more lanes).

                  Hence, the comparison with shot noise is vital to answer the question how many replicates are needed.
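
                  The trade-off between depth and replicates can be sketched numerically. The snippet below assumes a per-sample variance of mu + alpha * mu^2, i.e. shot noise plus a biological term with a constant squared coefficient of variation `alpha`. That constant-alpha form, the function name and the numbers are simplifying assumptions for illustration only. Under it, the squared CV of a group mean is (1/mu + alpha) / n.

```python
def cv2_of_group_mean(mu, alpha, n_reps):
    """Squared coefficient of variation of a group's mean count.

    mu:     expected count per sample at the current sequencing depth
    alpha:  assumed squared biological CV (0 means pure Poisson shot noise)
    n_reps: number of biological replicates in the group
    Assumes per-sample variance = mu + alpha * mu**2.
    """
    return (1.0 / mu + alpha) / n_reps


mu, n = 100, 2
for alpha in (0.0, 0.1):
    base = cv2_of_group_mean(mu, alpha, n)
    deeper = cv2_of_group_mean(2 * mu, alpha, n)     # double the depth
    more_reps = cv2_of_group_mean(mu, alpha, 2 * n)  # double the replicates
    print(f"alpha={alpha}: base={base:.4f} deeper={deeper:.4f} more_reps={more_reps:.4f}")
```

                  With alpha = 0 the two options give the same precision, which is the "limited by shot noise" case. With alpha well above 1/mu, doubling the depth barely changes the result while doubling the replicates halves it.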

                  A question orthogonal to this is whether you have enough replicates to average away the effects of covariates that you cannot control. (See this thread for a discussion of this issue.)

                  Cheers
                  Simon


                  • #10
                    I have a situation where I initially have two main groups (four replicate organisms in each), so that is pretty straightforward. However I would also like to do some within-group comparisons too - different tissue types, males vs females, etc.

                    Even with a minimum of two replicates for each comparison... that's a lot of sequencing to do!


                    • #11
                      analysis of biological replicates (groups) via Tophat/Cufflinks

                      Hi,

                      Sorry that I haven't posted for a while. blackgore, for analysis of replicates, I did not use Tophat/Cufflinks for this part of the operation. Having produced a list of genes/transcripts and RPKM values for each sample, I imported these into MS Access (openoffice.org Base works too) and did a crosstab query to get a spreadsheet of RPKMs with genes in rows, samples in columns.
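
                      For anyone without a database program to hand, the same crosstab step can be sketched in a few lines of Python. The record layout (sample, gene, RPKM) and the sample and gene names are assumptions for illustration; real Cufflinks output would need to be parsed into this shape first.

```python
from collections import defaultdict


def crosstab(records):
    """Pivot (sample, gene, rpkm) records into a genes-by-samples table.

    Returns (header, rows); genes missing from a sample get None.
    """
    samples = sorted({sample for sample, _, _ in records})
    table = defaultdict(dict)
    for sample, gene, rpkm in records:
        table[gene][sample] = rpkm
    rows = [[gene] + [table[gene].get(s) for s in samples] for gene in sorted(table)]
    return ["gene"] + samples, rows


# Hypothetical per-sample records
records = [
    ("ctrl_1", "geneA", 10.2), ("ctrl_1", "geneB", 0.5),
    ("trt_1", "geneA", 22.1), ("trt_1", "geneB", 0.4),
]

header, rows = crosstab(records)
print("\t".join(header))
for row in rows:
    print("\t".join("" if v is None else str(v) for v in row))
```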

                      From there, calling differences between groups is up to you and your favorite stats package.

                      Does that help?



                      • #12
                        Hi Simon,
                        I'm wondering why you haven't created an R function for calculating power/sample size for RNA-Seq experiments based on your negative binomial model? Seems like it would be quite popular...
