  • Normalization in single cell RNA-seq data

    Hi All,

    I've been testing various differential expression analyses on my single-cell RNA-Seq data either using FPKM (generated by Cufflinks, used with Fluidigm Singular Analysis Toolset or Monocle) or read counts (DESeq) but I've gotten very different readouts from the different programs. I know based on the genes that are coming out that the FPKM is more likely to be correct; however I've seen evidence of 3' bias in my own data and am wary of using FPKM since most people doing single cell RNA-Seq have demonstrated it to be problematic.

    So, I'm really keen to try a non-FPKM approach, but I'm not sure how much I should or should not be manipulating the data.

    Most of the normalization advice focuses on studying heterogeneity within a population of cells. Brennecke et al. (Nature Methods, 2013) offer a great DESeq-based way to normalize to spike-ins and model technical variability in order to find highly variable genes within a seemingly homogeneous population. Buettner et al. (Nature Biotechnology, 2015) have a great follow-up looking at cell cycle variation.

    While this does interest me eventually, for now I just want to know the differential expression between two different cell populations. However, because this is single-cell data, it's highly variable. Does this matter? Can I consider each cell a "replicate"? Since the data are so variable, would the statistically significant genes that do come out be quite robust? Or should I be normalizing these populations to my spike-ins? (Please note that I only have 3 spike-in controls, not 92 like the majority of published papers out there.)

    Anybody else have any experience with this?

    Thanks!

  • #2
    If you're interested in differential expression between two cell populations, then a straightforward DESeq-like comparison using cells as replicates seems appropriate. You're right that there will be more variability and that your hits will tend to be robust. An alternative is the 'scde' package.

    The issue of normalizing to spike-in is a separate question, and generally good if available.
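
For readers wondering what the normalization step of a DESeq-like comparison does, here is a minimal sketch of median-of-ratios size factors in plain Python. It is not the DESeq implementation; the function name `size_factors` and the toy count matrix are made up for illustration. One caveat for single-cell data: the sketch only uses genes detected in every cell, and with many dropouts there may be very few such genes.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts per cell (genes x cells).
    Returns one size factor per cell.
    """
    n_cells = len(counts[0])
    # Only genes with a nonzero count in every cell have a defined
    # geometric mean, so only those are used for the estimate.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene is detected in every cell")
    # Log of the per-gene geometric mean across cells.
    log_geo_means = [sum(math.log(c) for c in row) / n_cells for row in usable]
    factors = []
    for j in range(n_cells):
        # Ratio of this cell's count to the gene's geometric mean, in log space.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (hypothetical): 4 genes x 3 cells; cell 2 was sequenced 4x deeper than cell 0.
counts = [
    [10, 20, 40],
    [5, 10, 20],
    [0, 3, 1],       # dropout in cell 0, so this gene is skipped for the estimate
    [100, 200, 400],
]
sf = size_factors(counts)
# Divide every count by its cell's size factor.
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
print(sf)
```

After scaling, the three genes that differ only by sequencing depth have equal values in every cell, which is the behaviour you want before testing cells as replicates.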

    • #3
      Thank you for your input!

      Sometimes it's just reassuring to have someone else confirm that what you are doing isn't completely off base. I've checked out the scde package like you suggested and will use it as a comparison to make sure the same genes are coming through.

      • #4
        Hi all,
        I'm new to the world of RNAseq analysis. I'm doing single cell RNAseq as well, but I would like to do differential expression analysis within a population of cells (as opposed to between 2 different populations), to assess the level of heterogeneity in the population. I am planning to use the tophat/cufflinks/monocle pipeline, but would also like to use a raw count method to verify my hits.

        I have 2 questions:

        1) Can I accomplish this with the SCDE package, or is this package only good for testing DE between 2 groups? If I can use SCDE, what will the output look like?

        2) I've read that DESeq can be used for single cell data. I'd appreciate a description of how this works and what the output from this would look like (i.e. would I be able to get a table of p values with rows of genes and columns of cells, or something along those lines?).

        Thanks in advance!

        • #5
          Originally posted by fanli
          The issue of normalizing to spike-in is a separate question, and generally good if available.
          I'd be interested to hear more about your experience doing this. I have not heard many 'good' stories about normalizing to spike-ins in this space.

          • #6
            Originally posted by jparsons
            I'd be interested to hear more about your experience doing this. I have not heard many 'good' stories about normalizing to spike-ins in this space.
            Are there many alternatives for single-cell data? The traditional methods (TMM, quantile normalization, etc.) aren't appropriate, so I thought spike-ins were largely the only game in town. I fully expect that we're going to start producing single-cell sequencing data in the next 3-6 months and would love to hear about better ways.

            • #7
              Normalization in single cell RNA-seq still seems to be a topic up for debate. The 92 ERCC spike-ins seem to be the gold standard for now, and a lot of the big groups who are advancing the single cell RNA-seq field seem to rely mostly on these. They use them to test biological and technical variation, to normalize, and to find the heterogeneity of the cells underneath the huge noise that is inevitable in single cell RNA-seq. Those who have their own biostatisticians on hand do it themselves, but Brennecke et al. (Nature Methods) published an R package that is available to everyone who is less advanced in mathematics and programming. This is what I've been using.

              However, here's the big problem we're facing. The standard C1 Fluidigm protocol recommends only 3 Ambion Spike-Ins and because we were following the protocol exactly, this is what we did. The normalization methods for 92 spike-ins don't necessarily apply very well because we don't have enough data points. When I brought this up at a meeting with one of the people who is involved in developing the bioinformatics of single cell, they were surprised we used the Ambion spikes and told us to simply take them out of our dataset altogether, on the assumption that the number of spikes would be the same between each sample. However, we see varying numbers of spikes between samples (for various reasons, some explained, some not) and so I'm still torn between normalizing to spikes vs using traditional routes. However, when I normalize to even my three spikes, the data appears to be a bit "cleaner" when doing comparative analyses.
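
To make "normalizing to spikes" concrete, here is a minimal sketch of spike-in size factors in plain Python. It assumes the same amount of spike-in RNA went into every cell, which is exactly the assumption in question when spike counts vary between samples for unexplained reasons. The function name and the numbers are hypothetical, and with only three spike-ins the estimate will be noisy.

```python
def spike_in_size_factors(spike_counts):
    """Size factor per cell from total spike-in reads.

    spike_counts: list of spike-ins, each a list of read counts per cell.
    Assumes an identical amount of spike-in RNA was added to every cell.
    """
    n_cells = len(spike_counts[0])
    totals = [sum(row[j] for row in spike_counts) for j in range(n_cells)]
    if any(t == 0 for t in totals):
        raise ValueError("a cell has zero spike-in reads and cannot be scaled")
    mean_total = sum(totals) / n_cells
    # A cell with twice the average spike-in total gets a factor of 2.
    return [t / mean_total for t in totals]


# Toy data (hypothetical): 3 spike-ins x 3 cells.
spikes = [
    [50, 100, 150],
    [30, 60, 90],
    [20, 40, 60],
]
factors = spike_in_size_factors(spikes)

# Scale one endogenous gene by the spike-in factors.
gene = [10, 20, 30]
scaled = [c / f for c, f in zip(gene, factors)]
print(factors, scaled)
```

Unlike library-size scaling, this leaves real differences in total RNA content between cells intact, because the factors come only from the spike-ins.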

              So, if there's anyone out there who hasn't started yet, I highly recommend using the ERCC spike-ins and not the Ambion as recommended by Fluidigm. I know there is a paper currently under review that will hopefully come out in the next few months that extensively deals with the ERCC spike-ins and may hopefully shed some light on this topic.


              In response to amolinaro, differential expression analysis is used for looking at two populations of cells. In SCDE you will have to define those groups (eg treated vs untreated) before it will calculate the data, just like in DESeq. However, if you have one sample that you have taken from a mixed population or perhaps stimulated, then you might want to use the same method as the Brennecke paper I mentioned, or, if your cells are dividing, then check out scLVM by Buettner et al Nature Biotechnology 2015. These methods are specific for looking at highly variable genes within a population of cells. They can also find "new" populations within your group of cells as defined by similar gene expression etc.
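
As a rough illustration of what "highly variable genes within a population" means, the sketch below ranks genes by their squared coefficient of variation (CV²) across cells. This is only the summary statistic; it is not the Brennecke et al. test, which additionally compares each gene against a technical-noise trend fitted on the spike-ins. The gene names and values are made up, and the counts are assumed to be already normalized.

```python
from statistics import mean, variance


def cv_squared(values):
    """Squared coefficient of variation: sample variance / mean^2."""
    m = mean(values)
    if m == 0:
        return float("nan")
    return variance(values) / (m * m)


# Toy normalized counts (hypothetical): gene -> expression in 4 cells.
expression = {
    "geneA": [10.0, 10.0, 10.0, 10.0],   # flat across cells
    "geneB": [0.0, 0.0, 40.0, 0.0],      # on in one cell only
    "geneC": [8.0, 12.0, 9.0, 11.0],     # mild variation
}

# All three genes have the same mean (10), so CV^2 is directly comparable here.
ranked = sorted(
    ((gene, cv_squared(vals)) for gene, vals in expression.items()),
    key=lambda item: item[1],
    reverse=True,
)
for gene, value in ranked:
    print(gene, round(value, 4))
```

In real data the mean differs between genes and low-expressed genes have a high CV² from technical noise alone, which is why the published methods compare against a spike-in-derived baseline instead of using a fixed cutoff.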

              • #8
                Nice summary travelk!

                Yeah, the Ambion spike-ins seem to be of questionable utility. I suppose one could measure variability with them, but not much else. We've been using ERCC spike-ins in our initial dataset to hopefully add a bit of robustness to things (not that the ERCC spike-ins are perfect).

                • #9
                  I think the original purpose of the Ambion spike-ins was purely as a control to ensure that the lysis buffer was getting to all the wells in the C1 chip and that the RT was working efficiently for each cell (which isn't always the case so the spike-ins have been invaluable to us in that way). I don't think they were intended to be used as a normalization tool, but since they are there, it's tempting to use them. They are simply an artificial, theoretically controlled housekeeping gene in a way.

                  Yes, the ERCC spike-ins aren't perfect, but I think they do give a lot of information about the variability of the method in general and specifically in each data set. It's much better to have them and not need them than the other way around (which is what happened with us). I think a lot of new data and bioinformatics methods are going to be coming out in the next year or two and having the right tools available in your data now will allow you to access those methods in the future.

                  • #10
                    Hi everyone,
                    I would be very grateful if anyone could give me some suggestions in our single-cell RNA seq data analysis part.

                    We have 2 groups of single cells (one of normal cells and one of disease cells) on which we performed single-cell RNA sequencing. Our libraries were made using the SMART-SEQ2 protocol and are single-end. We have around 4 million reads per cell.

                    Now, using differential gene expression analysis, we want to find significant genes that are upregulated or downregulated in the disease group relative to the normal group.
                    So, which normalization technique would you recommend? Our bioinformatician uses TMM to normalize raw counts and applies the R package Monocle to perform DE.
                    He believes that if we use RPKM, we will get many false positive genes, since we are not comparing genes within one sample but across different samples. Do you think that is right?

                    Many thanks in advance.

                    • #11
                      It is known that TMM normalization factors do not take library sizes into account, so I think it will be problematic if your library sizes vary widely.

                      Gary

                      • #12
                        Originally posted by kobeho24
                        It is known that TMM normalization factors do not take library sizes into account, so I think it will be problematic if your library sizes vary widely.

                        Gary
                        Most analysis reviews released in the past year strongly discourage normalizing scRNA data based on library size (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4823857/ http://www.nature.com/nrg/journal/v1...or-information ). Transcript abundance is highly variable across cells, even within homogeneous populations, and by normalizing based on library size you assume that the initial abundances are the same.

                        Your goal in scRNAseq should be to use a measurement that approximates absolute transcript counts for the cells, ideally with a UMI approach; however, normalizing with spike-ins is also a solid alternative.
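
To show why UMIs get closer to absolute transcript counts, here is a minimal sketch that collapses reads into molecule counts: reads sharing the same cell, gene and UMI are treated as PCR copies of one original molecule. The input format and names are hypothetical, and the sketch does no UMI error correction, which real pipelines typically add.

```python
from collections import defaultdict


def umi_counts(reads):
    """Count distinct UMIs per (cell, gene).

    reads: iterable of (cell_barcode, gene, umi) tuples, one per aligned read.
    Returns a dict mapping (cell_barcode, gene) to the number of unique UMIs.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy reads (hypothetical): four reads for GeneX in cellA, but only two molecules.
reads = [
    ("cellA", "GeneX", "AAAC"),
    ("cellA", "GeneX", "AAAC"),  # PCR duplicate
    ("cellA", "GeneX", "AAAC"),  # PCR duplicate
    ("cellA", "GeneX", "AAGT"),
    ("cellB", "GeneX", "AAAC"),  # same UMI in another cell is a different molecule
]
molecules = umi_counts(reads)
print(molecules)
```

The read count for GeneX in cellA is 4 but the molecule count is 2, so uneven amplification of one molecule no longer inflates the estimate.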

                        Otherwise there's not much of an obvious answer for normalizing C1 data. It's hard to account for the non-linear distortion from amplification in C1 data without spike-ins or UMIs, as there are over 20 PCR cycles involved in library generation. Without either, I would try both FPKM and TPM and see how DE looks with each of them.
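
For anyone comparing the two, here is a minimal sketch of how FPKM and TPM are computed from the same counts and gene lengths. The function names and toy numbers are made up, and the total is taken as the sum of the counts passed in, i.e. it assumes every mapped fragment is assigned to one of the listed genes. Note that both measures scale by sequencing depth, so the library-size caveat above still applies to them.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Toy data (hypothetical): counts proportional to gene length,
# so all three genes have the same per-base coverage.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(fpkm_values)
print(tpm_values, sum(tpm_values))
```

TPM values always sum to one million within a cell, which makes proportions comparable across cells; FPKM values do not have a fixed sum.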

                        • #13
                          I disagree with the point that normalizing the data by library size assumes that the initial abundances are the same. I think it's just a matter of library quantity rather than the actual initial abundance of RNA molecules, since you cannot guarantee that every individual cell's library will be sequenced to the same number of reads. Indeed, normalizing scRNA-seq data is still a challenge for isoform or full-length transcript analysis, but it is much better for 5'/3' sequencing with UMIs and spike-ins. I have always wondered whether anybody has a reasonably solid pipeline for scRNA-seq analysis at the isoform level. Thanks in advance!

                          Gary

                          • #14
                            Does anyone have a workflow yet for scRNA-seq that allows RT barcoding and UMI labeling when sequencing whole-transcript RNA? If so, what kit and analysis tools do you use?

                            • #15
                              Originally posted by seqgirl123
                              Does anyone have a workflow yet for scRNA that allows RT barcoding and UMI labeling on sequencing whole transcript RNA? If so, what kit and analysis tools do you use?
                              Frankly speaking, due to the short read length of NGS technology, there isn't such a kit or workflow. And it's not practical to do it on a 3rd-generation sequencing platform, since the throughput is still too low and not appropriate for any quantitative analysis.

                              Gary
