SEQanswers

Old 03-12-2015, 05:45 AM   #1
travelk
Member
 
Location: France

Join Date: Jul 2013
Posts: 20
Normalization in single cell RNA-seq data

Hi All,

I've been testing various differential expression analyses on my single-cell RNA-Seq data either using FPKM (generated by Cufflinks, used with Fluidigm Singular Analysis Toolset or Monocle) or read counts (DESeq) but I've gotten very different readouts from the different programs. I know based on the genes that are coming out that the FPKM is more likely to be correct; however I've seen evidence of 3' bias in my own data and am wary of using FPKM since most people doing single cell RNA-Seq have demonstrated it to be problematic.

So, I'm really keen to try a non-FPKM approach, but I'm not sure how much I should or should not be manipulating the data.

Most of the normalization advice focuses on studying heterogeneity within a population of cells. Brennecke et al (Nature Methods 2013) offer a great DESeq-based way to normalize to spikes and technical variability in order to find highly variable genes within a seemingly homogeneous population. Buettner et al (Nature Biotechnology, 2015) have a great follow-up looking at cell cycle variation.

While this does interest me eventually, I also just want to know the differential expression between two different cell populations. However, because this is single-cell data, it's highly variable. Does this matter? Can I consider each cell a "replicate"? Since the data are so variable, would the statistically significant genes that do come out be quite robust? Or should I be normalizing these populations to my spike-ins? (Although, please note I only have 3 spike-in controls, not 92 like the majority of other published papers out there.)

Anybody else have any experience with this?

Thanks!
Old 03-12-2015, 09:46 AM   #2
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 197

If you're interested in differential expression between two cell populations, then a straightforward DESeq-like comparison using cells as replicates seems appropriate. You're right that there will be more variability and that your hits will tend to be robust. An alternative is the 'scde' package:
http://pklab.med.harvard.edu/scde/index.html


The issue of normalizing to spike-in is a separate question, and generally good if available.
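
For anyone wondering what a "DESeq-like" normalization does with cells as replicates, here is a minimal sketch of median-of-ratios size factors in Python. It is an illustration under stated assumptions, not the DESeq code itself: the function name and the toy matrix are made up, and genes with a zero in any cell are simply skipped, which can leave very few genes in sparse single-cell data.

```python
import numpy as np

def median_ratio_size_factors(counts):
    """counts: genes x cells array of raw read counts (hypothetical input)."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes detected in every cell, so the log geometric mean is finite.
    expressed = np.all(counts > 0, axis=1)
    if not expressed.any():
        raise ValueError("no gene is detected in every cell")
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor per cell: median ratio of its counts to the per-gene geometric mean.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data: three cells sequenced at increasing depth, one gene with a dropout.
counts = np.array([[10, 20, 40],
                   [5, 10, 20],
                   [0, 3, 1],
                   [100, 200, 400]])
factors = median_ratio_size_factors(counts)
normalized = counts / factors  # each cell's column is divided by its own factor
```

Dividing each cell by its factor puts the cells on a common scale before any per-gene test is run between the two populations.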
Old 03-24-2015, 02:42 AM   #3
travelk
Member
 
Location: France

Join Date: Jul 2013
Posts: 20

Thank you for your input!

Sometimes it's just reassuring to have someone else confirm that what you are doing isn't completely off base. I've checked out the scde package like you suggested and will use it as a comparison to make sure the same genes are coming through.
Old 06-23-2015, 11:46 AM   #4
amolinaro
Junior Member
 
Location: Toronto, Canada

Join Date: Jun 2015
Posts: 3

Hi all,
I'm new to the world of RNAseq analysis. I'm doing single cell RNAseq as well, but I would like to do differential expression analysis within a population of cells (as opposed to between 2 different populations), to assess the level of heterogeneity in the population. I am planning to use the tophat/cufflinks/monocle pipeline, but would also like to use a raw count method to verify my hits.

I have 2 questions:

1) Can I accomplish this with the SCDE package, or is this package only good for testing DE between 2 groups? If I can use SCDE, what will the output look like?

2) I've read that DESeq can be used for single cell data. I'd appreciate a description of how this works and what the output would look like (i.e. would I be able to get a table of p values with rows of genes and columns of cells, or something along those lines?).

Thanks in advance!
Old 06-29-2015, 12:30 PM   #5
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62

Quote:
Originally Posted by fanli View Post
The issue of normalizing to spike-in is a separate question, and generally good if available.
I'd be interested to hear more about your experience doing this. I have not heard many 'good' stories about normalizing to spike-ins in this space.
Old 06-29-2015, 10:54 PM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,478

Quote:
Originally Posted by jparsons View Post
I'd be interested to hear more about your experience doing this. I have not heard many 'good' stories about normalizing to spike-ins in this space.
Are there many alternatives for single-cell data? The traditional methods (TMM, quantile normalization, etc.) aren't appropriate, so I thought spike-ins were largely the only game in town. I fully expect that we're going to start producing single-cell sequencing data in the next 3-6 months and would love to hear about better ways.
Old 06-30-2015, 04:56 AM   #7
travelk
Member
 
Location: France

Join Date: Jul 2013
Posts: 20

The issue of normalization in single cell RNA-seq still seems to be up for debate. The 92 ERCC spike-ins seem to be the gold standard for now, and a lot of the big groups who are advancing the single cell RNA-seq field seem to rely mostly on these. They use them to test biological and technical variation, to normalize, and to find the heterogeneity of the cells underneath the huge noise that is inevitable in single cell RNA-seq. Those who have their own biostatisticians on hand do it themselves, but Brennecke et al (Nature Methods) published an R package that is available to everyone who is less advanced in mathematics and programming. This is what I've been using.

However, here's the big problem we're facing. The standard C1 Fluidigm protocol recommends only 3 Ambion Spike-Ins and because we were following the protocol exactly, this is what we did. The normalization methods for 92 spike-ins don't necessarily apply very well because we don't have enough data points. When I brought this up at a meeting with one of the people who is involved in developing the bioinformatics of single cell, they were surprised we used the Ambion spikes and told us to simply take them out of our dataset altogether, on the assumption that the number of spikes would be the same between each sample. However, we see varying numbers of spikes between samples (for various reasons, some explained, some not) and so I'm still torn between normalizing to spikes vs using traditional routes. However, when I normalize to even my three spikes, the data appears to be a bit "cleaner" when doing comparative analyses.

So, if there's anyone out there who hasn't started yet, I highly recommend using the ERCC spike-ins and not the Ambion as recommended by Fluidigm. I know there is a paper currently under review that will hopefully come out in the next few months that extensively deals with the ERCC spike-ins and may hopefully shed some light on this topic.
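
As a rough illustration of what "normalizing to spikes" means in the simplest case, here is a minimal Python sketch that scales each cell by its total spike-in reads. It assumes the same amount of spike-in RNA went into every well, so that differences in spike-in reads reflect technical efficiency only; the function name and numbers are hypothetical, and with only three spikes the estimate will be noisy.

```python
import numpy as np

def spike_in_size_factors(spike_counts):
    """spike_counts: spikes x cells array of raw spike-in read counts."""
    spike_counts = np.asarray(spike_counts, dtype=float)
    totals = spike_counts.sum(axis=0)
    if np.any(totals == 0):
        raise ValueError("a cell has no spike-in reads; it cannot be scaled")
    # Scale so that the factors average to 1 across cells.
    return totals / totals.mean()

# Toy data: three spike-ins and two endogenous genes measured in three cells.
spikes = np.array([[50, 100, 150],
                   [30, 60, 90],
                   [20, 40, 60]])
genes = np.array([[10, 20, 30],
                  [8, 8, 8]])
factors = spike_in_size_factors(spikes)
normalized = genes / factors  # endogenous counts on a spike-in-defined scale
```

In this toy example the first gene tracks the spike-ins and comes out flat after scaling, while the second gene, whose raw counts looked constant, is revealed to differ between cells.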


In response to amolinaro, differential expression analysis is used for looking at two populations of cells. In SCDE you will have to define those groups (eg treated vs untreated) before it will calculate the data, just like in DESeq. However, if you have one sample that you have taken from a mixed population or perhaps stimulated, then you might want to use the same method as the Brennecke paper I mentioned, or, if your cells are dividing, then check out scLVM by Buettner et al Nature Biotechnology 2015. These methods are specific for looking at highly variable genes within a population of cells. They can also find "new" populations within your group of cells as defined by similar gene expression etc.
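
To give a feel for what "highly variable genes within a population" means numerically, here is a much-simplified Python sketch. It is not the Brennecke et al method: instead of fitting technical noise to spike-ins, it swaps in a plain Poisson baseline (CV^2 = 1/mean) and ranks genes by how far their squared coefficient of variation exceeds it. The function name, the `min_mean` cutoff and the toy data are all assumptions.

```python
import numpy as np

def rank_by_excess_cv2(norm_counts, min_mean=1.0):
    """norm_counts: genes x cells array of normalized counts from ONE population."""
    norm_counts = np.asarray(norm_counts, dtype=float)
    means = norm_counts.mean(axis=1)
    variances = norm_counts.var(axis=1, ddof=1)
    # Genes below the mean cutoff are too noisy to judge; they are ranked last.
    keep = means >= min_mean
    excess = np.full(means.shape, -np.inf)
    # Observed CV^2 minus the CV^2 expected from Poisson sampling alone (1 / mean).
    excess[keep] = variances[keep] / means[keep] ** 2 - 1.0 / means[keep]
    order = np.argsort(-excess)  # gene indices, most variable first
    return order, excess

# Toy data: gene 1 is switched on in a single cell, the others are steady or low.
norm = np.array([[10, 10, 10, 10],
                 [0, 40, 0, 0],
                 [5, 6, 4, 5],
                 [0, 0, 1, 0]])
order, excess = rank_by_excess_cv2(norm)
```

The output is a ranking of genes, not a table of p values per cell, which is the main practical difference from a two-group DE test.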
Old 06-30-2015, 05:10 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,478

Nice summary travelk!

Yeah, the Ambion spike-ins seem to be of questionable utility. I suppose one could measure variability with them, but not much else. We've been using ERCC spike-ins in our initial dataset to hopefully add a bit of robustness to things (not that the ERCC spike-ins are perfect).
Old 06-30-2015, 07:09 AM   #9
travelk
Member
 
Location: France

Join Date: Jul 2013
Posts: 20

I think the original purpose of the Ambion spike-ins was purely as a control to ensure that the lysis buffer was getting to all the wells in the C1 chip and that the RT was working efficiently for each cell (which isn't always the case so the spike-ins have been invaluable to us in that way). I don't think they were intended to be used as a normalization tool, but since they are there, it's tempting to use them. They are simply an artificial, theoretically controlled housekeeping gene in a way.

Yes, the ERCC spike-ins aren't perfect, but I think they do give a lot of information about the variability of the method in general and specifically in each data set. It's much better to have them and not need them than the other way around (which is what happened with us). I think a lot of new data and bioinformatics methods are going to be coming out in the next year or two and having the right tools available in your data now will allow you to access those methods in the future.
Old 02-19-2016, 03:39 AM   #10
immpdaf
Junior Member
 
Location: Stockholm

Join Date: Sep 2015
Posts: 4

Hi everyone,
I would be very grateful if anyone could give me some suggestions for the analysis of our single-cell RNA-seq data.

We have two groups of single cells (one of normal cells and one of diseased cells) on which we performed single-cell RNA sequencing. Our libraries were made with the SMART-SEQ2 protocol and sequenced single-end. We have around 4 million reads per cell.

Now we want to use differential gene expression analysis to find genes that are significantly upregulated or downregulated in the disease group relative to the normal group.
So, which normalization technique would you recommend? Our bioinformatician uses TMM to normalize raw counts and applies the R package Monocle to perform DE.
He believes that if we use RPKM we will get many false-positive genes, since we are not comparing genes within one sample but comparing across different samples. Do you think that is right?

Many thanks in advance.
Old 11-23-2016, 08:27 PM   #11
kobeho24
Member
 
Location: HKUST, Hong Kong

Join Date: Apr 2015
Posts: 32

Quote:
Originally Posted by immpdaf View Post
Hi everyone,
I would be very grateful if anyone could give me some suggestions for the analysis of our single-cell RNA-seq data.

We have two groups of single cells (one of normal cells and one of diseased cells) on which we performed single-cell RNA sequencing. Our libraries were made with the SMART-SEQ2 protocol and sequenced single-end. We have around 4 million reads per cell.

Now we want to use differential gene expression analysis to find genes that are significantly upregulated or downregulated in the disease group relative to the normal group.
So, which normalization technique would you recommend? Our bioinformatician uses TMM to normalize raw counts and applies the R package Monocle to perform DE.
He believes that if we use RPKM we will get many false-positive genes, since we are not comparing genes within one sample but comparing across different samples. Do you think that is right?

Many thanks in advance.
TMM normalization factors are known not to take library sizes into account, so I think this will be problematic if your library sizes vary widely.

Gary
Old 12-09-2016, 08:02 AM   #12
hideandSEQ
Junior Member
 
Location: New Haven

Join Date: Mar 2016
Posts: 8

Quote:
Originally Posted by kobeho24 View Post
TMM normalization factors are known not to take library sizes into account, so I think this will be problematic if your library sizes vary widely.

Gary
Most analysis reviews released in the past year strongly discourage normalizing scRNA data based on library size (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4823857/ http://www.nature.com/nrg/journal/v1...or-information ). Transcript abundance is highly variable across cells, even within homogeneous populations, and by normalizing based on library size you assume that the initial abundances are the same in every cell.
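
A small Python sketch of that assumption, using made-up molecule numbers: two cells with the same composition but a twofold difference in total RNA become indistinguishable after counts-per-million scaling. The function name is hypothetical and the example ignores sampling noise entirely.

```python
import numpy as np

def counts_per_million(counts):
    """Library-size normalization: scale every cell (column) to a total of 1e6."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

# Hypothetical true molecule numbers: cell B holds twice the RNA of cell A.
molecules = np.array([[100, 200],
                      [300, 600],
                      [600, 1200]])
cpm = counts_per_million(molecules)
# Both columns of cpm are now identical: the twofold difference is gone.
```

Whether removing that difference is a bug or a feature depends on whether total RNA content per cell is part of the biology you care about.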

Your goal in scRNAseq should be to use a measurement that approximates absolute transcript counts for the cells: ideally a UMI approach, although normalizing with spike-ins is also a solid alternative.

Otherwise there's not much of an obvious answer in normalization for C1 data. It's hard to account for non-linear distortion of amplification in C1 data without spike-ins or UMIs, as there are over 20 PCR cycles involved in library generation. Without either I would try out both FPKM and TPM and see how DE looks with each of them.
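
For reference, a minimal Python sketch of how FPKM and TPM are computed from a count matrix and gene lengths. The function names, lengths and counts are hypothetical, effective-length corrections are left out, and both measures assume reads are spread evenly along the transcript, which the 3' bias mentioned in the first post would violate.

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1e3
    per_million = counts.sum(axis=0, keepdims=True) / 1e6
    return counts / lengths_kb / per_million

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale each cell to 1e6."""
    counts = np.asarray(counts, dtype=float)
    rate = counts / np.asarray(lengths_bp, dtype=float)[:, None]
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

# Toy data: three genes (rows) in two cells (columns).
counts = np.array([[100, 200],
                   [300, 100],
                   [600, 700]])
lengths_bp = np.array([1000, 2000, 3000])
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

Within a single cell the two are proportional; they differ only in that TPM always sums to one million per cell, which makes values slightly easier to compare across cells.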
Old 12-10-2016, 03:09 AM   #13
kobeho24
Member
 
Location: HKUST, Hong Kong

Join Date: Apr 2015
Posts: 32

Quote:
Originally Posted by hideandSEQ View Post
Most analysis reviews released in the past year strongly discourage normalizing scRNA data based on library size (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4823857/ http://www.nature.com/nrg/journal/v1...or-information ). Transcript abundance is highly variable across cells, even within homogeneous populations, and by normalizing based on library size you assume that the initial abundances are the same in every cell.

Your goal in scRNAseq should be to use a measurement that approximates absolute transcript counts for the cells: ideally a UMI approach, although normalizing with spike-ins is also a solid alternative.

Otherwise there's not much of an obvious answer in normalization for C1 data. It's hard to account for non-linear distortion of amplification in C1 data without spike-ins or UMIs, as there are over 20 PCR cycles involved in library generation. Without either I would try out both FPKM and TPM and see how DE looks with each of them.
I disagree that normalizing by library size implies assuming the initial abundances are the same. I think it only corrects for the amount of library that was sequenced, not for the actual initial number of RNA molecules, since you cannot guarantee that every single-cell library is sequenced to the same number of reads. Indeed, normalizing scRNA-seq data is still a challenge for isoform or full-length transcript analysis, but it is much more tractable for 5'/3' sequencing with UMIs and spike-ins. I have always wondered whether anybody has a solid pipeline for isoform-level scRNA-seq analysis. Thanks in advance!

Gary
Old 12-10-2016, 08:09 AM   #14
seqgirl123
Member
 
Location: U.S

Join Date: Oct 2008
Posts: 74

Does anyone have a workflow yet for scRNA that allows RT barcoding and UMI labeling on sequencing whole transcript RNA? If so, what kit and analysis tools do you use?
Old 12-10-2016, 08:55 PM   #15
kobeho24
Member
 
Location: HKUST, Hong Kong

Join Date: Apr 2015
Posts: 32

Quote:
Originally Posted by seqgirl123 View Post
Does anyone have a workflow yet for scRNA that allows RT barcoding and UMI labeling on sequencing whole transcript RNA? If so, what kit and analysis tools do you use?
Frankly speaking, due to the short read length of NGS technology, there isn't such a kit or workflow. And it's not practical to do it on a third-generation sequencing platform, since the throughput is still too low for any quantitative analysis.

Gary
Old 06-05-2018, 11:55 AM   #16
Bidfudge
Junior Member
 
Location: France

Join Date: Jun 2016
Posts: 5
Feedback on these experiments

Quote:
Originally Posted by travelk View Post
I think the original purpose of the Ambion spike-ins was purely as a control to ensure that the lysis buffer was getting to all the wells in the C1 chip and that the RT was working efficiently for each cell (which isn't always the case so the spike-ins have been invaluable to us in that way). I don't think they were intended to be used as a normalization tool, but since they are there, it's tempting to use them. They are simply an artificial, theoretically controlled housekeeping gene in a way.

Yes, the ERCC spike-ins aren't perfect, but I think they do give a lot of information about the variability of the method in general and specifically in each data set. It's much better to have them and not need them than the other way around (which is what happened with us). I think a lot of new data and bioinformatics methods are going to be coming out in the next year or two and having the right tools available in your data now will allow you to access those methods in the future.

Hi Travelk,

Did you publish the paper related to the data generated with the C1? I'm facing the same situation: I will soon have full-length RNA-seq data from the C1, and we used the Ambion spike-ins as recommended in Fluidigm's protocol.

Did the Ambion spike-ins allow you to normalize these data?

Thanks,
Tags
normalization, single-cell