Dampor 11-25-2014 03:43 PM

Question for experts in normalization
I have replicated RNA-seq samples from 6 populations under 2 treatments (benign and heat stress), with the experiment repeated at two separate times (year 1 and year 2).
I am looking for the best way to normalize the reads before testing for differential gene expression (I will be using a GLM). The options I see are:

a. Normalize all the samples together.
b. Normalize by year: two separate normalizations (year 1 samples separate from year 2 samples).
c. Normalize by treatment: two separate normalizations (benign samples separate from heat stress samples).

I have a vague idea about which one would work better, but I would appreciate an expert opinion.


dpryan 11-26-2014 12:30 AM

Typically one would normalize all of the samples together. You would also fit the entire dataset with a single model, with year and treatment as factors, rather than subsetting it by year or treatment. While there are certainly occasions when this works poorly (generally when there's a large difference in read counts between the levels of a factor), it's generally the best course.
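To make "normalize all of the samples together" concrete, here is a minimal sketch of median-of-ratios size factors, the kind of scaling DESeq2 applies by default. The sample names and counts are made up for illustration, and `size_factors` is a hypothetical helper, not a function from any package. In practice you would let the tool compute this for you.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors computed across ALL samples at once.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Use only genes with a nonzero count in every sample (log is defined).
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log geometric mean of each usable gene over every sample in the study.
    log_geo = {g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
               for g in usable}
    # Size factor = median ratio of a sample's counts to that reference.
    return {s: math.exp(median(math.log(counts[s][g]) - log_geo[g]
                               for g in usable))
            for s in samples}

# Hypothetical toy data: 5 genes, one sample per year/treatment combination.
counts = {
    "y1_benign": [10, 200, 35, 0, 80],
    "y1_heat":   [20, 400, 70, 0, 160],  # same profile, sequenced twice as deep
    "y2_benign": [12, 180, 40, 3, 90],
    "y2_heat":   [30, 500, 60, 1, 150],
}

sf = size_factors(counts)
# Dividing by the size factor puts every sample on one common scale.
normalized = {s: [c / sf[s] for c in counts[s]] for s in counts}
```

Because the per-gene reference is built from every sample, years and treatments all end up on the same scale. Normalizing each year or treatment separately would give each subset its own reference, so values would no longer be directly comparable across subsets.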

BTW, I hope you don't plan to run your own GLM. There are many prewritten tools, such as DESeq2 or edgeR, that have additional features, and there's no point in reinventing the wheel.

