Hi,
I am trying to analyse my RNA-seq data with edgeR (4 conditions: VA, PA, BP, NP; 3 replicates each). The "counts" matrix looks like this in R:
row.names VA1 VA2 VA3 PA1 PA2 PA3 BP1 BP2 BP3 NP1 NP2 NP3
1 mp090_00001 1 16 10 2 1 1 1 2 2 3 0 4
2 mp090_00002 289 433 512 130 253 313 193 267 317 480 595 888
3 mp090_00003 721 566 702 220 360 474 299 512 647 2147 1769 2601
4 mp090_00004 921 716 724 41 56 77 62 84 97 588 528 1175
5 mp090_00005 1 2 2 1 1 1 0 1 0 1 0 0
As a first step, I would like to calculate normalized read counts for each gene in each of the 12 (4x3) replicates, relative to the total number of mapped reads in each replicate. Then I am thinking of simply taking the mean over the three replicates of each condition, so that I end up with normalized read counts per condition per gene instead of the raw counts currently in the "counts" matrix above.
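If it helps, here is a sketch of what I have in mind, assuming `cpm()` with `normalized.lib.sizes=TRUE` is the right tool for the per-replicate normalization (the averaging step at the end is my own guess, not something from the edgeR docs):

```r
library(edgeR)

# toy counts matrix with the same layout as mine (genes x 12 samples)
group  <- factor(rep(c("VA", "PA", "BP", "NP"), each = 3))
counts <- matrix(rpois(5 * 12, lambda = 100), nrow = 5,
                 dimnames = list(paste0("gene", 1:5),
                                 paste0(rep(c("VA", "PA", "BP", "NP"), each = 3), 1:3)))

d <- DGEList(counts, group = group)
d <- calcNormFactors(d)               # TMM normalization factors

# counts per million, divided by the effective library sizes
norm.counts <- cpm(d, normalized.lib.sizes = TRUE)

# average replicates within each condition (my own aggregation step)
cond.means <- sapply(levels(group), function(g)
  rowMeans(norm.counts[, group == g, drop = FALSE]))
```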
In order to do that, I ran the following in R:
> library(edgeR)
> group <- c(rep("VA",3), rep("PA",3), rep("BP",3), rep("NP",3))
> d <- DGEList(counts, group=group, lib.size=colSums(counts))
> dnorm <- calcNormFactors(d)
> dnorm$samples
group lib.size norm.factors
VA1 VA 6974289 0.8485103
VA2 VA 6976361 0.8982784
VA3 VA 7817591 0.8927855
PA1 PA 1687274 1.2081254
PA2 PA 2639866 1.1679706
PA3 PA 3453869 1.1371379
BP1 BP 2578638 1.0063343
BP2 BP 3384353 1.0768956
BP3 BP 4056560 1.0409395
NP1 NP 9047082 0.9355826
NP2 NP 7851071 0.9358234
NP3 NP 12021020 0.9272787
> dnorm$samples$lib.size*dnorm$samples$norm.factors # calculate the effective library sizes
[1] 5917756 6266714 6979432 2038439 3083286 3927525 2594972 3644595 4222634 8464293 7347216 11146836
> lin.norm.factors <- (sum(d$samples$lib.size)/nrow(d$samples))/ d$samples$lib.size
> lin.norm.factors
[1] 0.8183388 0.8180957 0.7300626 3.3825752 2.1619776 1.6524458 2.2133123 1.6863877 1.4069387 0.6308477 0.7269494 0.4747793
I don't really know whether the norm.factors I am looking for are the ones shown above, and what exactly the difference is between them and the lin.norm.factors calculated at the end. I am also wondering how exactly to use these normalization factors in order to obtain the normalized read counts per gene per condition. I found that one option is to do the following, but I don't understand why you have to go through this:
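To show how I currently read the two quantities (my own interpretation, which may well be wrong):

```r
library(edgeR)

# toy data with the same shape as mine
group  <- factor(rep(c("VA", "PA", "BP", "NP"), each = 3))
counts <- matrix(rpois(5 * 12, lambda = 100), nrow = 5)

dnorm <- calcNormFactors(DGEList(counts, group = group))

# TMM: norm.factors rescale the raw library sizes to correct for
# composition bias; lib.size * norm.factors is the "effective" size
eff.lib <- dnorm$samples$lib.size * dnorm$samples$norm.factors

# my lin.norm.factors just rescale every library to the mean depth,
# with no composition correction at all
lin.factors <- mean(dnorm$samples$lib.size) / dnorm$samples$lib.size
```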
f <- calcNormFactors(counts)
f <- f / exp(mean(log(f)))
d <- DGEList(counts, group=group, lib.size=colSums(counts) * f)
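My guess (which may be off) is that the second line just recentres the factors around their geometric mean, so that they multiply to 1 and the overall sequencing depth is left unchanged:

```r
# hypothetical factors, for illustration only
f <- c(0.8, 1.2, 1.1, 0.95)

g <- f / exp(mean(log(f)))   # divide by the geometric mean of f
prod(g)                       # the recentred factors multiply to 1
```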
or to calculate d$pseudo.alt and save the relevant table, but that does not work in my case.
Any help on what exactly I have to do would be greatly appreciated, as I am new to edgeR and struggling to find my way out!
Thanks in advance