SEQanswers

Old 02-10-2014, 07:27 AM   #1
antoza
Member
 
Location: France

Join Date: Aug 2013
Posts: 18
edgeR: mean normalized read counts for each gene

Hi,
I am trying to analyse my RNA-seq data using edgeR (4 conditions: VA, PA, BP, NP; 3 replicates each). The "counts" matrix looks like this in R:

row.names VA1 VA2 VA3 PA1 PA2 PA3 BP1 BP2 BP3 NP1 NP2 NP3
1 mp090_00001 1 16 10 2 1 1 1 2 2 3 0 4
2 mp090_00002 289 433 512 130 253 313 193 267 317 480 595 888
3 mp090_00003 721 566 702 220 360 474 299 512 647 2147 1769 2601
4 mp090_00004 921 716 724 41 56 77 62 84 97 588 528 1175
5 mp090_00005 1 2 2 1 1 1 0 1 0 1 0 0



As a first step, I would like to calculate normalized read counts for each gene in each of the 12 replicates (4 x 3), relative to the total number of mapped reads in that replicate. Then I plan to take the mean over the three replicates of each condition, so that I end up with normalized read counts per condition per gene instead of the raw read counts currently in the "counts" matrix above (a small sketch of what I have in mind follows the output below).
To do that, I ran the following code in R:

> library(edgeR)
> group <- c(rep("VA",3), rep("PA",3), rep("BP",3), rep("NP",3))
> d <- DGEList(counts, group=group, lib.size=colSums(counts))
> dnorm <- calcNormFactors(d)
> dnorm$samples
group lib.size norm.factors
VA1 VA 6974289 0.8485103
VA2 VA 6976361 0.8982784
VA3 VA 7817591 0.8927855
PA1 PA 1687274 1.2081254
PA2 PA 2639866 1.1679706
PA3 PA 3453869 1.1371379
BP1 BP 2578638 1.0063343
BP2 BP 3384353 1.0768956
BP3 BP 4056560 1.0409395
NP1 NP 9047082 0.9355826
NP2 NP 7851071 0.9358234
NP3 NP 12021020 0.9272787

> dnorm$samples$lib.size*dnorm$samples$norm.factors # calculate the effective library sizes
[1] 5917756 6266714 6979432 2038439 3083286 3927525 2594972 3644595 4222634 8464293 7347216 11146836
> lin.norm.factors <- (sum(d$samples$lib.size)/nrow(d$samples))/ d$samples$lib.size
> lin.norm.factors
[1] 0.8183388 0.8180957 0.7300626 3.3825752 2.1619776 1.6524458 2.2133123 1.6863877 1.4069387 0.6308477 0.7269494 0.4747793
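
To make my first step concrete, this is the kind of calculation I have in mind (just an illustration reusing the dnorm and lin.norm.factors objects above; I am not sure it is the recommended edgeR way):

Code:
# Illustration only: two ways of scaling the raw counts, reusing the objects above.
# (a) Scale each column by its TMM effective library size:
eff.lib <- dnorm$samples$lib.size * dnorm$samples$norm.factors
tmm.cpm <- 1e6 * t(t(dnorm$counts) / eff.lib)        # per-million counts after TMM adjustment

# (b) Simply rescale each column to the mean raw library size (the lin.norm.factors above):
lin.counts <- t(t(dnorm$counts) * lin.norm.factors)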

I don't really know whether the norm.factors I am looking for are the ones shown above, and what exactly the difference is between them and the lin.norm.factors calculated at the end. I am also wondering how exactly to use these normalization factors in order to obtain the normalized read counts per gene per condition. I found that one option is to do the following, but I don't understand why you have to go through this:

f <- calcNormFactors(counts)
f <- f/exp(mean(log(f)))
d <- DGEList(counts, group=group, lib.size=colSums(counts) * f)

or to calculate d$pseudo.alt and save the relevant table, but that does not work in my case.
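
As far as I can tell, the second line in the snippet above only rescales the factors so that their geometric mean (and therefore their product) is exactly 1, without changing the ratios between samples; a quick check with made-up numbers (illustration only):

Code:
ff <- c(0.7, 1.3, 1.1)           # made-up factors, for illustration only
ff2 <- ff / exp(mean(log(ff)))   # divide by the geometric mean
prod(ff2)                        # 1: the re-centred factors multiply to one
ff2 / ff                         # the same constant for every sample: ratios are unchanged

But I still don't see why this re-centring step is needed before building the DGEList.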
Any help on what exactly I have to do would be really important for me, as I am new to edgeR and I am struggling to find my way out!
Thanks in advance

Last edited by antoza; 02-10-2014 at 07:31 AM.
antoza is offline   Reply With Quote
Old 02-10-2014, 07:47 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

It's likely easier to just use cpm():

Code:
library(edgeR)
group <- c(rep("VA",3),rep("PA",3),rep("BP",3), rep("NP",3))
d <- DGEList(counts, group=group)
d <- calcNormFactors(d)
norm.counts <- cpm(d)*1e6
Or something close to that.
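
Note that cpm() already returns counts per million computed from the effective library sizes (lib.size * norm.factors), so the extra *1e6 may not be needed if per-million values are what you're after. If you then want one value per condition, something along these lines should work (the averaging step at the end is just one possible way, not anything edgeR-specific):

Code:
library(edgeR)

group <- c(rep("VA",3), rep("PA",3), rep("BP",3), rep("NP",3))
d <- DGEList(counts, group=group)
d <- calcNormFactors(d)                  # TMM normalization factors

norm.counts <- cpm(d)                    # counts per million, using effective library sizes
cond.means <- sapply(unique(group), function(g)
    rowMeans(norm.counts[, group == g, drop=FALSE]))   # mean CPM per condition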
dpryan is offline   Reply With Quote
Old 02-11-2014, 08:22 AM   #3
antoza
Member
 
Location: France

Join Date: Aug 2013
Posts: 18

Thanks dpryan for your reply,

I used your suggestion and ran the code below without any problems:

> library(edgeR)
> group <- c(rep("VA",3), rep("PA",3), rep("BP",3), rep("NP",3))
> d <- DGEList(counts=countsTable, group=group, lib.size=colSums(countsTable))
> f <- calcNormFactors(countsTable)
> f <- f/exp(mean(log(f)))
> d <- DGEList(counts=countsTable, group=group, lib.size=colSums(countsTable) * f)
> pair <- c("VA", "PA")
> d1 <- estimateCommonDisp(d)
> f
[1] 0.8485103 0.8982784 0.8927855 1.2081254 1.1679706 1.1371379 1.0063343 1.0768956 1.0409395 0.9355826 0.9358234 0.9272787 # I guess these are the normalization factors for each of my 12 samples (right?)
> norm.counts <- cpm(d)*1e6
#########################
However, I still have the following questions:

1. When I look at the components of the object returned by estimateCommonDisp (d1 above), among other things there are $pseudo.counts for all genes across the 12 samples. What exactly do these numbers mean, and how do they differ from the normalized counts obtained with your suggestion? There is also a $pseudo.lib.size component ([1] 4867506) that I don't quite understand. I also checked that the following code

cpm <- 1e06*t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors) )

seems to give results similar to the syntax you gave me. I am not quite sure whether the output of one or the other is the genes' normalized counts, or whether the pseudo.counts mentioned above are.

2. I have also found that you can calculate normalized counts with the following code, which gave results different from both the pseudo.counts and the norm.counts (cpm) results. I am quite confused about the difference between these count matrices per gene per sample; could you please clarify this further (see also my attempt in the sketch after these three points)?

> scale <- d$samples$lib.size*d$samples$norm.factors
> normCounts <- round(t(t(countsTable)/scale)*mean(scale))

3. Finally, my common dispersion is 0.05073274. Does that mean that, in general, I have a low coefficient of variation among my 3 replicates for all 4 conditions? (I am not entirely sure about the biological meaning of the common dispersion.)
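
To show where my confusion comes from, this is how I currently line the three kinds of values up side by side (reusing d, d1 and countsTable from above; the comments are only my reading of the documentation and may well be wrong):

Code:
eff.lib <- d$samples$lib.size * d$samples$norm.factors    # effective library sizes

# (a) counts per million; this should reproduce what cpm(d) returns:
cpm.manual <- 1e6 * t(t(d$counts) / eff.lib)

# (b) counts rescaled to the mean effective library size; as far as I can tell this
#     differs from (a) only by the constant factor mean(eff.lib)/1e6, plus rounding:
normCounts <- round(t(t(countsTable) / eff.lib) * mean(eff.lib))

# (c) pseudo-counts from estimateCommonDisp; as far as I understand, these are
#     quantile-adjusted counts on the scale of d1$pseudo.lib.size, used internally
#     by the classic exact test rather than as general-purpose normalized counts:
head(d1$pseudo.counts)

# If I read the user's guide correctly, the biological coefficient of variation (BCV)
# is the square root of the dispersion:
sqrt(d1$common.dispersion)    # about 0.225 for a common dispersion of 0.0507

Is this the right way to think about it, and is a BCV around 22% considered reasonably low between replicates?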

Thank you so much……
antoza is offline   Reply With Quote