SEQanswers

Old 08-15-2014, 08:32 AM   #1
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default zero rna-seq values AFTER normalisation in edgeR

I am using edgeR to analyze RNA-Seq data. This is my script:


library("edgeR")
#############################
#read in metadata & DGE
#############################
composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",")
counts <- readDGE(composite_samples$CountFiles)$counts
#############################
#Filter & Library Size Re-set
#############################
noint <- rownames(counts) %in% (c("no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique"))
cpms <- cpm(counts)
keep <- rowSums(cpms>1)>=3 & !noint
counts <- counts[keep,]
colnames(counts) <- composite_samples$SampleName
d <- DGEList(counts=counts, group=composite_samples$Condition)
d$samples$lib.size <- colSums(d$counts)
#############################
#Normalisation
#############################
d <- calcNormFactors(d)
#############################
#Recording the normalized counts
#############################
all_cpm <- cpm(d, normalized.lib.sizes=TRUE)
all_counts <- cbind(rownames(all_cpm), all_cpm)
colnames(all_counts)[1] <- "Ensembl.Gene.ID"
rownames(all_counts) <- NULL
#############################
#Estimate Dispersion
#############################
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
#############################
#Perform a test
#############################
de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N"))


I believe that the variable "all_counts" should contain the normalized counts for each sample in each condition. My understanding is also that edgeR adds pseudocounts BEFORE performing the library normalisation. Thus it is possible that some values revert to being zero after normalisation, but I thought that this would happen rarely. Yet in a recent dataset I find an improbably large number of zero values in "all_counts", which made me think that my understanding of how pseudocounts and normalisation work in edgeR might be incorrect. Could somebody please comment on this?
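One thing that can be checked directly is how the filter in the script interacts with zeros. The following is a minimal base-R sketch on a made-up count matrix (the numbers and gene names are hypothetical, and CPM is computed by hand rather than with edgeR). It suggests that a gene can pass `rowSums(cpms>1)>=3` while still holding zeros in other samples, and that plain scaling by library size leaves every raw zero at zero. Whether this accounts for the zeros in the dataset described above cannot be told from the post.

```r
# Toy count matrix: 4 genes x 6 samples (hypothetical numbers, not real data).
counts <- matrix(
  c(50, 60, 55,  0,  0,  0,   # expressed in one condition only
    10, 12, 11,  9, 10, 12,   # expressed in every sample
     0,  1,  0,  0,  2,  0,   # barely detected
    30,  0, 25, 40,  0, 35),  # sporadic zeros
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:6))
)

# Plain counts-per-million: divide each column by its library size.
lib_size <- colSums(counts)
cpm_by_hand <- t(t(counts) / lib_size) * 1e6

# Same filter as in the script: CPM > 1 in at least 3 samples.
keep <- rowSums(cpm_by_hand > 1) >= 3

# Genes that pass the filter can still hold zeros in the remaining samples.
zeros_kept <- sum(counts[keep, ] == 0)

# Scaling by library size never changes which entries are zero.
same_zero_pattern <- all((counts == 0) == (cpm_by_hand == 0))
```

Here `zeros_kept` counts the zero entries that survive filtering, and `same_zero_pattern` records whether the zero positions are identical before and after scaling.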
Old 08-15-2014, 02:28 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Please don't cross-post on here and on the Bioconductor email list.
Old 08-16-2014, 04:52 AM   #3
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default the counts reported by edgeR are not normalized

This is the kind response by James MacDonald which I got on the Bioconductor list:

https://stat.ethz.ch/pipermail/bioco...st/061055.html

In short, the scores reported in all_counts are not normalised.
Old 08-16-2014, 09:17 AM   #4
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default

dpryan, I see your point (and appreciate the help you have generously given on so many occasions) but the reason for cross-posting is that not everyone is following all the forums. In this case within a few hours I got help from the Bioconductor list and I was able to proceed with my work. But you never know how long this is going to take. Or whether you will get a response at all. I have had questions that haven't been answered at all.

What I try to do is to always crosspost the answers, too, so that people don't respond in vain and so that other people having the same issue can benefit, too.
Old 08-18-2014, 02:23 AM   #5
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91
Default

Quote:
Originally Posted by feralBiologist
dpryan, I see your point (and appreciate the help you have generously given on so many occasions) but the reason for cross-posting is that not everyone is following all the forums.
We do ask users please not to post the same question to multiple forums simultaneously.

Quote:
In this case within a few hours I got help from the Bioconductor list and I was able to proceed with my work. But you never know how long this is going to take. Or whether you will get a response at all. I have had questions that haven't been answered at all.
All reasonable questions sent to the Bioconductor mailing list get an answer. A search suggests that you have posted three questions to the Bioconductor mailing list, and that I have answered all of them myself.

The edgeR developers don't live in the same time zone as you and we can't answer everything within a few hours.

Quote:
What I try to do is to always crosspost the answers, too, so that people don't respond in vain and so that other people having the same issue can benefit, too.
But your cross post of James MacDonald's answer isn't correct. The cpm values are of course normalized, they are just not "normalized counts".
Old 08-18-2014, 04:03 AM   #6
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default

Quote:
A search suggests that you have posted three questions to the Bioconductor mailing list, and that I have answered all of them myself.
You are right, and I once again thank you for this. I will not post edgeR questions to SEQanswers anymore. In the past I have used SEQanswers a lot more often than I have used Bioconductor (and not just for edgeR) and not all of my questions have been answered. A quick search on SEQanswers shows this. Maybe some of them were not precisely formulated, I don't know. But they made me think that help might not always come.
Quote:
But your cross post of James MacDonald's answer isn't correct.
This is how I understood the answer of James. He says that counts are not affected by the normalization and I explained on the bioconductor thread that I understood "normalisation" to comprise all the transformations performed on the raw counts. Thanks to your kind reply in bioconductor I was reminded that in edgeR "normalisation" refers to multiple transformations and that not all of them are reflected in the cpm() output. I was about to post this clarification but you were faster than me.

Once more, thanks for your assistance and for helping to create edgeR and other analytic tools that I have used.
Old 08-18-2014, 05:22 PM   #7
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91
Default

Quote:
Originally Posted by feralBiologist
Thanks to your kind reply in bioconductor I was reminded that in edgeR "normalisation" refers to multiple transformations and that not all of them are reflected in the cpm() output.
Well, the cpm values are fully normalized. The issue is rather that the cpm values produced by cpm() are just for descriptive purposes. They are not used by any of the core functions in edgeR which estimate parameters or evaluate differential expression.
Old 08-19-2014, 10:14 AM   #8
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default

Quote:
Originally Posted by Gordon Smyth
Well, the cpm values are fully normalized. The issue is rather that the cpm values produced by cpm() are just for descriptive purposes. They are not used by any of the core functions in edgeR which estimate parameters or evaluate differential expression.
Now I am confused again. And maybe I am not the only one as the response by James MacDonald in the bioconductor thread indicates. I believe this confusion is due to the fact that "normalization" in edgeR seems to mean different things depending on the context. I might be a bit naive but to me any transformation performed on the raw score prior to computing differential expression can be described as "normalisation". This would include library size scaling, TMM, pseudocounts. You seemed to agree with James' response and he literally said "The counts are not affected by the normalization".

Now you seem to say exactly the opposite. Can you, please, clarify?

What I can say with certainty is that no pseudo-counts seem to have been added to the raw counts, otherwise I wouldn't have observed the zeros. What is not clear to me is whether both library scaling and TMM normalisation have been applied.
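The observation about zeros can be illustrated with a small base-R sketch. It assumes, based on the edgeR documentation for `cpm()`, that a prior count only enters when values are requested on the log2 scale (`log=TRUE`), and that it is scaled in proportion to library size. The variable names and the value of `prior_count` are illustrative, the default differs between edgeR versions, and the formula should be checked against the installed release.

```r
# Hypothetical raw counts for one gene in three samples, with library sizes.
x        <- c(0, 5, 20)
lib_size <- c(1e6, 2e6, 4e6)

# Plain CPM: nothing is added to the counts, so a raw zero stays zero.
plain_cpm <- x / lib_size * 1e6

# Log2-CPM with a prior count (assumed form of the log=TRUE calculation).
prior_count  <- 2                                        # illustrative value
prior_scaled <- lib_size / mean(lib_size) * prior_count  # proportional to library size
log_cpm <- log2((x + prior_scaled) / (lib_size + 2 * prior_scaled) * 1e6)
```

Under these assumptions the zero in `plain_cpm` is expected, while `log_cpm` stays finite for the same gene because the offset keeps the argument of the logarithm positive.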
Old 08-19-2014, 02:28 PM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

CPM isn't used to calculate differential expression, so it doesn't fit your definition of normalization (normalization is a generic term that doesn't really fit what you wrote). Nothing in Gordon's reply contradicts James' reply on the mailing list.
Old 08-19-2014, 03:49 PM   #10
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default

Quote:
Originally Posted by dpryan
CPM isn't used to calculate differential expression, so it doesn't fit your definition of normalization (normalization is a generic term that doesn't really fit what you wrote). Nothing in Gordon's reply contradicts James' reply on the mailing list.
Thanks for your response but it still does not answer the question I asked. OK, let's drop "normalisation" as it is a confusing term. What I really wanted to know is "How do you get from raw counts to cpm()'s output? What are the transformations/manipulations performed?"

One thing mentioned by Gordon Smyth is the library size scaling. Is this all? I had a look at the help info on cpm() - it does not explicitly mention anything else.
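For reference, the calculation can be sketched by hand in base R. This is a minimal sketch, assuming that `cpm()` on a DGEList with `normalized.lib.sizes=TRUE` and `log=FALSE` divides each column of counts by an effective library size (library size times the normalization factor stored by `calcNormFactors()`) and multiplies by one million. The count matrix and the `norm_factors` values are made up for illustration and were not produced by TMM.

```r
# Hypothetical raw counts: 3 genes x 2 samples (filled column by column).
counts <- matrix(c(0, 10, 90,
                   0, 40, 160),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

lib_size     <- colSums(counts)          # 100 and 200
norm_factors <- c(s1 = 0.8, s2 = 1.25)   # illustrative values, not computed by TMM

# Effective library size = library size x normalization factor.
eff_lib_size <- lib_size * norm_factors  # 80 and 250

# CPM: scale each column by its effective library size, then by one million.
cpm_by_hand <- t(t(counts) / eff_lib_size) * 1e6

# For comparison, the same scaling without the normalization factors.
cpm_unnormalized <- t(t(counts) / lib_size) * 1e6
```

With edgeR installed, the comparable call would be `cpm(d)` on a DGEList that holds these counts and factors. That comparison is not run here, so treat the sketch as a reading of the documented behaviour and not as the package's implementation.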
Tags
edger, normalisation, pseudocounts, zero counts