SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
RNA-seq normalization: How to use TMM and rpkm() in EdgeR?? alisonewr Bioinformatics 2 10-27-2014 08:48 AM
EdgeR and Deseq Normalization problems moriah Bioinformatics 4 03-26-2014 03:32 AM
edgeR VS deseq normalization narges Bioinformatics 0 07-04-2013 05:27 AM
EdgeR Normalization Question R_user RNA Sequencing 1 05-07-2012 10:14 PM
edgeR Normalization question polsum Bioinformatics 1 05-07-2012 12:21 AM

Reply
 
Thread Tools
Old 03-20-2015, 05:51 AM   #1
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
Default edgeR: Normalization does not seem to work

Hi folks,

I have a problem with edgeR, and I hope someone here can help me with it.
I try to compare 11 samples (4 conditions, 3 replicates each, besides 1, where one sample failed).
The sequencing depth is highly different (in 1 condition only 2000-3000 reads * map, in another condition 120.000-550.000 reads).
I performed the DE analysis with edgeR, and there are actually differentially expressed genes. But if I take a closer look at it, it's always so that the genes are upregulated in the samples, which have higher sequencing depth/more reads.
So it seems the normalization doesn't work.
One of my colleagues took a look at the smear plot (attached), and immediatly said "this is wrong". If I take a look on how it should look, e.g. the figure from the TMM paper http://genomebiology.com/2010/11/3/r25 , then I'd also say "yes, this is wrong".
If I look at the pseudocounts, I'd say that the normalization didn't work properly, because the counts are still extremely different between the conditions.

I've also checked again with only putting in the counts from the 2 conditions which have the most reads (#1: 57770, 19668, 82123, #2 124571, 554628 ; under the assumption that this change the result of the normalization), but the end result doesn't change considerably.



Any clue what could have gone wrong there?
I'm especially worried, because I've already worked with edgeR for another dataset, which is less uneven, but shows also some bias in the distribution of DE genes (not really extreme, but now I don't know if that's biology or technical problems).


Background:

Used R commands, as taken from the tutorial:
PHP Code:
<- read.delim("input.csv",row.names="Contig")
group <- c(1,1,1,2,2,2,3,3,4,4,4
library("edgeR")
<- DGEList(counts=xgroup=group)
d$samples$lib.size <- colSums(d$counts)
<- calcNormFactors(d)
<- estimateCommonDisp(dverbose=TRUE)
<- estimateTagwiseDisp(dtrend="none")
result21 <- exactTest(dpair=c(2,1) )
plotSmear(d,pair=c(2,1),de.tags=rownames(topTags(result21,n=min(dim(result21$table[result21$table[,3]<0.05,])[1],500))$table),pch=19,cex=1,lowess=TRUE,smooth.scatter=TRUE
- why is the coverage so uneven and so low:
It's a meta-transcriptomics sample, and they apparenlty sequenced to 99.9% host, because the samples were mainly scraped from the host tissue.

- what did I do to get the counts?
First we did a denovo cross-assembly, but I was told afterwards that these were gnotobiotic animals with 4 different known bacteria. The assembly confirmed this, but for easier working we mapped the data back to the original genomes (concatenated together, then just ran bowtie2 with default parameters, and then got the read coverage with subread). Unmapped reads are getting assembled right now, to catch any biological differences between the bacteria in the samples and in the database, but that shouldn't affect the current data massively.

*I'm aware that the statistics will either suck or will not be existent for this.


I'm really worried at the moment, any idea would help.
Thanks !
Attached Files
File Type: pdf smearplot_4_vs_1.pdf (130.3 KB, 27 views)

Last edited by bastianwur; 03-20-2015 at 05:58 AM.
bastianwur is offline   Reply With Quote
Old 03-21-2015, 11:28 PM   #2
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91
Default

Your data sounds pretty crazy and it is quite possible that TMM is not a strong enough normalization. You might try quantile, but it is hard to imagine any normalization that will robustly handle systematic imbalances in sequencing depth as strong as in your data.

However your conclusion that the DE genes shouldn't be mostly in one direction is not correct. It is perfectly possible almost all the DE to be one direction even when normalization has worked perfectly; in fact it is one of the aims of normalization to allow this.
Gordon Smyth is offline   Reply With Quote
Old 03-23-2015, 04:24 AM   #3
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
Default

Okay, thanks, I'll have a try .

Quote:
Originally Posted by Gordon Smyth View Post
However your conclusion that the DE genes shouldn't be mostly in one direction is not correct. It is perfectly possible almost all the DE to be one direction even when normalization has worked perfectly; in fact it is one of the aims of normalization to allow this.
Yeah, I'm aware that this could be, that the impact is just in the other direction distributed over more genes, and therefore no statistical significance can be seen. But I don't believe that for our data, we don't expect that, it wouldn't make any sense, and that the regulation really only followed the sequencing depth makes it really very suspicous.

But as said, I'll gonna try, thanks .
bastianwur is offline   Reply With Quote
Old 03-31-2015, 07:22 AM   #4
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
Default

Okay...doesn't seem to work. Was sort of expected :/.

Any idea what I could do with this data?
The data itself is:
- 3 sampling sites in a gut, from the tissue
- 1 sampling site in the gut, from the content
(all in triplicate, besides one, where a sample failed...and one more sampling site from the content in triplicate which had to be discarded due to DNA contamination)

2 sampling sites from the tissue have the mentioned extremly low coverage, so I'd preferably not do anything with it.
The 2 remaining sampling sites (1 tissue, 1 content) are not from the same site (e.g. one was upper gut, the other one lower gut), so you cannot really compare them either.
We also cannot take a look at which genes are present, because the animals have been inoculated with a known mix of bacteria, so there's nothing to gain from that.
I'm still waiting for the assembly of the reads, which didn't map to the genomes, to see the genetic differences, but I'm not sure if that'll be meaningful.

If anyone had an idea to what to try here, then it'd be really appreciated .
bastianwur is offline   Reply With Quote
Old 03-31-2015, 10:25 PM   #5
shunyip
Member
 
Location: New York

Join Date: Oct 2013
Posts: 20
Default

In this paper,
"Anders, S., McCarthy, D. J., Chen, Y., Okoniewski, M., Smyth, G. K., Huber, W., & Robinson, M. D. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature protocols, 8(9), 1765-1786."
They mentioned that "In edgeR, it is recommended to remove features without at least 1 read per million in n of the samples, where n is the size of the smallest group of replicates."

However, since your sequencing depths are highly different with each other, maybe you can try removing features separately for all 6 comparisons.
shunyip is offline   Reply With Quote
Old 04-08-2015, 07:53 AM   #6
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
Default

Thanks .
That helped a bit.

I've talked to the lab person, and we actually now went with only comparing the 2 better groups (which don't fit together).
The organism distribution is so, that in 1 group you have 3/4 organism A and 1/4 organism B, whereas in the 2. group it's the other way around.
No surprise, the DE result was that the organism with the more reads has upregulated genes in both cases.
Then I thought again what I talked about earlier with a colleague: With normalization you try to make the RNA output of a cell (or at least a population) comparable. That is not the same as making the RNA output of the ecosystem comparable, since the different players might vary in the ecosystem. While we cannot do anything about it in our other meta-omics data (because we cannot fully define the organisms), I know what bacteria I have here -> I separated the expression per organism, then did the normalization and DE on it.
That gives some really crappy output (as expected; in one case FDR needs to be 0.1, else I don't have any rsults), but I now at least get up- and downregulation per organism, and they make a tad bit sense (e.g. biofilm formation at a surface).
That'll probably be good enough to fill the poster, and that's then about it, afterwards we'll throw the data away.

Thanks for the help everyone .
bastianwur is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:06 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO