Hi folks,
I have a problem with edgeR, and I hope someone here can help me with it.
I'm trying to compare 11 samples (4 conditions with 3 replicates each, except one condition where a sample failed).
The sequencing depth differs hugely: in one condition only 2,000-3,000 reads map, while in another it's 120,000-550,000 reads.
The DE analysis with edgeR does report differentially expressed genes, but on closer inspection the genes are always upregulated in the samples with the higher sequencing depth / more reads.
So it seems the normalization isn't working.
One of my colleagues took a look at the smear plot (attached) and immediately said "this is wrong". Comparing it with how it should look, e.g. the figure from the TMM paper ( http://genomebiology.com/2010/11/3/r25 ), I'd also say "yes, this is wrong".
Looking at the pseudo-counts, I'd also say the normalization didn't work properly, because the counts are still extremely different between the conditions.
I've also re-run the analysis with only the counts from the two conditions that have the most reads (#1: 57,770, 19,668, 82,123; #2: 124,571, 554,628), under the assumption that this would change the result of the normalization, but the end result doesn't change considerably.
Any clue what could have gone wrong here?
I'm especially worried because I've already used edgeR on another, less uneven dataset, which also shows some bias in the distribution of DE genes (not really extreme, but now I don't know whether that's biology or a technical problem).
Background:
Used R commands, as taken from the tutorial:
Code:
library(edgeR)

x <- read.delim("input.csv", row.names = "Contig")
group <- c(1,1,1,2,2,2,3,3,4,4,4)

d <- DGEList(counts = x, group = group)
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d)                        # TMM normalization factors
d <- estimateCommonDisp(d, verbose = TRUE)
d <- estimateTagwiseDisp(d, trend = "none")

result21 <- exactTest(d, pair = c(2,1))

# highlight up to 500 tags with unadjusted p < 0.05 in the smear plot
n.sig <- sum(result21$table$PValue < 0.05)
de.tags <- rownames(topTags(result21, n = min(n.sig, 500))$table)
plotSmear(d, pair = c(2,1), de.tags = de.tags,
          pch = 19, cex = 1, lowess = TRUE, smooth.scatter = TRUE)
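For completeness, this is roughly how I convinced myself what "working" TMM normalization should look like, on toy data (made-up counts mimicking my depth imbalance, not my real file):

```r
library(edgeR)
set.seed(1)

# toy data: 3 shallow libraries (~2-3k reads) vs 3 deep ones (~200k reads),
# drawn from the same underlying expression profile
mu <- rexp(200, rate = 1)
shallow <- sapply(1:3, function(i) rpois(200, mu * 10))
deep    <- sapply(1:3, function(i) rpois(200, mu * 1000))
counts  <- cbind(shallow, deep)
rownames(counts) <- paste0("contig", 1:200)

d <- DGEList(counts = counts, group = rep(1:2, each = 3))
d <- calcNormFactors(d)

# TMM only corrects relative composition; raw depth is handled via lib.size,
# so the norm.factors should sit near 1 when composition is comparable
print(d$samples)

# cpm() uses the effective library size (lib.size * norm.factors), so these
# totals should be on the same scale for shallow and deep libraries
print(colSums(cpm(d)))
```

If the real data behaved like this, the shallow and deep samples would end up on a comparable CPM scale after normalization.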
- why is the coverage so uneven and so low?
It's a meta-transcriptomics sample, and apparently 99.9% of what was sequenced is host, because the samples were mainly scraped from the host tissue.
- what did I do to get the counts?
First we did a de novo cross-assembly, but I was told afterwards that these were gnotobiotic animals with 4 different known bacteria. The assembly confirmed this, but to make things easier we mapped the data back to the original genomes (concatenated together, then just ran bowtie2 with default parameters, and then got the read counts with Subread). The unmapped reads are being assembled right now, to catch any biological differences between the bacteria in the samples and those in the database, but that shouldn't massively affect the current data.
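To be explicit about the counting step, it was essentially this (a sketch only; the BAM and GTF file names are placeholders, and I'm showing featureCounts from the Rsubread package, which needs the real alignment files to run):

```r
# sketch of the counting step (file names are placeholders, not my real paths)
library(Rsubread)

fc <- featureCounts(files = c("sample01.bam", "sample02.bam"),  # bowtie2 output
                    annot.ext = "combined_genomes.gtf",         # 4 genomes concatenated
                    isGTFAnnotationFile = TRUE)

# fc$counts is the contig x sample matrix that read.delim() loads above
write.table(fc$counts, file = "input.csv", sep = "\t",
            quote = FALSE, col.names = NA)
```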
*I'm aware that the statistics for this will either be poor or non-existent.
I'm really worried at the moment, any idea would help.
Thanks !