I hope you’re all doing well.
I’m currently analysing some RNA-seq data from zebrafish brain samples and I am running into some problems. Just a quick disclaimer: this is my first RNA-seq project and I don't have much experience.
So briefly, this is the project:
16 controls (8 females + 8 males)
16 low dose (8 females + 8 males)
16 high dose (8 females + 8 males)
I used Trimmomatic to trim the reads, then RSEM to map them to the reference transcriptome and estimate the counts.
I then normalised the data with upper-quartile (UQ) normalisation followed by RUVSeq (RUVg method), although using TMM gave very similar results.
Finally, I did the DE analysis using the following functions in edgeR:
-estimateDisp()
-glmFit() – including sex, week of experiment and tank number as covariates
-glmLRT() – coefficients low and high dose against controls
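
A minimal sketch of the edgeR steps listed above, for readers who want to reproduce the shape of the analysis. Everything about the data here is an assumption made for illustration: the counts are simulated, the sample sheet (4 weeks, 2 tanks, fully crossed with dose and sex) is hypothetical, and TMM stands in for the UQ + RUVg normalisation, so this is not the actual dataset or script.

```r
library(edgeR)

set.seed(1)

# Hypothetical sample sheet: 3 doses x 2 sexes x 4 weeks x 2 tanks = 48 fish.
# The real week/tank layout is not given in the post.
samples <- expand.grid(
  dose = c("control", "low", "high"),
  sex  = c("F", "M"),
  week = paste0("w", 1:4),
  tank = paste0("t", 1:2)
)
# Make sure "control" is the reference level for dose.
samples$dose <- factor(samples$dose, levels = c("control", "low", "high"))

# Simulated counts (1000 genes x 48 samples), standing in for the RSEM output.
n_genes <- 1000
counts <- matrix(
  rnbinom(n_genes * nrow(samples), mu = 100, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_len(nrow(samples))))
)

y <- DGEList(counts = counts)
y <- calcNormFactors(y, method = "TMM")  # TMM used here instead of UQ + RUVg

# Dose effect adjusted for sex, week and tank as covariates.
design <- model.matrix(~ dose + sex + week + tank, data = samples)

y   <- estimateDisp(y, design)
fit <- glmFit(y, design)

# Likelihood ratio tests: each dose against controls.
lrt_low  <- glmLRT(fit, coef = "doselow")
lrt_high <- glmLRT(fit, coef = "dosehigh")

topTags(lrt_low)
topTags(lrt_high)
```

With the real data, `counts` would be the RSEM expected counts and `samples` the actual sample annotation; the coefficient names passed to `glmLRT()` come from `colnames(design)`.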
The problem is that the first time I did this I got about 60 genes significant at 5% FDR, and when I looked at them they were mainly being driven by one (sometimes two) outlying samples (see example of one gene attached).
But when I run a PCA on the whole dataset I don’t see any obvious outliers.
When I remove the most problematic sample and repeat the analysis, I still get some DE genes that are mainly driven by one or two samples.
The attached figure shows raw counts, but plots of CPM or pseudocounts look very similar.
What's the best practice in this situation? Is there a tool I can use to identify genes that have outlying samples, or samples that have outlying genes? And if so, what should I remove: samples or genes?
Thanks in advance!