SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Get the RPKM value of the genes analyzed using DESeq or edgeR narges Bioinformatics 8 01-29-2014 09:17 AM
very different numbers of differentially expressed genes by DESeq bliu1 RNA Sequencing 2 08-22-2012 09:41 AM
BIG difference in number of genes mapped with tophat vs. casava 1.8.2 hmortens Bioinformatics 0 07-30-2012 08:24 AM
DESeq and EdgeR: too many differentially expressed genes!?!? cutcopy11 Bioinformatics 5 12-08-2011 01:14 AM
number of TSSs and genes Light General 0 09-02-2010 07:27 AM

Reply
 
Thread Tools
Old 02-12-2013, 04:13 AM   #1
gege_55
Member
 
Location: Göttingen

Join Date: May 2012
Posts: 10
Default DESeq: Number of DE genes

Hi,

I have an RNAseq experiment with 9 different treatments as triplicates. My problem is that if I load all 9 treatments into DESeq I get 2695 DE genes for a given treatment. If I am loading only the libraries of that treatment and the control I get 314 DE genes. EstimateDispersions sharingMode was maximum.
Later in the analysis I want to build intersects and compare the genes between the different treatments.
Since I am not familiar with the dispersion concept I want to ask what the "correct" way to do the analysis is? Should I use all the data in one analysis or do each treatment separately?


Thank you!
gege_55 is offline   Reply With Quote
Old 02-12-2013, 05:47 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You'll need to give more information. Specifically, you would need to mention how you're study was designed, i.e. there are nested factors or just a bunch of different unrelated treatments or what now. I'm guessing that you have some sort of 3x3 design, but that's just a guess. Also, I assume that your triplicates are biological rather than technical replicates, so mention if that's not the case.
dpryan is offline   Reply With Quote
Old 02-12-2013, 06:48 AM   #3
gege_55
Member
 
Location: Göttingen

Join Date: May 2012
Posts: 10
Default

Sorry for that.
The treatments are RNAi knock downs of different genes, including receptors, transcription factors and signaling ligands. RNA was extracted from ~50 whole embryos (beetles) per replicate. The replicates are biological.

Treatments:
Wnt-pathway: arrow (receptor), frz1/2 (receptors) and wntless (secretion of wnt ligands)
Hedgehog-pathway: hedgehog (ligand) and cubitus interruptus (transcription factor)

Other treatments: 2 transcription factors not directly involved in the pathways and an enzyme involved in pigmentation at later stages as a negative control.

All treatments are compared to wild type RNA at the same stage.
For one treatment (arrow) we also looked at a later stage.

Summary: 8 treatments vs wildtype (both 10-11h old embryos)
1 treatment vs wildtype (both 22- 24h old embryos)

I assume that all treatments are nested, except the pigmentation enzyme.
To get good candidates I want to use the intersects of DE genes in the wnt- and hedgehog treatments.

The other two transcription factors delete anterior or posterior structures. With those maybe I can discriminate between anterior and posterior target genes, but that is a long shot.

Please let me know if you need further informations!

Thank you!
gege_55 is offline   Reply With Quote
Old 02-12-2013, 03:23 PM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

As a first try, generate a PCA plot as described in the DESeq vignette, and post it here.
Simon Anders is offline   Reply With Quote
Old 02-13-2013, 01:44 AM   #5
gege_55
Member
 
Location: Göttingen

Join Date: May 2012
Posts: 10
Default

Hi,

here are the PCA plots. I generated two. One with all files and one without the old samples (22-24h old embryos).


Code:
library(DESeq)
cds <- newCountDataSet( all_counts, conds_all )
cds <- estimateSizeFactors( cds )
cdsFullBlind = estimateDispersions( cds, method = "blind" )
vsdFull = varianceStabilizingTransformation( cdsFullBlind )
dists = dist( t( exprs(vsdFull) ) )

pdf(file="PCA_plot.pdf")
plotPCA(vsdFull, intgroup=c("condition"))
dev.off()
Attached Files
File Type: pdf PCA_plot.pdf (5.2 KB, 82 views)
File Type: pdf PCA_plot2.pdf (5.1 KB, 48 views)
gege_55 is offline   Reply With Quote
Old 02-16-2013, 11:55 PM   #6
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Default

Dear Gege

these look like interesting data, yet you as you can see from the PCA plots, there is a lot of intra-group variability in your data, as compared to the between-group variability. There is clearly signal (as some replicates of the same condition do cluster together), but it seems that the experiment is slightly underpowered, and that from a statistical point of view, to get straightforward conclusions, you need either more replicates or tighter control of the experimental conditions to make them more tighly clustered. I realise that from a biologist's point of view this may not be easy and that you probably want to salvage what you have as best as you can. Then I would start with specific (subsets of samples, e.g. various pairwise analyses) comparisons of interest, explore carefully the hit lists and the associated expression level heatmaps, and not take the p-values too literally.

Re RNAi: Are your data produced by using the same RNAi (pool) per gene? If not, that might explain some of the within-group variability, and focusing on those siRNAs (dsRNA, shRNA) that work best might help.

OTOH, it is extremely informative to consider multiple siRNAs per target gene, to better control efficiency and specificity of the reagents.

Best wishes
Wolfgang
__________________
Wolfgang Huber
EMBL
Wolfgang Huber is offline   Reply With Quote
Old 02-18-2013, 01:27 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Rereading your original question, I am now surprised that you reported that you got less significant genes from estimating the dispersion with the full data set than when doing so with using only one treated group at a time, together with the wildtype control. Do you really get less for all treatment groups?

What is usually happening is this: You are comparing the strength of the differences between two groups with the variation within the group. The dispersion is a way to quantify the latter. If you use all data to estimate the dispersion, you will get the average variability over all groups. If one group is particularly noisy, it will drive up the average and so reduce your power. Then, it is better to look at each sample pair separately, or at least, to exlude the bad group from the dispersion estimation with the full data. On the other hand, if all groups have the same amount of variability, estimating the dispersion from all data gives better power than doing so with each pair of groups separately, because in the latter case, DESeq is a bit over-conservative because it can be less sure that the estimate is good because it is based on so much fewer data.

In your case, you may have an annoying extra problem: Note how in your pCA plot, some sample groups are well separated while other blend into each other. Unfortunately, the wild type, whicb is involved in all your comparisons, seems to show the larget spread.
Simon Anders is offline   Reply With Quote
Old 02-18-2013, 08:32 AM   #8
gege_55
Member
 
Location: Göttingen

Join Date: May 2012
Posts: 10
Default

Many thanks to both of you for the input!

To your questions:
@Wolfgang
Yes, all replicates are from the same RNAi pool. We used 500-800 bp dsRNA of the target gene and injected in females. They pass the RNAi effect on to the offspring from which I took the RNA.
When I realized that it would have been a lot better to use two independent dsRNA's it was too late already.

@Simon:
It was the other way around. I got 2700 DE genes with the complete data set and only 300 with the subset. So your explanation that estimating the dispersion over all samples gives more power makes perfectly sense.
However, since 300 genes is already a lot to process I would be happy with the over-conservative behavior for the subsets.

I did the pairwise analysis as suggested by Wolfgang. I used a single treatment, the wild type control and the negative control for each of the plots. They look a lot better now.
Only one wild type library does not cluster with the other two. I think this is due to technical reasons. This specific library was sequenced on another day as technical triplicates to test inter-lane variation. The sequencing facility wanted to do that. I summed up the counts over all 3 and used them as a single replicate.
The same is true for one hedgehog, orthodenticle and torso library. The plots show that in those cases these special "technical triplicate sequencings" cluster, instead of the treatments.
Sorry I didn't mention that earlier.

Could you please have a look at the new PCA-plots? I would use the DE genes of all the single pairwise comparisons and compare them in a following analysis. Is this justifiable from a statistical point of view?

Thanks again for your time and effort. I really appreciate it.
Attached Files
File Type: pdf heatmaps.pdf (103.8 KB, 26 views)
File Type: pdf PCA_plots.pdf (46.2 KB, 22 views)
gege_55 is offline   Reply With Quote
Old 02-18-2013, 09:27 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

- So, "ebony" is the negative control? It surely has quite some effect overall, but this is expected: after all it is a transcription factor. I wonder how you want to incorporate it in the analysis.

- For the technical-triplicate samples: They mess things up, probably not because they are triplicates but rather because something else was done different in their library preparation. You should either exclude them or fit a GLM with a blocking factor for this batch.

- Is your wildtype no treatment at all, or treatment with some scrambled or unspecific siRNA? The latter would be better, of course.
Simon Anders is offline   Reply With Quote
Old 02-18-2013, 10:49 AM   #10
gege_55
Member
 
Location: Göttingen

Join Date: May 2012
Posts: 10
Default

Hi,

Drosophila (and beetle) ebony is a gene involved in pigmentation: beta-alanyl-dopamine synthase activity. It is an enzyme involved in a process which happens only later during development. It is expressed at very low levels at early stages and does not pass the independent filtering as explained in the DESeq vignette.
We choose ebony to get rid of any genes that might be involved in the RNAi-, stress-, infection- ... responses of the embryos. Maybe "negative control" is not the correct term for this treatment. Another reason was that we have an immediate read out for the RNAi effect, i.e. the adult beetles turn black.
Later I want to analyze only DE genes which do not show up in this treatment. Among DE genes I will focus on transcription factors and signaling genes. As developmental biologist I don't have the tools to analyze enzymes and other proteins.

Maybe I should explain that we do not use siRNA. siRNAs are toxic in the beetle. Instead we use very big (500 bp or longer) dsRNA fragments. These knock downs result in a transcript reduction of 85 to 90% of the targeted gene (at least in my case). Furthermore the RNAi effect is systemic.

Wildtype is no treatment at all. Now I wonder whether I should compare my treatments to the ebony samples from the beginning and leave the wildtype out of the analysis. At this stage I am not interested in RNAi response stuff anyways.

I will have a look at GLM fitting. Maybe I could also drop two of the technical triplicates to see what happens?
Unfortunately, the libraries were not generated and sequenced in the same run (day). So for sure there will be effects from that. A statistician explained to me that because of this a lot of small differences can accumulate over all the gene counts.This might explain why in the PCA-plots, sequencing days or library preps cluster, instead of treatments. I will add this meta data to a multifactor design and see if this is the case.
gege_55 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:11 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO