Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq: Number of DE genes

    Hi,

    I have an RNAseq experiment with 9 different treatments as triplicates. My problem is that if I load all 9 treatments into DESeq I get 2695 DE genes for a given treatment. If I am loading only the libraries of that treatment and the control I get 314 DE genes. EstimateDispersions sharingMode was maximum.
    Later in the analysis I want to build intersects and compare the genes between the different treatments.
    Since I am not familiar with the dispersion concept I want to ask what the "correct" way to do the analysis is? Should I use all the data in one analysis or do each treatment separately?


    Thank you!

  • #2
    You'll need to give more information. Specifically, you would need to mention how you're study was designed, i.e. there are nested factors or just a bunch of different unrelated treatments or what now. I'm guessing that you have some sort of 3x3 design, but that's just a guess. Also, I assume that your triplicates are biological rather than technical replicates, so mention if that's not the case.

    Comment


    • #3
      Sorry for that.
      The treatments are RNAi knock downs of different genes, including receptors, transcription factors and signaling ligands. RNA was extracted from ~50 whole embryos (beetles) per replicate. The replicates are biological.

      Treatments:
      Wnt-pathway: arrow (receptor), frz1/2 (receptors) and wntless (secretion of wnt ligands)
      Hedgehog-pathway: hedgehog (ligand) and cubitus interruptus (transcription factor)

      Other treatments: 2 transcription factors not directly involved in the pathways and an enzyme involved in pigmentation at later stages as a negative control.

      All treatments are compared to wild type RNA at the same stage.
      For one treatment (arrow) we also looked at a later stage.

      Summary: 8 treatments vs wildtype (both 10-11h old embryos)
      1 treatment vs wildtype (both 22- 24h old embryos)

      I assume that all treatments are nested, except the pigmentation enzyme.
      To get good candidates I want to use the intersects of DE genes in the wnt- and hedgehog treatments.

      The other two transcription factors delete anterior or posterior structures. With those maybe I can discriminate between anterior and posterior target genes, but that is a long shot.

      Please let me know if you need further informations!

      Thank you!

      Comment


      • #4
        As a first try, generate a PCA plot as described in the DESeq vignette, and post it here.

        Comment


        • #5
          Hi,

          here are the PCA plots. I generated two. One with all files and one without the old samples (22-24h old embryos).


          Code:
          library(DESeq)
          cds <- newCountDataSet( all_counts, conds_all )
          cds <- estimateSizeFactors( cds )
          cdsFullBlind = estimateDispersions( cds, method = "blind" )
          vsdFull = varianceStabilizingTransformation( cdsFullBlind )
          dists = dist( t( exprs(vsdFull) ) )
          
          pdf(file="PCA_plot.pdf")
          plotPCA(vsdFull, intgroup=c("condition"))
          dev.off()
          Attached Files

          Comment


          • #6
            Dear Gege

            these look like interesting data, yet you as you can see from the PCA plots, there is a lot of intra-group variability in your data, as compared to the between-group variability. There is clearly signal (as some replicates of the same condition do cluster together), but it seems that the experiment is slightly underpowered, and that from a statistical point of view, to get straightforward conclusions, you need either more replicates or tighter control of the experimental conditions to make them more tighly clustered. I realise that from a biologist's point of view this may not be easy and that you probably want to salvage what you have as best as you can. Then I would start with specific (subsets of samples, e.g. various pairwise analyses) comparisons of interest, explore carefully the hit lists and the associated expression level heatmaps, and not take the p-values too literally.

            Re RNAi: Are your data produced by using the same RNAi (pool) per gene? If not, that might explain some of the within-group variability, and focusing on those siRNAs (dsRNA, shRNA) that work best might help.

            OTOH, it is extremely informative to consider multiple siRNAs per target gene, to better control efficiency and specificity of the reagents.

            Best wishes
            Wolfgang
            Wolfgang Huber
            EMBL

            Comment


            • #7
              Rereading your original question, I am now surprised that you reported that you got less significant genes from estimating the dispersion with the full data set than when doing so with using only one treated group at a time, together with the wildtype control. Do you really get less for all treatment groups?

              What is usually happening is this: You are comparing the strength of the differences between two groups with the variation within the group. The dispersion is a way to quantify the latter. If you use all data to estimate the dispersion, you will get the average variability over all groups. If one group is particularly noisy, it will drive up the average and so reduce your power. Then, it is better to look at each sample pair separately, or at least, to exlude the bad group from the dispersion estimation with the full data. On the other hand, if all groups have the same amount of variability, estimating the dispersion from all data gives better power than doing so with each pair of groups separately, because in the latter case, DESeq is a bit over-conservative because it can be less sure that the estimate is good because it is based on so much fewer data.

              In your case, you may have an annoying extra problem: Note how in your pCA plot, some sample groups are well separated while other blend into each other. Unfortunately, the wild type, whicb is involved in all your comparisons, seems to show the larget spread.

              Comment


              • #8
                Many thanks to both of you for the input!

                To your questions:
                @Wolfgang
                Yes, all replicates are from the same RNAi pool. We used 500-800 bp dsRNA of the target gene and injected in females. They pass the RNAi effect on to the offspring from which I took the RNA.
                When I realized that it would have been a lot better to use two independent dsRNA's it was too late already.

                @Simon:
                It was the other way around. I got 2700 DE genes with the complete data set and only 300 with the subset. So your explanation that estimating the dispersion over all samples gives more power makes perfectly sense.
                However, since 300 genes is already a lot to process I would be happy with the over-conservative behavior for the subsets.

                I did the pairwise analysis as suggested by Wolfgang. I used a single treatment, the wild type control and the negative control for each of the plots. They look a lot better now.
                Only one wild type library does not cluster with the other two. I think this is due to technical reasons. This specific library was sequenced on another day as technical triplicates to test inter-lane variation. The sequencing facility wanted to do that. I summed up the counts over all 3 and used them as a single replicate.
                The same is true for one hedgehog, orthodenticle and torso library. The plots show that in those cases these special "technical triplicate sequencings" cluster, instead of the treatments.
                Sorry I didn't mention that earlier.

                Could you please have a look at the new PCA-plots? I would use the DE genes of all the single pairwise comparisons and compare them in a following analysis. Is this justifiable from a statistical point of view?

                Thanks again for your time and effort. I really appreciate it.
                Attached Files

                Comment


                • #9
                  - So, "ebony" is the negative control? It surely has quite some effect overall, but this is expected: after all it is a transcription factor. I wonder how you want to incorporate it in the analysis.

                  - For the technical-triplicate samples: They mess things up, probably not because they are triplicates but rather because something else was done different in their library preparation. You should either exclude them or fit a GLM with a blocking factor for this batch.

                  - Is your wildtype no treatment at all, or treatment with some scrambled or unspecific siRNA? The latter would be better, of course.

                  Comment


                  • #10
                    Hi,

                    Drosophila (and beetle) ebony is a gene involved in pigmentation: beta-alanyl-dopamine synthase activity. It is an enzyme involved in a process which happens only later during development. It is expressed at very low levels at early stages and does not pass the independent filtering as explained in the DESeq vignette.
                    We choose ebony to get rid of any genes that might be involved in the RNAi-, stress-, infection- ... responses of the embryos. Maybe "negative control" is not the correct term for this treatment. Another reason was that we have an immediate read out for the RNAi effect, i.e. the adult beetles turn black.
                    Later I want to analyze only DE genes which do not show up in this treatment. Among DE genes I will focus on transcription factors and signaling genes. As developmental biologist I don't have the tools to analyze enzymes and other proteins.

                    Maybe I should explain that we do not use siRNA. siRNAs are toxic in the beetle. Instead we use very big (500 bp or longer) dsRNA fragments. These knock downs result in a transcript reduction of 85 to 90% of the targeted gene (at least in my case). Furthermore the RNAi effect is systemic.

                    Wildtype is no treatment at all. Now I wonder whether I should compare my treatments to the ebony samples from the beginning and leave the wildtype out of the analysis. At this stage I am not interested in RNAi response stuff anyways.

                    I will have a look at GLM fitting. Maybe I could also drop two of the technical triplicates to see what happens?
                    Unfortunately, the libraries were not generated and sequenced in the same run (day). So for sure there will be effects from that. A statistician explained to me that because of this a lot of small differences can accumulate over all the gene counts.This might explain why in the PCA-plots, sequencing days or library preps cluster, instead of treatments. I will add this meta data to a multifactor design and see if this is the case.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 11:49 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-24-2024, 08:47 AM
                    0 responses
                    16 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    62 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X