  • Problem: DESeq2 analysis with very unbalanced design

    Hello,

    I have a question about my DESeq2 analysis.
    I need to compare miRNA expression across different disease variants, but I have a problem because the design is not balanced. This is my data:
    Disease variant 1: 5 replicates
    Disease variant 2: 26 replicates
    Disease variant 3: 5 replicates
    Disease variant 4: 8 replicates
    Disease variants 1, 3 and 4 have fewer patients than variant 2 because they are rare variants of the disease, so their frequency in the population is very low (it is difficult to find replicates!).
    Can DESeq2 work well with such unbalanced samples?

    PS. Sorry if there are errors in the text, but I don't speak English very well
    Thank you very much in advance,
    Fischer

  • #2
    Unbalanced designs aren't a problem, you just have lower power with the variants containing fewer samples.
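
A rough way to see why smaller groups mean lower power: for a simple two-group comparison of means with equal variance, the standard error of the difference scales with sqrt(1/n1 + 1/n2). The sketch below applies that textbook approximation to the group sizes from the first post. It is only an intuition aid, not what DESeq2 computes internally (DESeq2 fits a negative binomial GLM), and the helper name `se_factor` is made up for illustration.

```r
# Group sizes taken from the first post.
n <- c(variant1 = 5, variant2 = 26, variant3 = 5, variant4 = 8)

# Hypothetical helper: relative standard error of a difference in means
# between two groups of size n1 and n2 (equal-variance approximation).
se_factor <- function(n1, n2) sqrt(1 / n1 + 1 / n2)

# Evaluate every pairwise comparison between variants.
pairs <- combn(names(n), 2)
se <- apply(pairs, 2, function(p) se_factor(n[[p[1]]], n[[p[2]]]))
names(se) <- apply(pairs, 2, paste, collapse = "_vs_")

# Smaller values mean a more precise comparison.
print(round(sort(se), 3))
```

Comparisons that involve variant 2 come out with smaller values than comparisons between two rare variants, which matches the statement above: the analysis is still valid, but contrasts among the small groups are less precise.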


    • #3
      Thank you for the reply!
      So do you think an analysis with DESeq2 is appropriate in my case?
      In practice I would only have lower accuracy in the results for these variants, right?


      • #4
        Sure, I'd still use DESeq2 if this were my dataset.


        • #5
          Thank you again for the reply! I have another question, if you can help me again.
          Because we had a problem in our lab, some of the samples (15%) were extracted with the HiSeq, while the rest were extracted with the MiSeq. As a result, the initial frequencies of miRNAs in the samples differ because two different instruments were used (the HiSeq frequencies are higher). Could this be a problem for the data analysis, or does DESeq2 solve it through normalization?


          • #6
            By "extracted" I assume you mean "sequenced". Were the HiSeq and MiSeq libraries prepared at the same time? If everything was prepared at the same time and with the same procedure and just sequenced on different machines then the library size normalization will take care of things. If not, then you should add a batch nuisance variable into your model.
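
To illustrate what library size normalization does with samples of very different depth, here is a minimal base R sketch of the median-of-ratios idea that DESeq2 uses for its size factors. The function name `size_factors` and the toy counts are assumptions made up for illustration; in a real analysis you would rely on DESeq2 itself rather than this re-implementation.

```r
# Sketch of median-of-ratios size factors (simplified, base R only).
size_factors <- function(counts) {
  log_counts <- log(counts)
  # Per-gene log geometric mean across samples.
  log_geo_means <- rowMeans(log_counts)
  # Genes with a zero count in any sample give -Inf and are skipped.
  keep <- is.finite(log_geo_means)
  apply(log_counts[keep, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_means[keep]))
  })
}

# Toy data: the "hiseq" sample has the same composition at twice the depth.
counts <- cbind(miseq = c(10, 20, 30, 40),
                hiseq = c(20, 40, 60, 80))

sf <- size_factors(counts)
normalized <- sweep(counts, 2, sf, "/")
print(sf)
print(normalized)
```

In this toy case the two samples become identical after dividing by their size factors, which is the situation where a pure depth difference between machines is harmless. A batch effect that changes the composition of the library would not be removed this way.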


            • #7
              Originally posted by dpryan:
              "Were the HiSeq and MiSeq libraries prepared at the same time?"
              Yes, they were prepared at the same time and with the same kit.

              Originally posted by dpryan:
              "If everything was prepared at the same time and with the same procedure and just sequenced on different machines then the library size normalization will take care of things. If not, then you should add a batch nuisance variable into your model."
              The only difference is the sequencing machine. We used both the HiSeq and the MiSeq, so some samples have a higher number of reads than others.


              • #8
                OK, in theory that should be OK. In practice, though, it's good to make a PCA plot and then see if samples start clustering by machine. If that's the case then you have a notable machine effect and can just add a variable to your model. Alternatively, you could see if svaseq finds a meaningful batch effect worthy of compensation.


                • #9
                  OK, I created a new variable that identifies HiSeq/MiSeq and refit the model with these commands ("categories" is the "disease variants" variable, "machine" is the new variable):

                  library(DESeq2)

                  # Sample table; its row names must match the column names of countTable
                  colData <- data.frame(condition=factor(categories), mach=factor(machine), row.names=colnames(countTable))

                  # Model the disease variant plus the sequencing machine as a nuisance variable
                  cds <- DESeqDataSetFromMatrix(countData=countTable, colData=colData, design=~condition+mach)

                  dds <- DESeq(cds)

                  Is the model correct?
                  Then I made a PCA plot:

                  rld <- rlog(dds)
                  plotPCA(rld, intgroup=c("mach"))

                  This is the result:

                  [Attached file: PCA plot grouped by "mach"]
                  Last edited by Fischer; 09-15-2015, 05:54 AM.


                  • #10
                    I guess there is a batch effect (glad I suggested you check!). You might also figure out what's going on with those 2 samples leading to PC1.


                    • #11
                      Thank you so much for your suggestion!
                      Because these two samples behave strangely, do you think I can remove them from the analysis? It wouldn't be a problem for the design because they are "disease variant 2" samples.

                      These are the results without these two samples, with the model design:
                      ~variant+mach

                      [Attached file: PCA plot without the two samples]
                      Last edited by Fischer; 09-16-2015, 12:05 AM.


                      • #12
                        You should try to see if there's a good reason why they're doing that first (not to mention also doing some hierarchical clustering). In general, though, I would say that those samples are good candidates for exclusion if they can't be otherwise explained (e.g., due to having much lower coverage).
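
A minimal base R sketch of the two checks mentioned above, library size (coverage) per sample and hierarchical clustering of samples. The count matrix is invented purely for illustration, and log2(counts + 1) is used as a simple stand-in for a proper transformation such as the rlog values already computed earlier in the thread.

```r
# Toy count matrix: 6 genes x 5 samples (all values made up for illustration).
counts <- cbind(
  s1 = c(100, 200, 50, 400, 30, 80),
  s2 = c(110, 190, 55, 380, 28, 85),
  s3 = c(95, 210, 48, 410, 33, 78),
  s4 = c(10, 20, 900, 5, 700, 8),   # deliberately aberrant profile
  s5 = c(12, 18, 950, 6, 650, 9)    # deliberately aberrant profile
)

# Check 1: total reads per sample, to spot samples with unusually low coverage.
lib_sizes <- colSums(counts)
print(lib_sizes)

# Check 2: hierarchical clustering of samples on log-transformed counts.
log_counts <- log2(counts + 1)
hc <- hclust(dist(t(log_counts)), method = "average")

# Cut the tree into two groups and see which samples separate from the rest.
clusters <- cutree(hc, k = 2)
print(clusters)
```

With real data, `plot(hc)` shows the dendrogram. If the suspicious samples split off on their own and also have much lower library sizes, that would be an explanation of the kind described above.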


                        • #13
                          Hi DESeq2 experts,

                          I have a closely related question. My group design is as follows:
                          Control
                          A, n=4
                          B, n=8

                          KO
                          C, n=4
                          D, n=12

                          Groups A,C are untreated, B,D treated.
                          So far so good, I used DESeq2 to compare AvsB and CvsD and now I am looking at the differences of these comparisons (rather than directly comparing BvsD, which I am also doing, but that's not the question here).
                          As you can imagine, I get more DE genes in CvsD, as D has 50% more samples than B, while A and C have the same number of samples. But it's a lot more (AvsB: ~1000; CvsD: ~2500, so 2.5x more, using the same FDR/log2FC cutoffs of course).

                          So my question is: Is my "meta-comparison", i.e. looking at what is different in both comparisons actually valid? And is the 2.5-fold difference in DE genes more likely to be a result of group D having higher n (so CvsD has more power than AvsB) or could it also be due to experimental condition, which would be great as that would be biologically meaningful (which was of course the hypothesis)?

                          To be more precise: in my CvsD comparison I get a highly interesting group of genes, so good enrichment of this pathway, while in my AvsB comparison I don't get any of those - and now I'm afraid that this might be due to design rather than biology!

                          Any suggestions would be much appreciated.


                          • #14
                            Comparing lists that were made based on p-value and fold-change thresholds is the path of last resort. Your design lends itself nicely to a factorial treatment and those are the questions that likely make the most biological sense...so just do that instead.
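
As a sketch of what a factorial treatment of this design could look like, the base R snippet below builds the sample table from the group sizes in post #13 and shows the model matrix for a genotype-by-treatment design. The column names `genotype` and `treatment` are assumptions chosen for illustration. In DESeq2 the same formula would be given as the `design`, and the interaction coefficient asks whether the treatment effect differs between KO and Control, which is the question the list comparison was trying to answer indirectly.

```r
# Groups from post #13: A = Control untreated (4), B = Control treated (8),
# C = KO untreated (4), D = KO treated (12).
genotype <- factor(rep(c("Control", "KO"), times = c(12, 16)),
                   levels = c("Control", "KO"))
treatment <- factor(c(rep("untreated", 4), rep("treated", 8),
                      rep("untreated", 4), rep("treated", 12)),
                    levels = c("untreated", "treated"))
coldata <- data.frame(genotype, treatment)

# Confirm that the four cells match the stated group sizes.
print(table(coldata))

# Factorial design with main effects plus an interaction term.
mm <- model.matrix(~ genotype + treatment + genotype:treatment, data = coldata)
print(colnames(mm))
```

The last column of the model matrix is the interaction term. Testing that single coefficient uses all 28 samples in one model, so the result does not depend on two separate gene lists obtained with different power.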


                            • #15
                              Thanks @dpryan!

                              You are probably right. I'll have a look at factorial design then.
