  • Determining Replicates for DESeq?

    Hello, I am relatively new to RNA-Seq data analysis so I apologize in advance if this is a novice question.
    I have been reading several forums on the subject and I think I have the general idea, but I would be happy to get advice from the experts.

    I have RNA-Seq data for multiple samples of a specific cancer. From earlier gene expression analysis (on Illumina platforms) that the lab ran before I arrived, we learned that this cancer can be divided into 3 subgroups (1, 2, 3). I would like to find the list of differentially expressed genes between the RNA-Seq samples belonging to subgroups 2 and 3, using DESeq.

    I am somewhat confused here as to what I should consider as my 'biological replicates', or whether I should not consider replicates at all for my analysis.

    For each of the subgroups, I only have one sample per patient. So the scenario looks like this:

    Subgroup 2: Sample A, Sample B, Sample C, Sample D

    Subgroup 3: Sample X, Sample Y, Sample Z, Sample F, Sample W

    In this case, should I consider that Samples A,B,C,D are all biological replicates of Subgroup 2, and Samples X,Y,Z,F,W as biological replicates of Subgroup 3? Each sample in a subgroup comes from a different patient, and there is no control sample for any patient (so there is no pairing between my samples).

    If I go with this scenario, is there any advice on the DESeq parameters? Right now I am just running the defaults as they appear in the vignette.

    The alternative is to assume that I don't have any replicates and simply compare the two groups. The DESeq call for estimating dispersions would then be:

    cds = estimateDispersions( cds, method="blind", sharingMode="fit-only", fitType="local" )

    I have tried both scenarios. In the case where I don't consider any replicates at all, I have ended up with a much larger number of differentially expressed genes at p<0.1 (1380 as opposed to 92).

    Any advice would be appreciated! Thank you in advance! Deena

  • #2
    Originally posted by SEQnovice
    In this case, should I consider that Samples A,B,C,D are all biological replicates of Subgroup 2, and Samples X,Y,Z,F,W as biological replicates of Subgroup 3?
    Short answer: Yes.

    Longer answer: I should probably write a more extensive answer on this, as you are not the first one to ask this question, discussing why the term "biological replicate" is actually quite an abuse of terminology, and one that has caused quite some confusion. I'll get to that.
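For readers following along, here is a minimal sketch of what "treat each patient as a replicate of their subgroup" might look like with the standard DESeq workflow from the vignette. The count table is simulated purely as a stand-in (the gene count, means and dispersions are made-up assumptions, not data from this thread), and the DESeq calls only run if the package is installed.

```r
# Simulated stand-in for a real count table: 1000 genes x 9 samples.
# All numbers here are illustrative assumptions, not data from the thread.
set.seed(1)
samples <- c("A", "B", "C", "D", "X", "Y", "Z", "F", "W")
nGenes  <- 1000
mu      <- 2^runif(nGenes, 4, 12)   # per-gene mean count
disp    <- 0.1 + 2 / mu             # per-gene dispersion
countTable <- matrix(
  rnbinom(nGenes * 9, mu = rep(mu, 9), size = rep(1 / disp, 9)),
  nrow = nGenes,
  dimnames = list(paste0("gene", seq_len(nGenes)), samples)
)

# One label per sample: every patient is one replicate of their subgroup.
condition <- factor(c(rep("subgroup2", 4), rep("subgroup3", 5)))

if (requireNamespace("DESeq", quietly = TRUE)) {
  library(DESeq)
  cds <- newCountDataSet(countTable, condition)
  cds <- estimateSizeFactors(cds)
  # Defaults: method="pooled", sharingMode="maximum", fitType="parametric"
  cds <- estimateDispersions(cds)
  res <- nbinomTest(cds, "subgroup2", "subgroup3")
  # Number of genes with adjusted p-value below 0.1. The thread quotes
  # "p<0.1" without saying whether raw or adjusted p-values were used.
  print(sum(res$padj < 0.1, na.rm = TRUE))
}
```

With a real experiment, `countTable` would be replaced by the actual matrix of raw read counts, with columns in the same order as the `condition` labels.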


    • #3
      I should add: For a comparison of cancer types, four and five samples are usually way too few, even more so if you don't have matched healthy tissue samples from the same patients, so I wouldn't be too optimistic about your results.

      Also, are you sure that you got _more_ hits with "blind" than with the standard workflow? It should be the other way round.


      • #4
        Hi Simon,
        Thanks for the very speedy answer! The actual groups are slightly larger (9 vs 6); I was just posing a generic question, but you are right that they are still quite few in number either way.

        I guess my main confusion is that in this case, the pooled samples are not truly biological replicates. In my scenario I would originally have considered biological replicates to be multiple cancer samples per patient within each subgroup, so Sample A1, A2, etc. I look forward to reading your explanation on this.

        And yes, I did get more hits with blind than with the standard workflow, which is why I started questioning the issue of replicates. I did have a look at the variance of the gene counts across the samples of each of my subgroups, and there doesn't seem to be a high degree of variation, with the exception of 8-9 outlier genes per subgroup.
        Might this explain why the number of differentially expressed genes is so low?

        I will run it again just to be sure and let you know.
        Thanks,
        Deena


        • #5
          sharingMode="fit-only" is extremely sensitive to outliers, which are turned into false positives. This is why we recommend avoiding it (except in blind mode, where it is unavoidable). So all the extra hits are probably false positives.

          The whole sharing-mode business is a bit of a hack, and replacing it with something better founded was one of the main motivations for developing DESeq2.
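To illustrate the outlier point, here is a rough base-R sketch of how the three sharingMode options choose which dispersion value to use per gene, as described in the DESeq documentation. The helper `dispersionUsed` and the numbers are hypothetical and only mimic the selection rule; this is not DESeq's actual code.

```r
# Hypothetical per-gene dispersion estimates; gene3 is an outlier whose
# counts vary far more between samples than the trend would predict.
perGene <- c(gene1 = 0.05, gene2 = 0.12, gene3 = 1.50)
# Hypothetical values of the fitted mean-dispersion trend for the same genes.
fitted  <- c(gene1 = 0.10, gene2 = 0.10, gene3 = 0.10)

# Illustrative helper (not part of DESeq): pick the dispersion per gene.
dispersionUsed <- function(perGene, fitted,
                           sharingMode = c("maximum", "fit-only", "gene-est-only")) {
  sharingMode <- match.arg(sharingMode)
  switch(sharingMode,
    "maximum"       = pmax(perGene, fitted),  # conservative: larger of the two
    "fit-only"      = fitted,                 # ignores the per-gene estimate
    "gene-est-only" = perGene                 # no sharing across genes
  )
}

print(dispersionUsed(perGene, fitted, "maximum"))
print(dispersionUsed(perGene, fitted, "fit-only"))
```

Under "fit-only", the outlier gene3 is tested as if its dispersion were 0.10 instead of 1.50, so its large sample-to-sample variability can look like a significant difference between groups. Under "maximum", its own high estimate is kept and the test stays conservative.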


          • #6
            Thanks, I will look into DESeq2.

            But just for the sake of completing this exercise, am I right that the following dispersion estimation is correct?

            cds = estimateDispersions( cds, method="blind", sharingMode="maximum", fitType="parametric" )

            Why wouldn't you use "pooled" or "per-condition" here for the method? Just thinking with regards to dealing with the outliers.

            Thanks,
            Deena
