  • DESeq2 finding differential expression changes with libraries of different sizes

    Hello;

    I am working with an RNA-Seq dataset with samples that have varying numbers of reads and am wondering how that will affect differential expression, and what is a generally acceptable difference between samples.

    Four groups of samples were barcoded and run on a single flow cell. Across the entire dataset, the difference between the largest and smallest sample read counts is about 5-fold (with size factors ranging from 0.34 to 3.2). Within each group the number of reads is similar (for the most part), but differences exist between the groups that we intend to compare. I'll use our first comparison as an example: in Group1 the samples have ~4 million reads each, whereas in Group2 they have >7 million reads each. The total number of genes detected also differs between the two groups.

    To assess expression changes I used DESeq2, but I am wondering whether normalizing with size factors is enough to account for this. Suppose GeneA was not detected in Group1 as a consequence of the small number of reads, but is lowly expressed in Group2. This gene would be identified as DE, although we don't know whether that is actually the case.
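To make the GeneA concern concrete, here is a minimal sketch of median-of-ratios size-factor normalization, the scheme DESeq2 uses for its size factors. This is a plain-Python illustration, not DESeq2's code; the gene names and counts are made up, and `size_factors` is a hypothetical helper. It shows that scaling by a size factor cannot turn a zero count into anything other than zero.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one per sample.
    Only genes with a nonzero count in every sample contribute,
    because the geometric mean of a row containing a zero is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue  # skip genes that are undetected in any sample
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene is detected in every sample")
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical counts: samples 1-2 are sequenced at half the depth of samples 3-4.
counts = {
    "geneA": [0, 0, 2, 3],          # low expressor, missed in the shallow samples
    "geneB": [100, 110, 200, 220],
    "geneC": [50, 55, 100, 110],
    "geneD": [10, 11, 20, 22],
}

sf = size_factors(counts)
normalized = {g: [c / s for c, s in zip(row, sf)] for g, row in counts.items()}

for gene, row in normalized.items():
    print(gene, [round(x, 1) for x in row])
```

After normalization geneB, geneC and geneD line up across samples, but geneA is still exactly zero in the two shallow samples, so the size factors alone do not tell you whether it was absent or just missed.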

  • #2
    In a case such as that, the first thing I would do is include only the subset of genes that you actually detected in all samples in the comparison. As you indicated, you cannot argue that a failure to detect equates to the absence of expression, so you really should not even be considering such genes in your comparison.

    For my own analyses, the first thing I do after mapping reads is derive the subset of features that actually have a raw count greater than zero in all my samples. I only analyze that feature set for differential expression.
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.
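The filter described in the post above (keep only features with a raw count greater than zero in every sample) can be sketched as follows. The data layout, gene names and the helper name `detected_in_all` are assumptions for illustration; in practice the counts would come from your own count matrix.

```python
def detected_in_all(counts):
    """Keep genes whose raw count is > 0 in every sample.

    counts: dict mapping gene -> list of raw counts, one per sample.
    """
    return {gene: row for gene, row in counts.items() if all(c > 0 for c in row)}

# Hypothetical raw counts for four samples.
raw_counts = {
    "geneA": [0, 0, 2, 3],      # undetected in two samples -> dropped
    "geneB": [100, 110, 200, 220],
    "geneC": [5, 0, 9, 12],     # undetected in one sample -> dropped
    "geneD": [10, 11, 20, 22],
}

kept = detected_in_all(raw_counts)
print(sorted(kept))
print(f"kept {len(kept)} of {len(raw_counts)} genes")
```

Restricting `counts` to the samples in a single comparison before filtering would keep genes that are only missing in groups you are not contrasting.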


    • #3
      Removing features which have zero count in more than one sample will leave me with a very small subset. DESeq2 already applies filters that remove genes with all-zero counts as well as rows containing extreme count outliers, which reduces the feature set to less than half. Although I suppose that is one way to ensure that a lack of reads is not the reason for the observed change.


      • #4
        Have a look at a PCA plot and/or hierarchical clustering plot and see if the difference in library size is causing one or more samples to be obvious outliers. I've not seen that happen for ~5x size differences, but I certainly have for >=10x, and I wouldn't rule it out in any case.
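As a rough, plot-free version of that check, the sketch below computes the sample-to-sample distances that a hierarchical clustering would start from and reports each sample's nearest neighbour. Everything here is an assumption for illustration: the counts are made up, and log2 counts-per-million stands in for whatever transformation you would actually apply before clustering. If a sample's nearest neighbour sits in the wrong group, or the sample is far from everything, that is worth a closer look.

```python
import math

def log_cpm(counts):
    """Return one vector of log2(CPM + 1) values per sample.

    counts: dict mapping gene -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return [
        [math.log2(row[j] / lib_sizes[j] * 1e6 + 1) for row in counts.values()]
        for j in range(n_samples)
    ]

def distance_matrix(vectors):
    """Pairwise Euclidean distances between sample vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]

# Hypothetical counts: s1, s2 = Group1 (shallower); s3, s4 = Group2 (deeper).
samples = ["s1", "s2", "s3", "s4"]
counts = {
    "g1": [100, 105, 200, 210],
    "g2": [50, 48, 100, 96],
    "g3": [10, 12, 80, 84],   # higher in Group2
    "g4": [40, 42, 20, 22],   # lower in Group2
}

dist = distance_matrix(log_cpm(counts))

# For each sample, find the closest other sample.
nearest = []
for i, row in enumerate(dist):
    others = [j for j in range(len(row)) if j != i]
    nearest.append(min(others, key=lambda j: row[j]))

for i, j in enumerate(nearest):
    print(f"{samples[i]} is closest to {samples[j]} (distance {dist[i][j]:.2f})")
```

In this toy example each sample pairs with its own group-mate, which is what you would hope to see when library size is not driving the structure.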


        • #5
          Originally posted by mistrm
          Removing features which have zero count in more than one sample will leave me with a very small subset. DESeq2 already applies filters that remove genes with all-zero counts as well as rows containing extreme count outliers, which reduces the feature set to less than half. Although I suppose that is one way to ensure that a lack of reads is not the reason for the observed change.
          Not to sound harsh, but to my mind, it is immaterial how much it reduces your feature set. The reality is that including DE calls for genes where one of the references is to a sample for which you actually have no data (failure to detect) is simply not valid. If you ran two qPCR reactions, and one worked giving valid data and the other did not and thus gave no data, would you include that gene in your results? Any genes you want to talk about as differentially expressed, you need to have an actual measure of expression for each sample in the comparison. There is a certain stochasticity in detection of low expressors, as those are inherently the rarer transcripts in your sample, so not having even detected anything in one sample makes any statement about differential expression relative to another highly suspect.

          If dpryan's suggestion doesn't reveal any obviously aberrant samples, and you need a larger feature set, then you should either add more replicates or more reads per sample. Do you still have any material left that you could sequence further to increase read depth?
          Last edited by mbblack; 08-20-2014, 04:35 AM.
          Michael Black, Ph.D.
          ScitoVation LLC. RTP, N.C.


          • #6
            Originally posted by mbblack
            Not to sound harsh, but to my mind, it is immaterial how much it reduces your feature set.
            I couldn't agree more. The name of the game is not creating undue extra work and headaches for yourself.


            • #7
              Agree with you both. Though (just for discussion purposes), instead of removing features that have a zero count in any sample across both groups, wouldn't it make sense to remove only features that have a zero count in Group1 (the group with lower-depth samples)? For Group2, if there is a zero count, there are enough reads to more reliably conclude that the features are low expressors as opposed to failures to detect. In particular, if there is increased expression of these low expressors in Group1, we would want to capture those changes.

              There is still material left, and we will likely sequence further as it seems the best solution. Thanks for all the help!


              • #8
                Not to my mind. You cannot say anything about differential expression based on the absence of data, regardless of what you see in the other sample. Nor can you, to my mind, say that an absence of data, at any read depth, is equal to an absence of expression. There is simply far too much variability in low expressor detection to say that, regardless of read depth. Again, an absence of count data cannot be taken as an absence of a transcript nor absence of expression of that transcript.

                Typically, as you increase read depth, you see an ever-increasing accumulation of counts for transcripts already detected. Your probability of detecting very low expressors does not change all that much, and there will always be a low but persistent probability of detecting novel transcripts relative to higher-count features, even at read depths of hundreds of millions of reads per sample.

                You say "if there is increased expression of these low expressors in Group1" but how can you say anything about relative expression (increased or decreased) if you do not have any actual data for that transcript in Group 2? All you know is you saw it in Group 1 and did not see it in Group 2, but you have no conclusive information about just why you did not see it in Group 2 (was it truly not expressed, or was it expressed and just missed due to the inherent vagaries of detection in every RNA seq experiment?).

                The only valid contrasts you can make are between samples/groups for which you actually have data in both. For those where you have no data in one group, all you can say is you detected gene "x" in one, and did not detect it in the other - that's it. To infer anything else about the relative relationship of the two groups is pure speculation, and one for which you do not have supportive data since you have no data at all for one group.

                If your goal is to truly demonstrate the absence of expression in one group, then RNA-seq was never the appropriate experiment to use in the first place.
                Last edited by mbblack; 08-20-2014, 08:07 AM.
                Michael Black, Ph.D.
                ScitoVation LLC. RTP, N.C.
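One way to put rough numbers on the point that a zero count is not the same as no expression is a simple Poisson sampling model. This model is an assumption, not something from the thread: it treats the count for a transcript as Poisson with mean equal to library depth times the transcript's share of reads, and the share used below is hypothetical. Real RNA-seq counts are overdispersed, so zeros are, if anything, more common than this suggests.

```python
import math

def prob_not_detected(depth, fraction):
    """P(count == 0) for a Poisson count with mean depth * fraction."""
    return math.exp(-depth * fraction)

# Hypothetical low expressor: one read in ten million comes from this transcript.
fraction = 1e-7
replicates = 3  # assumed number of samples per group

for label, depth in (("Group1", 4e6), ("Group2", 7e6)):
    p_zero = prob_not_detected(depth, fraction)
    p_all_zero = p_zero ** replicates  # every replicate misses it
    print(
        f"{label}: depth {depth:.0e}, expected reads {depth * fraction:.1f}, "
        f"P(zero in one sample) = {p_zero:.2f}, "
        f"P(zero in all {replicates}) = {p_all_zero:.2f}"
    )
```

Under these assumptions a transcript that really is expressed at this level is missed in a single sample roughly two times in three at 4 million reads and about half the time at 7 million, so a zero at either depth says little about whether the transcript is truly absent.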
