  • clustering RNA-Seq data

    1) Are "standard" clustering algorithms appropriate for RNA-Seq data, given its discrete nature?
    2) What sort of normalization (if any) should be done to RNA-Seq data before clustering?
    3) Does the lower variance of long genes need to be accounted for when using the output of clustering RNA-Seq data, e.g. when doing gene set enrichment analysis on the genes in a cluster?

    Thanks,
    Julie

  • #2
    Dear Julie,

    Have a look at this thread: http://seqanswers.com/forums/showthread.php?t=19007

    best wishes
    Wolfgang
    Wolfgang Huber
    EMBL

    • #3
      Thanks. After looking at that thread, I would assume that the standard clustering algorithms typically used for microarray data are suitable for RNA-Seq data.

      I'm still unclear about what sort of normalization (if any) is necessary prior to clustering. In particular, I'm concerned about length normalization. I know that RNA-Seq data has the bias that longer genes tend to be called differentially expressed more often, due to an increase in statistical power. Is the issue here that longer genes --> more reads --> lower variance --> higher power to detect differences? I am wondering whether this difference in variance levels between long and short genes would have an effect on the results of clustering.

      • #4
        Dear Julie,

        Are you sure you have read the thread Wolfgang recommended to the end? The whole point is that you cannot apply standard approaches that assume homoskedasticity to RNA-Seq data, which is heteroskedastic. The point of the variance stabilizing transformation offered by our DESeq package is to rectify this. This transformation implicitly takes care of the length issue, too. See also this post on this topic:
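
As an aside to the heteroskedasticity argument in the post above, here is a minimal Python sketch. It is not from the thread and it is not the transformation that DESeq implements. It assumes pure Poisson counts, for which the Anscombe transform 2*sqrt(x + 3/8) is a classical variance-stabilizing transformation; real RNA-Seq counts are usually overdispersed, so a transformation for them would have to be derived from a fitted mean-variance relationship. The function names, the three means, and the sample size are illustrative choices.

```python
import math
import random
import statistics


def poisson_sample(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for the modest means used here; exp(-mean) would underflow
    for means above roughly 700.
    """
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def anscombe(count):
    """Anscombe transform: approximately unit variance for Poisson data."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def variance_before_and_after(mean, n=2000, seed=0):
    """Sample variance of simulated counts, raw and after the transform."""
    rng = random.Random(seed)
    counts = [poisson_sample(rng, mean) for _ in range(n)]
    raw_var = statistics.variance(counts)
    stabilized_var = statistics.variance([anscombe(c) for c in counts])
    return raw_var, stabilized_var


# Hypothetical expression levels: low, medium, and high expected counts.
for mean in (5, 50, 500):
    raw, stab = variance_before_and_after(mean)
    print(f"mean={mean:>3}  raw variance={raw:8.1f}  transformed variance={stab:.2f}")
```

On the raw scale the variance grows with the mean, so a distance-based clustering would be dominated by the most highly expressed genes. After the transform the variance is roughly constant across the three levels, which is the property that methods assuming homoskedasticity rely on.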

        • #5
          I find 'length' is a red herring. The real issue is power (of a test) or precision (of an estimate), and that depends on the number of counts. The number of counts varies by roughly 6 orders of magnitude (10^6) between genes, whereas gene length varies much less, which already tells you that other, more important factors are at play.
          Wolfgang Huber
          EMBL
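
The dependence of precision on counts can be written out under a simplifying assumption that is not stated in the post: that the count for a gene carries only Poisson sampling noise. The symbols below (K for the count, \mu for its expected value, \alpha for a negative binomial dispersion) are introduced here for illustration.

```latex
% K: read count for one gene in one sample; \mu: its expected value.
% Poisson sampling noise only:
\operatorname{Var}(K) = \mu
\quad\Longrightarrow\quad
\mathrm{CV}(K) = \frac{\sqrt{\operatorname{Var}(K)}}{\mu} = \frac{1}{\sqrt{\mu}}

% Log fold change between two conditions (delta method):
\operatorname{Var}\!\left(\log K_1 - \log K_2\right) \approx \frac{1}{\mu_1} + \frac{1}{\mu_2}

% With overdispersion \alpha (negative binomial):
\operatorname{Var}(K) = \mu + \alpha\mu^2
\quad\Longrightarrow\quad
\mathrm{CV}(K)^2 = \frac{1}{\mu} + \alpha
```

Under the Poisson assumption, a 10^6-fold range in expected counts corresponds to a 10^3-fold range in relative precision, whatever combination of expression level, gene length, and sequencing depth produced those counts. With overdispersion, the relative precision stops improving once 1/\mu becomes small compared with \alpha.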

          • #6
            power depends on counts

            I recently found a near-perfect logarithmic correlation between the sum of counts over samples and the power to detect significantly differentially expressed (DE) genes. This was more pronounced for SAMseq, which uses permutations for null-hypothesis generation and testing, than for other methods that use parametric approaches.
            I think this behavior is due to the digital nature of the data.

            See figure 1.
            The x-axis shows the sum of counts over 24 samples (design: matched pairs, 12 vs 12), binned on a logarithmic scale (10 means 0-10 counts; 20 means 11-20 counts). The number of genes in each bin is given above the x-axis.

            The y-axis shows the percentage of DE genes (FDR 5%) in each sum-count bin.
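
The binning behind such a figure can be sketched as follows. This is a minimal Python illustration, not the poster's code: the function name, the bin edges beyond 10 and 20, and the toy data are all made up, and the DE calls are assumed to come from whatever method was run at FDR 5%.

```python
from bisect import bisect_left


def percent_de_by_count_bin(sum_counts, is_de, upper_edges):
    """Group genes by summed count and report the share called DE per bin.

    upper_edges must be sorted; each bin is labeled by its inclusive upper
    edge, so with edges [10, 20, ...] the bin "10" holds sums 0-10 and the
    bin "20" holds sums 11-20. Genes above the last edge are skipped.
    Returns {upper_edge: (n_genes, percent_de)}; percent_de is None for
    empty bins.
    """
    n_genes = [0] * len(upper_edges)
    n_de = [0] * len(upper_edges)
    for count, called in zip(sum_counts, is_de):
        idx = bisect_left(upper_edges, count)
        if idx == len(upper_edges):
            continue  # summed count exceeds the last bin edge
        n_genes[idx] += 1
        n_de[idx] += 1 if called else 0
    return {
        edge: (n, 100.0 * d / n if n else None)
        for edge, n, d in zip(upper_edges, n_genes, n_de)
    }


# Toy data (invented): summed counts over all samples and DE calls per gene.
toy_sum_counts = [3, 8, 15, 18, 40, 45, 50, 900]
toy_is_de = [False, False, False, True, True, False, True, True]
toy_edges = [10, 20, 50, 1000]  # hypothetical, roughly logarithmic edges

for edge, (n, pct) in percent_de_by_count_bin(
    toy_sum_counts, toy_is_de, toy_edges
).items():
    print(f"bin <= {edge}: {n} genes, percent DE = {pct}")
```

Plotting percent DE against the bin label then gives the kind of curve described above, with the number of genes per bin available for annotation.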

            In my opinion, this clearly shows a severe problem in interpreting DE gene lists from RNA-Seq.

            1) Higher expressed genes (and, less importantly, longer genes) are more likely to get called DE.
            2) Sequencing to very, very high depths will eventually call all genes DE (which is presumably true, but uninterpretable).
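
Both points can be illustrated with a rough power calculation. The Python sketch below is an assumption-laden approximation, not any method named in the thread: it treats the two conditions as single Poisson counts, uses a Wald test on the log fold change with the delta-method standard error sqrt(1/mu1 + 1/mu2), and ignores biological variability between replicates. With overdispersed data, power would be limited by the number of replicates rather than by depth alone. The function name and the fold change of 1.2 are illustrative.

```python
import math
from statistics import NormalDist


def approx_power(mean_count, fold_change, alpha=0.05):
    """Approximate two-sided power to detect a given fold change.

    Assumes one Poisson count per condition with expected values
    mean_count and mean_count * fold_change, and a normal approximation
    for the log fold change.
    """
    se = math.sqrt(1.0 / mean_count + 1.0 / (mean_count * fold_change))
    z = abs(math.log(fold_change)) / se
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return std_normal.cdf(z - z_crit) + std_normal.cdf(-z - z_crit)


# The same modest fold change at increasing expected counts, which could
# come from higher expression, a longer gene, or deeper sequencing.
for mean_count in (10, 100, 1_000, 10_000):
    power = approx_power(mean_count, fold_change=1.2)
    print(f"expected count={mean_count:>6}  approx. power={power:.3f}")
```

Under these assumptions the fold change is held fixed and only the count changes, yet the chance of calling the gene DE rises from close to the significance level towards 1.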

            Interestingly, the fold changes of DE genes do not decrease dramatically with increasing sum counts, and are therefore not really useful for further discrimination.

            I think this fact should be taken into account during pathway analysis (at least for enrichment and SPIA analyses, not for GS(E)A). Lower expressed DE genes (i.e. genes with fewer counts) should be weighted higher than higher expressed genes (which get called more easily).

            Is there a program for pathway analysis that takes this into account?