Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNAseq datas analysis strategy and clustering

    Hi everyone,

    I'm a fresh french PhD student in biology who has to struggle since a few weeks with RNAseq datas with a little (or no) background in statistics. I've read a lot but as there is no biostatisticians in my lab, i'm coming here for a advices.
    The aim of my project is to make sub groups in a patients cohort and to establish a molecular signature with a minimal list of differentially expressed genes. For that, i got RNAseq results from 40 patients and 2 control samples with no replicates.
    So after a few research i started to use DESeq (with the reads, don't panic), and my idea is:
    First, making a clustering with a big list of genes (around 7700 after filtering genes with low coverage), and then after having my sub group I'll consider one group as biological replicates and compare each of the group with the rest of the patients, and filter the differentially expressed genes, add them in one list which will allow me to discriminate every group.
    So i followed the DESeq workflow like this:

    allreads2 <- read.table("allreads2.txt", header=TRUE, row.names=1)
    conds <- factor(c("p10019", "p10139", "p10381", "p10398", "p10414", "p10487",
    "p10878", "p10975", "p10991", "p11065", "p11176", "p11229", "p11478",
    "p11913", "p13465", "p13690", "p14000", "p14246", "p14435", "p14462",
    "p14470", "p14696", "p14696", "p14896", "p14903", "p15159", "p15867",
    "p15949", "p16070", "p17343", "p17495", "p17838", "p1918", "p459",
    "p4659", "p5844", "p5929", "p9506", "p9545", "p9882", "control1", "control2" )) (I have considered each patient as one conditions)
    cds <- newCountDataSet( allreads2, conds )
    cds <- estimateSizeFactors( cds )
    normcds <- counts( cds, normalized=TRUE )
    cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
    vsd <- varianceStabilizingTransformation( cds )

    And then i have made the heatmap of the count table:
    library("RColorBrewer")
    library("gtools")
    select = order(rowMeans(counts(cds)), decreasing=TRUE)[1:7716]
    colors <- colorRampPalette(c("yellow","red"))(100)
    heatmap(exprs(vsd)[select,], col = colors, trace="none", margin=c(10,10))

    And the heatmap sample to sample distance
    dists <- dist( t( exprs(vsd) ) )
    mat = as.matrix( dists )
    heatmap(mat, trace="none", col = rev(colors), margin=c(13, 13))

    Which gave me different results in clustering.

    So here are my questions :
    1. First of all, is my strategy to get this molecular signature and my R commands right? Otherwise, what should i change?
    2. And then which heatmap seems the more appropriate to do my sub groups according to my experiment and the purpose and why (that's actually what's really blocking me at this point, because everybody in my lab has a different point of view on this, and cannot explain)?
    That would be really really great if someone can help a student who dont want to end like most of biologists, bad in statistics ^^.

    Thanks a lot.

  • #2
    You have quite an interesting project! It's pretty common that everyone you ask will have a different opinion on what the best way to approach this is, typically asking 10 people will get you 13 different answers (none of which are necessarily wrong).

    Personally, I would prefer a PCA plot over the heatmap (see the "plotPCA" function), with which I think would prove a bit simpler to eye-ball clustered samples. You might also consider just using k-means clustering (or similar) to get an idea how many groups you can optimally partition your samples into, but that'll take longer.

    Comment


    • #3
      Thanks for your answer dpryan. You highlight the problem, as i get many answers with no really logical, or statistical explanations, i'm a bit confused for choosing the "best" clustering method which really fits in my case. I won't be too harsh of course, i'm the same in this field. Actually, I've tried the PCA plot but it did work well because n is too large with this warning message: In brewer.pal(nlevels(fac), "Paired").
      But is the strategy and the R commands appropriate?

      Comment


      • #4
        I have no idea what software you have access to, but if you have access to SAS you can use it to perform K-means clustering for all possible values of K for your dataset (2 to 40 in your case). Then use SAS's Cubic Clustering Criterion (their CCC statistic) to determine the optimal value of K for your dataset (i.e. the K with the largest CCC value). That then will define your groups.

        I don't know if CCC is implemented in R as it is a SAS creation.
        Michael Black, Ph.D.
        ScitoVation LLC. RTP, N.C.

        Comment


        • #5
          Etian - one thing my lab has found useful and informative is to produce a cluster plot with R's 'hclust' function. We've used this to cluster samples and the results have been not only logical but experimentally verified.

          After your line where you make the variable 'vsd'...

          vsd <- varianceStabilizingTransformation( cds )

          do this...

          Code:
          di <- dist(t(vsd), method="euclidean")
          hc <- hclust(di, method="ward")
          plot(hc)
          When you plot you'll get what's called a dendrogam (you can find information about them online). With a dendrogram you can observe the grouping of samples based on how close they are within the tree relative to one another. Literally, the further you'd have to trace a path from one sample to another translates to how dissimilar they are. Also with a dendrogram you can draw a horizontal line "cutting" the tree to call clusters.

          this link has a couple example images that show what I'm talking about:
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment


          • #6
            Thanks for your advice mbblack but i don't have access to SAS (I believe it's only big lab that can afford something like that) but i'll check for the CCC via R.
            Thanks a lot sdriscoll, I think i' m going to use that method hclust, the info i've found for the moment seems to legitimate that approach. I just have to use
            di <- dist(t(exprs(vsd)), method="euclidean") instead of
            di <- dist(t(vsd), method="euclidean")

            Comment


            • #7
              Originally posted by Etian View Post
              Thanks for your advice mbblack but i don't have access to SAS (I believe it's only big lab that can afford something like that) but i'll check for the CCC via R.
              Actually, at least here in the USA, most major universities will have floating site licenses for several stats packages like SAS, Matlab, SPSS and so forth.
              Michael Black, Ph.D.
              ScitoVation LLC. RTP, N.C.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Today, 08:47 AM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              59 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              54 views
              0 likes
              Last Post seqadmin  
              Working...
              X