Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2 DataSetFromMatrix question

    Hi,

    I've got a pretty simple RNAseq experiment (2 conditions, 3 biological replicates) but I'm having trouble getting the count data into DESeq2.

    What I've done:

    > bckCountTable <- read.table("bck_counts.txt", header=TRUE, row.names=1)

    > head(bckCountTable)
    GeneID ctl1 ctl2 ctl3 del1 del2 del3
    1 CNAG_00001 0 0 0 0 0 0
    2 CNAG_00002 29 34 27 26 13 21
    3 CNAG_00003 38 26 38 41 38 27
    4 CNAG_00004 63 42 58 58 62 55

    > ExpDesign <- data.frame(row.names=colnames(bckCountTable), condition = c("ctl","ctl","ctl","del","del","del"))

    > bckCDS <- DESeqDataSetFromMatrix(countData = bckCountTable, colData=ExpDesign$condition, design=~ExpDesign$condition)
    Error in validObject(.Object) :
    invalid class “SummarizedExperiment” object: invalid object for slot "colData" in class "SummarizedExperiment": got class "factor", should be or extend class "DataFrame"

    I guess I don't understand how to define colData.

    Any advice would be appreciated.

    Regards,
    Maureen


    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] parallel stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] DESeq2_1.2.2 RcppArmadillo_0.3.920.1 Rcpp_0.10.6 GenomicRanges_1.14.3
    [5] XVector_0.2.0 IRanges_1.20.4 BiocGenerics_0.8.0 BiocInstaller_1.12.0

    loaded via a namespace (and not attached):
    [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 DBI_0.2-7 genefilter_1.44.0
    [6] grid_3.0.2 lattice_0.20-24 locfit_1.5-9.1 RColorBrewer_1.0-5 RSQLite_0.11.4
    [11] splines_3.0.2 stats4_3.0.2 survival_2.37-4 tools_3.0.2 XML_3.95-0.2
    [16] xtable_1.7-1

  • #2
    In trying to understand my issue with the use of the DESeqDataSetfromMatrix command in DESeq2 I tried to compare my data to the Pasilla data used in the vignette:

    > data("pasillaGenes")
    > countData <- counts(pasillaGenes)

    > colData <- pData(pasillaGenes)[,c("condition","type")]
    > colData
    condition type
    treated1fb treated single-read
    treated2fb treated paired-end
    treated3fb treated paired-end
    untreated1fb untreated single-read
    untreated2fb untreated single-read
    untreated3fb untreated paired-end
    untreated4fb untreated paired-end

    Which imports nicely into a DEseq Data Set:
    > pasilla_dds <- DESeqDataSetFromMatrix(countData, colData, formula(~condition))

    However, I don't understand where does the type information come from. It doesn't seem to be in the countData matrix because when I look at countData, all I see are 7 columns of data, (GeneIDs plus the count columns).

    > head(countData)
    treated1fb treated2fb treated3fb untreated1fb untreated2fb untreated3fb untreated4fb
    FBgn0000003 0 0 1 0 0 0 0
    FBgn0000008 78 46 43 47 89 53 27
    FBgn0000014 2 0 0 0 0 1 0
    FBgn0000015 1 0 1 0 1 1 2
    FBgn0000017 3187 1672 1859 2445 4615 2063 1711
    FBgn0000018 369 150 176 288 383 135 174

    > dim(countData)
    [1] 14470 7

    My data seems to be quite similar, but when I try using pData to set the colData variable, I get an error.

    My data:
    > head(bckCountTable)
    ctl1 ctl2 ctl3 del1 del2 del3
    CNAG_00001 0 0 0 0 0 0
    CNAG_00002 29 34 27 26 13 21
    CNAG_00003 38 26 38 41 38 27
    CNAG_00004 63 42 58 58 62 55
    CNAG_00005 57 49 49 30 39 40
    CNAG_00006 433 398 571 422 353 435

    > dim(bckCountTable)
    [1] 6967 6

    When I try to set colData using the same commands:
    > colData <- pData(bckCountTable)[,c("condition","type")]
    Error in (function (classes, fdef, mtable) :
    unable to find an inherited method for function ‘pData’ for signature ‘"data.frame"’

    I'm still not clear how my matrix of counts differs from that generated for the Pasilla genes.

    I also tried following the example in the reference guide:
    > colData <- data.frame(condition=factor(c("ctl","del")))

    > colData
    condition
    1 ctl
    2 del

    But when I try to set up the DEseqDataSet, I get an error
    > bck_dds <- DESeqDataSetFromMatrix(bckCountTable, colData, formula(~condition))
    Error in validObject(.Object) :
    invalid class “SummarizedExperiment” object: 'colData' nrow differs from 'assays' ncol

    Again, any advice or ideas would be most appreciated.

    Regards,
    Maureen

    Comment


    • #3
      The original problem is partly that "bckCountTable" contains a column with gene names. Just remove that. Also, ExpDesign only has one column, so you can just:
      Code:
      bckCDS <- DESeqDataSetFromMatrix(countData = bckCountTable, colData=ExpDesign, design=~condition)
      or something along those lines. The number of rows in colData need to match the number of samples that you're analysing.

      Regarding "type", you won't find it in the count matrix because it just describes whether things were run as single or paired-end. It's only used latter in the vignette where they discuss multifactor designs. For simple experiments, ignore it. You can't use the design from the pasilla experiment because it had 7 samples and you have 6.

      Comment


      • #4
        I appreciate your help, but unless I've missed something, if I remove the gene name IDs from the count data, then I won't be able to figure out what genes are differentially expressed.

        The issue really seems to be with how the colData is defined.

        I tried defining a samples data frame:

        > samples
        samples condition
        1 ctl1 ctl
        2 ctl2 ctl
        3 ctl3 ctl
        4 del1 del
        5 del2 del
        6 del3 del

        And then using:
        > bckCDS <- DESeqDataSetFromMatrix(countData = bckCountTable, colData=samples$samples, design=samples$condition)

        But I still get the same error:
        Error in validObject(.Object) :
        invalid class “SummarizedExperiment” object: invalid object for slot "colData" in class "SummarizedExperiment": got class "factor", should be or extend class "DataFrame"

        Comment


        • #5
          The gene names need to be the row.names, not a column of their own as they were in your example. In what you just wrote, try instead:
          Code:
          samples <- data.frame(row.names=c("ctl1","ctl2","ctl3","del1","del2","del3"),
              condition=as.factor(c(rep("ctl",3), rep("del",3))))
          bckCDS <- DESeqDataSetFromMatrix(countData = bckCountTable, colData=samples, design=~condition)

          Comment


          • #6
            Thank you dpryan! That worked quite well.

            To summarize what worked:

            > bckCountTable <- read.table("bck_counts.txt", header=TRUE, row.names=1)
            > head(bckCountTable)
            ctl1 ctl2 ctl3 del1 del2 del3
            CNAG_00001 0 0 0 0 0 0
            CNAG_00002 29 34 27 26 13 21
            CNAG_00003 38 26 38 41 38 27
            CNAG_00004 63 42 58 58 62 55
            CNAG_00005 57 49 49 30 39 40
            CNAG_00006 433 398 571 422 353 435


            > samples <- data.frame(row.names=c("ctl1","ctl2","ctl3","del1","del2","del3"), condition=as.factor(c(rep("ctl",3),rep("del",3))))
            > samples
            condition
            ctl1 ctl
            ctl2 ctl
            ctl3 ctl
            del1 del
            del2 del
            del3 del

            > bckCDS <- DESeqDataSetFromMatrix(countData = bckCountTable, colData=samples, design=~condition)
            > bckCDS_1 <- DESeq(bckCDS)
            estimating size factors
            estimating dispersions
            gene-wise dispersion estimates
            mean-dispersion relationship
            final dispersion estimates
            fitting model and testing

            > bck_res <- results(bckCDS_1)

            > head(bck_res)
            DataFrame with 6 rows and 6 columns
            baseMean log2FoldChange lfcSE stat pvalue padj
            <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
            CNAG_00001 0.00000 NA NA NA NA NA
            CNAG_00002 25.38193 -0.50881187 0.2159023 -2.3566760 0.018439326 NA
            CNAG_00003 34.46418 -0.08588360 0.2010327 -0.4272120 0.669224911 0.84870693
            CNAG_00004 56.07399 -0.05626052 0.1771525 -0.3175824 0.750801714 0.89623448
            CNAG_00005 44.69432 -0.52210975 0.1925740 -2.7112160 0.006703695 0.05953184
            CNAG_00006 434.90795 -0.35984037 0.1175182 -3.0619982 0.002198648 0.02705483

            > write.csv(bck_res,file="bck_results.csv")

            I should have filtered out those genes with few or no counts, but I can do that after importing the data into Excel.

            Thanks again.

            Maureen

            Comment


            • #7
              I just want to say thank you Mdonlin. I am new to data analysis and your final summary helped me a lot.

              Comment


              • #8
                Gong,

                Glad the summary helped.

                Maureen

                Comment


                • #9
                  Maureen,

                  Many thanks for your detailed summary. This was perfect to get me up and running.

                  Cheers,
                  Mark

                  Comment


                  • #10
                    Hi,
                    I'm still not able to enter the data. Here's what it shows..

                    RivsRU <-read.table("RIvsRUdata", header=TRUE, row.names = 1)
                    > head(RivsRU)
                    FEATURE_ID RI2 RI3 RI1 RU1 RU2 RU3
                    1 AAAAAAAAAAAAAAAAA 2 3 2 8 1 2
                    2 AAAAAAAAAAAAAAAAAA 2 2 1 10 1 5
                    3 AAAAAAAAAAAAAAAAAAA 0 1 0 3 2 1
                    4 AAAAAAAAAAAAAAAAAAAA 1 1 0 2 0 0
                    5 AAAAAAAAAAAAAAAAAAAAA 0 0 1 2 1 2
                    6 AAAAAAAAAAAAAAAAAAAAAACAAAAA 1 0 0 0 0 0
                    >samples <- data.frame(row.names=c("RI1","RI2","RI3","RU1","RU2","RU3"), condition=as.factor(c(rep("RI",3),rep("RU",3))))

                    > RIvsRU <- DESeqDataSetFromMatrix(countData = RivsRU, colData = samples, design=~conditions)

                    Error in validObject(.Object) :
                    invalid class “SummarizedExperiment0” object: 'assays' ncol differs from 'colData' nrow
                    In addition: Warning message:
                    In sort(rownames(colData)) == sort(colnames(countData)) :
                    longer object length is not a multiple of shorter object length

                    Could someone please tell me what am I doing wrong?
                    Thanks
                    Last edited by Sow; 02-16-2016, 05:54 PM.

                    Comment


                    • #11
                      I see that I still have the gene names as a separate column - my bad.
                      What would be the code to change row names to gene names?
                      Sorry if this is a simple thing in R, I'm still very new to R!

                      Comment


                      • #12
                        Hi Sow,

                        just use:
                        Code:
                        x=RivsRU[,-1] #removing the first column
                        rownames(x)=RivsRU[,1] # adding the first column as rownames
                        Afterwards you'll just use countData=x.

                        P.S.: be careful using similar names for different variables: RivsRU and RIvsRU will be hard for others to understand your code.

                        Cheers,
                        Michael
                        Last edited by Michael.Ante; 02-17-2016, 12:46 AM. Reason: Typo

                        Comment


                        • #13
                          Thanks Michele,

                          I will keep your suggestion in mind as well.

                          Sow

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          27 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          30 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          26 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          52 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X