Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Weird DE-results

    I'm doing some differential expression analysis with DESeq2, limma(voom) and edgeR, and I get some weird results. There is a gene that I know should be highly expressed in one of my samples, but my results tell me it's the other way around. Looking at both the counts and the FPKM values (from Cufflinks) for said gene, they both agree that I should get a log2 fold change around 0.8, but I get around -0.8. Setting aside the fact that eyeballing a fold change from counts or FPKM isn't the right way to do things, it should at the very least be of the correct sign, right?

    I have not noticed this until now, when I happened to check this particular gene, and now I seem to see this in other genes as well. I can do some other eyeballing for random genes where I see a difference in counts/FPKM, and while it's not always so pronounced, I do seem to get some weird values. So, I started looking in my code for possible reasons for this. I checked my dds structure in DESeq2, and this is what it looks like:

    Code:
    dds
    class: DESeqDataSet 
    dim: 20477 7 
    exptData(0):
    assays(1): counts
    rownames(20477): ENSG00000000003 ENSG00000000005 ... ENSG00000273431
      ENSG00000273457
    rowData metadata column names(0):
    colnames(7): 1 2 ... 6 7
    colData names(2): names.data. condition
    Now, comparing this to the DESeq2 vignette, everything seems fine, except the "colnames(7): 1 2 ... 6 7" part - instead of numbers in my data, the vignette has "treated1, treated2", etc instead. I figured it must then be due to some fault in me creating the dds construct, but I can't figure out what. This is (the relevant parts of) my code:

    Code:
    # Load data
    cat('reading data\n')
    data = read.table("counts/collected_counts.txt", header=TRUE, row.names=1)
    
    # Filter for desired samples
    samples = strsplit(args$input_samples, ",")[[1]]
    data = data[ , c(grep(samples[1], names(data), value=TRUE), grep(samples[2], names(data), value=TRUE))]
    
    # Load data in DESeq2
    number_samples_1 = length(grep(samples[1], names(data), value=TRUE))
    number_samples_2 = length(grep(samples[2], names(data), value=TRUE))
    condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))
    organization = data.frame(names(data), condition=condition)
    
    #DESeq2 calculations
    dds = DESeqDataSetFromMatrix(countData=data, colData=organization, design=~condition)
    dds
    dds = DESeq(dds, parallel=TRUE)
    res_DESeq2 = results(dds, parallel=TRUE)
    I call my script from the terminal as an Rscript like so:

    Code:
    de_analysis.R cell_line_1,cell_line_2
    ... mainly so that I can do different analyses on different combinations of cell lines without having to do different sets of code. I am guessing that there's something wrong with part of this that is making the fold change go bonkers. It is my intent that the fold change would here be cell_line_1 / cell_line_2. Given an experiment with 3 and 4 replicates per cell line (as above) the "condition" variable looks like this:

    Code:
    [1] hct hct hct rko rko rko rko
    Levels: hct rko
    And the "organization" variable like this:

    Code:
      names.data. condition
    1       hct_a       hct
    2       hct_b       hct
    3       hct_c       hct
    4       rko_a       rko
    5       rko_b       rko
    6       rko_c       rko
    7       rko_d       rko
    ... which looks right to me. I am stumped, and I would VERY much appreciate any help with this!
    Last edited by ErikFas; 02-17-2015, 06:16 AM.

  • #2
    The lack of the colnames slot having the actual sample names isn't an issue. If you set the row.names() of "organization" to the sample names then that slot will get filled in.

    It's likely that all of the fold-changes just have the opposite sign to what you expect because the base level is different from what you expect. Unless you explicitly specify a factor level order, R will set the lexicographic first level as a factor as the base level. You might consider using the contrast= argument to results() just so that you can specify the base level in the fold-change more conveniently (that, or just have that set further up when you make the "organization" data.frame).

    Comment


    • #3
      Like dpryan said, it might just be the case of having the two conditions switched around that leads to the inverse fold change. You could check this using something like head(res_DESeq2) for the DESeq2 results which would tell you the order in which it is comparing your conditions. I hope this helps.

      Comment


      • #4
        @aggp11
        Okay, i tried head(res_DESeq2) and this is what I get:

        Code:
        log2 fold change (MAP): condition rko vs hct 
        Wald test p-value: condition rko vs hct
        ... and I'm calling the script as hct,rko. Does that mean that it is doing fold change = hct / rko (like I want) or the other way around?

        @dpryan
        I haven't really used contrasts much, as I read in the vignette that it's mainly used for cases where you have more than 2 comparisons (i.e. A vs B vs C and the combinations thereof), or am I misreading that? Or do you mean some sort of "hct vs rko vs base level"? (What is base level here, anyway?)

        Comment


        • #5
          The other way around. rko vs hct means log2(rko/hct). You can specify the groups in any order and this will still be how the fold-change is computed due to how factors are constructed in R.

          Regarding contrasts, yes, those are mostly used with more groups, but they can also allow you to arbitrarily set which comparison is used for the fold changes. For a baselevel, R will always use the lexicographicly first factor level. Since "hct" would come before "rct" in a dictionary, it's the base level used for comparisons. Similarly, if your groups were "control" and "cancer", then the fold-change would be control/cancer, even though that's the opposite of what you want. So either set the base level manually:
          Code:
          groups <- factor(groups, levels=c("rko", "hct"))
          or use a contrast.

          Comment


          • #6
            Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko? I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?

            Comment


            • #7
              Originally posted by ErikFas View Post
              Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko?
              Exactly. the levels= part is a convenient way to reset how R would normally handle things.

              I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?
              "groups" was just an example name. I guess it's called "organization" in your script.

              Comment


              • #8
                That did the trick! Although the thing I needed to add was condition, like this:

                Code:
                condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))  # as previously
                condition = factor(condition, levels=c(samples[2], samples[1]))  # new line
                ... rather than organization, which I started with. I then checked the DESeq2 vignette, and they did:

                Code:
                dds$condition = factor(dds$condition, levels=c(samples[2], samples[1]))
                ... which also works just fine, except it doesn't do anything for my downstream analyses of limma(voom) and edgeR - changing the condition parameter does. So, thanks again for all your help!

                Comment


                • #9
                  hi Erik,

                  Do we actually have the line of code with "levels=c(samples[2], samples[1])" somewhere? I can't find it. I try to encourage explicitly writing out the level names as character, because sample order can change.

                  Comment


                  • #10
                    Hey, Michael! Sorry, I wasn't being clear. You have the line written explicitly as "levels=c("untreated","treated")", just like you say you do - I was just writing the equivelant for my code for clarity of the discussion. Sorry for the confusion!

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    18 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    22 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    16 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    47 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X