SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq results: an weird gene Jerry_Zhao RNA Sequencing 0 07-30-2014 06:32 AM
Weird bioanalyzer results exo Sample Prep / Library Generation 3 03-27-2014 02:15 PM
Need help interpreting these weird Bioanalyzer results Sciurus Sample Prep / Library Generation 5 01-29-2014 08:04 AM
weird bioanalyzer results newendophytologist 454 Pyrosequencing 10 10-02-2012 11:42 AM
very weird samtools pileup results -- help! csoong Bioinformatics 1 12-25-2010 03:20 AM

Reply
 
Thread Tools
Old 02-17-2015, 04:36 AM   #1
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default Weird DE-results

I'm doing some differential expression analysis with DESeq2, limma(voom) and edgeR, and I get some weird results. There is a gene that I know should be highly expressed in one of my samples, but my results tell me it's the other way around. Looking at both the counts and the FPKM values (from Cufflinks) for said gene, they both agree that I should get a log2 fold change around 0.8, but I get around -0.8. Setting aside the fact that eyeballing a fold change from counts or FPKM isn't the right way to do things, it should at the very least be of the correct sign, right?

I have not noticed this until now, when I happened to check this particular gene, and now I seem to see this in other genes as well. I can do some other eyeballing for random genes where I see a difference in counts/FPKM, and while it's not always so pronounced, I do seem to get some weird values. So, I started looking in my code for possible reasons for this. I checked my dds structure in DESeq2, and this is what it looks like:

Code:
dds
class: DESeqDataSet 
dim: 20477 7 
exptData(0):
assays(1): counts
rownames(20477): ENSG00000000003 ENSG00000000005 ... ENSG00000273431
  ENSG00000273457
rowData metadata column names(0):
colnames(7): 1 2 ... 6 7
colData names(2): names.data. condition
Now, comparing this to the DESeq2 vignette, everything seems fine, except the "colnames(7): 1 2 ... 6 7" part - instead of numbers in my data, the vignette has "treated1, treated2", etc instead. I figured it must then be due to some fault in me creating the dds construct, but I can't figure out what. This is (the relevant parts of) my code:

Code:
# Load data
cat('reading data\n')
data = read.table("counts/collected_counts.txt", header=TRUE, row.names=1)

# Filter for desired samples
samples = strsplit(args$input_samples, ",")[[1]]
data = data[ , c(grep(samples[1], names(data), value=TRUE), grep(samples[2], names(data), value=TRUE))]

# Load data in DESeq2
number_samples_1 = length(grep(samples[1], names(data), value=TRUE))
number_samples_2 = length(grep(samples[2], names(data), value=TRUE))
condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))
organization = data.frame(names(data), condition=condition)

#DESeq2 calculations
dds = DESeqDataSetFromMatrix(countData=data, colData=organization, design=~condition)
dds
dds = DESeq(dds, parallel=TRUE)
res_DESeq2 = results(dds, parallel=TRUE)
I call my script from the terminal as an Rscript like so:

Code:
de_analysis.R cell_line_1,cell_line_2
... mainly so that I can do different analyses on different combinations of cell lines without having to do different sets of code. I am guessing that there's something wrong with part of this that is making the fold change go bonkers. It is my intent that the fold change would here be cell_line_1 / cell_line_2. Given an experiment with 3 and 4 replicates per cell line (as above) the "condition" variable looks like this:

Code:
[1] hct hct hct rko rko rko rko
Levels: hct rko
And the "organization" variable like this:

Code:
  names.data. condition
1       hct_a       hct
2       hct_b       hct
3       hct_c       hct
4       rko_a       rko
5       rko_b       rko
6       rko_c       rko
7       rko_d       rko
... which looks right to me. I am stumped, and I would VERY much appreciate any help with this!

Last edited by ErikFas; 02-17-2015 at 05:16 AM.
ErikFas is offline   Reply With Quote
Old 02-17-2015, 05:34 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The lack of the colnames slot having the actual sample names isn't an issue. If you set the row.names() of "organization" to the sample names then that slot will get filled in.

It's likely that all of the fold-changes just have the opposite sign to what you expect because the base level is different from what you expect. Unless you explicitly specify a factor level order, R will set the lexicographic first level as a factor as the base level. You might consider using the contrast= argument to results() just so that you can specify the base level in the fold-change more conveniently (that, or just have that set further up when you make the "organization" data.frame).
dpryan is offline   Reply With Quote
Old 02-17-2015, 01:39 PM   #3
aggp11
Member
 
Location: Wisconsin

Join Date: Jun 2011
Posts: 87
Default

Like dpryan said, it might just be the case of having the two conditions switched around that leads to the inverse fold change. You could check this using something like head(res_DESeq2) for the DESeq2 results which would tell you the order in which it is comparing your conditions. I hope this helps.
aggp11 is offline   Reply With Quote
Old 02-17-2015, 11:07 PM   #4
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

@aggp11
Okay, i tried head(res_DESeq2) and this is what I get:

Code:
log2 fold change (MAP): condition rko vs hct 
Wald test p-value: condition rko vs hct
... and I'm calling the script as hct,rko. Does that mean that it is doing fold change = hct / rko (like I want) or the other way around?

@dpryan
I haven't really used contrasts much, as I read in the vignette that it's mainly used for cases where you have more than 2 comparisons (i.e. A vs B vs C and the combinations thereof), or am I misreading that? Or do you mean some sort of "hct vs rko vs base level"? (What is base level here, anyway?)
ErikFas is offline   Reply With Quote
Old 02-17-2015, 11:27 PM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The other way around. rko vs hct means log2(rko/hct). You can specify the groups in any order and this will still be how the fold-change is computed due to how factors are constructed in R.

Regarding contrasts, yes, those are mostly used with more groups, but they can also allow you to arbitrarily set which comparison is used for the fold changes. For a baselevel, R will always use the lexicographicly first factor level. Since "hct" would come before "rct" in a dictionary, it's the base level used for comparisons. Similarly, if your groups were "control" and "cancer", then the fold-change would be control/cancer, even though that's the opposite of what you want. So either set the base level manually:
Code:
groups <- factor(groups, levels=c("rko", "hct"))
or use a contrast.
dpryan is offline   Reply With Quote
Old 02-17-2015, 11:37 PM   #6
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko? I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?
ErikFas is offline   Reply With Quote
Old 02-17-2015, 11:39 PM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by ErikFas View Post
Okay, so setting groups as you said (levels=c("rko","hct")) would make the fold change be hct / rko?
Exactly. the levels= part is a convenient way to reset how R would normally handle things.

Quote:
I don't have any groups-parameter in any of my function calls that I know of; where is it supposed to go, and where does the already existing groups that you use come from?
"groups" was just an example name. I guess it's called "organization" in your script.
dpryan is offline   Reply With Quote
Old 02-18-2015, 12:21 AM   #8
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

That did the trick! Although the thing I needed to add was condition, like this:

Code:
condition = as.factor(c(rep(samples[1], number_samples_1), rep(samples[2], number_samples_2)))  # as previously
condition = factor(condition, levels=c(samples[2], samples[1]))  # new line
... rather than organization, which I started with. I then checked the DESeq2 vignette, and they did:

Code:
dds$condition = factor(dds$condition, levels=c(samples[2], samples[1]))
... which also works just fine, except it doesn't do anything for my downstream analyses of limma(voom) and edgeR - changing the condition parameter does. So, thanks again for all your help!
ErikFas is offline   Reply With Quote
Old 02-18-2015, 07:18 AM   #9
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

hi Erik,

Do we actually have the line of code with "levels=c(samples[2], samples[1])" somewhere? I can't find it. I try to encourage explicitly writing out the level names as character, because sample order can change.
Michael Love is offline   Reply With Quote
Old 02-18-2015, 10:08 PM   #10
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Hey, Michael! Sorry, I wasn't being clear. You have the line written explicitly as "levels=c("untreated","treated")", just like you say you do - I was just writing the equivelant for my code for clarity of the discussion. Sorry for the confusion!
ErikFas is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:40 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO