SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   DEXSeq error in estimateDispersions: match.arg(start.method, c("log(y)", "mean")) (http://seqanswers.com/forums/showthread.php?t=23589)

fpadilla 09-24-2012 06:03 PM

DEXSeq error in estimateDispersions: match.arg(start.method, c("log(y)", "mean"))
 
Hey Guys:

I got the following error many times while calling estimateDispersions:

Error in match.arg(start.method, c("log(y)", "mean")) : 'arg' must be NULL or a character vector

Below is the code I'm runing, the weird thing is execution of estimateDispersions is finished and continue to fitDispersionFunction, when it is stopped because estimateDispersions failed.

Code:

exonStats = read.HTSeqCounts(
                countfiles = file.path("./", paste(rownames(samples),"HTSeqCount.txt",sep=".") ),
                design = samples, flattenedfile = annotationFile)

exonStats <- estimateSizeFactors(exonStats)
exonStats <- estimateDispersions(exonStats, quiet=TRUE) # Error is here
exonStats <- fitDispersionFunction(exonStats) # Execution stop here

Also, I have a few datasets for testing and still I get error. Here is the experiment design

Code:

> samples
      condition replicate        type
31 4cm from base        1 single-read
32 4cm from base        2 single-read
33          tip        1 single-read
34          tip        2 single-read

This is the session info:
Code:

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8      LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8 
 [7] LC_PAPER=C                LC_NAME=C               
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C     

attached base packages:
[1] parallel  tools    stats    graphics  grDevices utils    datasets
[8] methods  base   

other attached packages:
[1] multicore_0.1-7 DEXSeq_1.0.2    Biobase_2.14.0

loaded via a namespace (and not attached):
[1] hwriter_1.3    plyr_1.7.1    statmod_1.4.15 stringr_0.6.1

I'll appreciate your help !

Francisco

fpadilla 09-28-2012 10:25 AM

For the record:

I upgraded R version to 2.15.1 and Bioconductor to 2.10 and the error doesn't ocurr any more, but now I get a new error at estimatelog2FoldChanges function:

Error: subscript out of bounds

exonStats <- estimatelog2FoldChanges( exonStats )

Code:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8      LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8 
 [7] LC_PAPER=C                LC_NAME=C               
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C     

attached base packages:
[1] parallel  tools    stats    graphics  grDevices utils    datasets
[8] methods  base   

other attached packages:
[1] multicore_0.1-7    DEXSeq_1.2.1      Biobase_2.16.0    BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] biomaRt_2.12.0 hwriter_1.3    plyr_1.7.1    RCurl_1.91-1  statmod_1.4.15
[6] stringr_0.6.1  XML_3.6-2


areyes 10-01-2012 03:44 AM

Hi Francisco,

Can you post the error message that you are getting now?

katzman 10-09-2012 10:49 AM

I am a longtime DESeq user (and a big fan).

On my first attempt with DEXSeq, I got the same message as fpadilla while calling estimateDispersions():

Error in match.arg(start.method, c("log(y)", "mean")) : 'arg' must be NULL or a character vector

Thanks to fpadilla for the suggestion to use R 2.15.1

That eliminated that problem. Fortunately, I have not seen the subsequent problem with estimatelog2FoldChanges(). I am not (yet) running multicore.

My sessionInfo is below.

/Sol.

Code:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats    graphics  grDevices utils    datasets  methods  base   

other attached packages:
[1] DEXSeq_1.2.1      Biobase_2.16.0    BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-0.1.2 XML_3.95-0.1    biomaRt_2.12.0  hwriter_1.3      plyr_1.7.1      statmod_1.4.16  stringr_0.6.1    tools_2.15.1


fpadilla 10-11-2012 01:38 PM

I figurate out! The error is because still I have the replicate column in the design data.frame. Also I have space characters in the values of the condition column.

zimmernv 12-03-2012 08:22 AM

Hey Francisco,

I have the same "Error: match.arg()" that you had previously. I have the most up-to-date versions of R (2.15.2), DEXSeq (1.4.0), and Bioconductor (2.11) -- or so I think. But the error does not disappear.

My ecs variable has valid count data (not the 0's or NULL values that have been known to occur). So can you explain in greater detail what you mean by "I still have replicate column in the design data.frame" and "space characters in the values of the condition column"? Should you not have a replicate column when you use estimateDispersions()? My design definitely still has that column but I never saw any indication in the latest vignette to remove that information.

Could you show how you wrote your "design"? I have two controls and two transgenics, so I just wrote:

> samples = data.frame(condition = c(rep("control", 2), rep("transgenic", 2)), replicate = c(1:2, 1:2), row.names = countfiles, stringsAsFactors=TRUE, check.names=FALSE)

where 'countfiles' is defined earlier in my script.

I welcome any help. Thanks in advance.

areyes 12-03-2012 12:07 PM

Dear zimmernv,

To remove the "replicate" column, instead of:

Code:

> samples = data.frame(condition = c(rep("control", 2), rep("transgenic", 2)), replicate = c(1:2, 1:2), row.names = countfiles, stringsAsFactors=TRUE, check.names=FALSE)
just do

Code:

> samples = data.frame(condition = c(rep("control", 2), rep("transgenic", 2)), row.names = countfiles, stringsAsFactors=TRUE, check.names=FALSE)
In fact, you are correct in saying that you did not see any change in the documentation, I should have added that. The exact change is that all the columns of the design data frame should be factors, so it would also solve the problem to convert your "replicate" column into a factor.

Let me know if that solves the problem, if not, could you add the complete (reproducible) code that you used and the complete output of the sessionInfo() ?

Alejandro

zimmernv 12-12-2012 12:49 PM

thanks ... one more question
 
Hey Alejandro,

Thanks for the response. That fixed the issue and I can now visualize my results.

Somewhat astonishingly, the differential exon usage test does not yield any positive differential splicing results for my dataset (2 replicates for both the controls and the transgenics, with very high count numbers). For a whole-transcriptome 45 million reads (i.e. decent coverage) RNA-seq experiment, not seeing a single alternative splicing event is a bit odd, don't you think?

One area of concern is the use of the non-overlapping exonic regions made via the dexseq_prepare_annotation.py script. I can understand the motivation for creating these regions, but I'm struggling to see how I can then convert the final output (counts assigned to new but non-biological exonic regions) to biologically-meaningful outputs (i.e. deconvolving the counts over these exonic regions into counts over the exons that helped to build the exonic region). To take an example, the Apoer2 gene in M. musculus has 20 exons. But the dexseq script produces a gff file that lists Apoer2 as having 40 exonic regions due to the overlaps of transcripts. The DEXSeq plots then show 40 exons rather than the conventional 20. How can I get an image that would reveal the counts over the 20 exons? Should I manually change the exonCountSet to reflect the biological reality?

To summarize the difficulty I'm having: DEXSeq, as I see it, is testing for Differential Exonic Regions rather than Differential Exons (where only the latter is biologically-meaningful, or so I think ... I come from the computer science side more so than biology).

I eagerly await your critique and insights.

areyes 12-14-2012 12:25 AM

Hey zimmernv,

Is there any obvious change that you would expect any differential exon usage regulation between the transgenics and the controls? (e.g. the inserted gene being somehow related to the splicing machinery?)

I guess the exon bining is an attempt to test for all possible exon usage regulation according by testing all unique non-overlapping exonic parts from any transcript, but at the end its an optional step. You could input in DEXSeq with "real" exons. My guess is that it won't make a huge difference, but let me know if I am wrong!

Alejandro

areyes 12-14-2012 02:04 AM

Also you could check if you have enough counts in the exons to achieve statistical power to detect something, data quality or if you have a lot of variability between replicates!

zimmernv 12-19-2012 12:41 PM

Quote:

Originally Posted by areyes (Post 91608)
Hey zimmernv,

Is there any obvious change that you would expect any differential exon usage regulation between the transgenics and the controls? (e.g. the inserted gene being somehow related to the splicing machinery?)

I guess the exon bining is an attempt to test for all possible exon usage regulation according by testing all unique non-overlapping exonic parts from any transcript, but at the end its an optional step. You could input in DEXSeq with "real" exons. My guess is that it won't make a huge difference, but let me know if I am wrong!

Alejandro

Hey Alejandro,

I tried it again with "real" exons, and I got 1 differentially-expressed exon rather than 0. So, as you implied, the difference is not substantial (though that 1 hit is very interesting as far as the biology is concerned).

I was expecting a more pronounced change in exon regulation -- the gene that is inserted is a known splicing factor, so a single event of differential splicing is a bit unexpected. I'm using BitSeq at the moment to see if I can reproduce the results.

Speaking of comparisons, I ran into a thread on SeqAnswers in which S. Anders commented that a golden standard for gene expression and exon expression would be to have 20 replicates of one condition vs. 20 replicates of another condition and then using inferential statistics to find the significant hits. I'm curious about the number 20 -- is there a statistical reason for that value? If it can spare you a long technical explanation, could you send me the titles of the relevant papers?

Thanks a ton,
Vincent

fpadilla 12-26-2012 10:18 AM

Sorry guys. I missed this post, I'm glad you have fixed it!

Wolfgang Huber 12-27-2012 01:21 AM

I am pretty sure Simon used '20' just to indicate a sufficiently large number, so that the power of standard tests (e.g. Wilcoxon, t- or permutation tests) is sufficient to detect every relevant differentially expressed feature.

muthu545 06-07-2013 02:22 PM

Hi,

I'm am novice to using DEXSeq.
@zimmernv and @Alejandro
How could you get the flattened GTF file with 'real' exons instead of the exonic parts?

Thanks
--
Muthu

Quote:
Originally Posted by areyes

I guess the exon bining is an attempt to test for all possible exon usage regulation according by testing all unique non-overlapping exonic parts from any transcript, but at the end its an optional step. You could input in DEXSeq with "real" exons. My guess is that it won't make a huge difference, but let me know if I am wrong!

Quote:
Originally Posted by zimmernv

I tried it again with "real" exons, and I got 1 differentially-expressed exon rather than 0. So, as you implied, the difference is not substantial (though that 1 hit is very interesting as far as the biology is concerned).

muthu545 07-03-2013 02:11 PM

Areyes and Zimmernv,

Could you let me know, how could I get flattened GTF file with 'real' exons instead of the exonic parts?

Thanks
--
Muthu


All times are GMT -8. The time now is 07:03 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.