Environment problems while predicting from gaulss GAMs with a custom variance function inside a package

Question
A variance function available within an R package is not found by the `predict` function while doing predictions from a gam-object constructed previously (mgcv 1.8-31).
The R package is intended (amongst other things) to predict from gam-objects. All of the previously constructed models use the gaulss family and have their own variance function. Some variance functions are just a linear effect of a variable; others use more sophisticated custom functions. The models and variance functions were stored in a 'sysdata.rda' file to include them in the package. The package was documented with `devtools` and `roxygen2`.
Consider the following minimal example of two GAMs:
library(mgcv)

set.seed(123)
var.fun <- function(x){x^2}
x <- runif(100)
y <- x + rnorm(100, 0, var.fun(x))

mod.gam.1 <- gam(formula = list(y ~ x,
                                ~ var.fun(x)),
                 family = gaulss(link = list("log", "logb")))

mod.gam.2 <- gam(formula = list(y ~ x,
                                ~ I(x^2)),
                 family = gaulss(link = list("log", "logb")))
The first model uses a custom variance function. The second model has
the variance formula hardcoded in the model call.
The models and the variance function are then stored in a 'sysdata.rda'
file, which is included in a package named "gamvarfun" (I know, naming
things... ), as follows:
save(mod.gam.1, var.fun, mod.gam.2,
     file = "~/gamvarfun/R/sysdata.rda")
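The same file can also be created with `usethis` (a sketch, assuming it is run from the package project root; this should be equivalent to the `save()` call above):

usethis::use_data(mod.gam.1, var.fun, mod.gam.2, internal = TRUE)  # writes R/sysdata.rda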
Now two functions are added to the package to retrieve predictions from
the corresponding models:
pred.fun.1 <- function(x){
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}

pred.fun.2 <- function(x){
  predict(mod.gam.2,
          newdata = data.frame("x" = x))
}
To illustrate the problem I've uploaded a very rudimentary demo-package
to github, so the following code should work:
library(devtools)
install_github("jan-schick/gam_issue")
library(gamvarfun)
pred.fun.2(1)
# 0.3135275 0.5443409
pred.fun.1(1)
# Error in var.fun(x) : could not find function "var.fun"
# Writing var.fun to global environment
var.fun <- gamvarfun:::var.fun
pred.fun.1(1)
# 0.3135275 0.5443409
When using `pred.fun.1` (containing a custom variance function), an error is displayed. However, `pred.fun.2` (hardcoded variance function) works perfectly well. This error does not occur when `devtools::load_all()` is used instead of a 'proper' install of the package. I suspected that the problem is due to different environments when using `predict.gam`. I tested this assumption by writing the custom variance function to the global environment before calling `pred.fun.1` (see above), which worked. However, this is obviously no solution for a package.
In earlier attempts I tried to declare the function inside the package, e.g. by placing the code written above directly inside the prediction functions, which also didn't work when the package was installed:
pred.fun.1 <- function(x){
  var.fun <- function(x){x^2}
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}
I've also tried `with()` and `attach()` at the same place (inside the prediction functions) without any success.
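For illustration, one of those attempts looked roughly like this (reconstructed for this post rather than copied verbatim; the function name is made up). It fails with the same error:

pred.fun.1.with <- function(x){
  with(list(var.fun = function(x){x^2}),
       predict(mod.gam.1,
               newdata = data.frame("x" = x)))
}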
The only working solution I found was to export the variance functions and make them part of the package namespace/API. This is not a feasible solution in this case, as it would lead to numerous visible variance functions that have no practical benefit for the user.
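For completeness, exporting would mean tagging each variance function in the package source roughly like this (a sketch of the usual roxygen2 pattern), which is exactly what I would like to avoid for the many internal variance functions:

#' Variance function belonging to mod.gam.1
#'
#' @param x numeric vector
#' @export
var.fun <- function(x){x^2}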
Then there is the obvious workaround: replacing the variance functions by the original formulas in the model call, i.e. using `mod.gam.2` instead of `mod.gam.1`. However, this is not a proper solution either.
Various search engines and colleagues have been consulted to no avail. Thus, I would be grateful for any hints on how to tackle this problem.
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gamvarfun_0.1.0 usethis_1.5.0 devtools_2.0.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 rstudioapi_0.10 magrittr_1.5 splines_3.6.2
[5] pkgload_1.0.2 lattice_0.20-38 R6_2.4.0 rlang_0.4.0
[9] tools_3.6.2 grid_3.6.2 pkgbuild_1.0.3 nlme_3.1-143
[13] mgcv_1.8-31 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2
[17] remotes_2.0.4 assertthat_0.2.1 digest_0.6.20 rprojroot_1.3-2
[21] crayon_1.3.4 Matrix_1.2-18 processx_3.3.1 callr_3.2.0
[25] fs_1.3.1 ps_1.3.0 curl_4.0 testthat_2.1.1
[29] memoise_1.1.0 glue_1.3.1 compiler_3.6.2 desc_1.2.0
[33] backports_1.1.4 prettyunits_1.0.2
Answer 1

Score: 2
The problem is that formulas have environments associated with them, and you are saving the association with `globalenv()`, but putting the function somewhere else.
This isn't easy to fix if you stick with `sysdata.rda`, because the `mod.gam.1` object copies that environment into multiple locations. It's not enough to patch `environment(mod.gam.1$formula)`.
I think the only feasible approach is to include the source that produced the `mod.gam.1` object within the package source. If the formulas are defined in a context that inherits from the package namespace, then the variance function defined there will be found.
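As a minimal illustration of the point (independent of the fitted model): a formula captures the environment it is created in, and that is where `var.fun()` will later be looked up.

f <- ~ var.fun(x)
environment(f)                                               # the environment f was created in
exists("var.fun", envir = environment(f), inherits = FALSE)  # FALSE unless var.fun is defined there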
**Edited to add:** I tried this strategy, and couldn't get it to work. It looks as though there's a bug in the `mgcv` package, specifically in the code that interprets the special dual formula that the `gaulss` model uses. I'll see if I can find a workaround.
**Second edit:** Running the code below appears to fix the functions in `mgcv` that have the bug. They are based on `mgcv` version 1.8-31. I don't recommend running this script unless you have **exactly** that version installed. I've sent a message to the package maintainer; maybe these will be incorporated into a future release.
interpret.gam0 <- function(gf,textra=NULL,extra.special=NULL)
# interprets a gam formula of the generic form:
# y~x0+x1+x3*x4 + s(x5)+ s(x6,x7) ....
# and returns:
# 1. a model formula for the parametric part: pf (and pfok indicating whether it has terms)
# 2. a list of descriptors for the smooths: smooth.spec
# this is function does the work, and is called by in interpret.gam
# 'textra' is optional text to add to term labels
# 'extra.special' is label of extra smooth within formula.
{ p.env <- environment(gf) # environment of formula
tf <- terms.formula(gf,specials=c("s","te","ti","t2",extra.special)) # specials attribute indicates which terms are smooth
terms <- attr(tf,"term.labels") # labels of the model terms
nt <- length(terms) # how many terms?
if (attr(tf,"response") > 0) { # start the replacement formulae
response <- as.character(attr(tf,"variables")[2])
} else {
response <- NULL
}
sp <- attr(tf,"specials")$s # array of indices of smooth terms
tp <- attr(tf,"specials")$te # indices of tensor product terms
tip <- attr(tf,"specials")$ti # indices of tensor product pure interaction terms
t2p <- attr(tf,"specials")$t2 # indices of type 2 tensor product terms
zp <- if (is.null(extra.special)) NULL else attr(tf,"specials")[[extra.special]]
off <- attr(tf,"offset") # location of offset in formula
## have to translate sp, tp, tip, t2p (zp) so that they relate to terms,
## rather than elements of the formula...
vtab <- attr(tf,"factors") # cross tabulation of vars to terms
if (length(sp)>0) for (i in 1:length(sp)) {
ind <- (1:nt)[as.logical(vtab[sp[i],])]
sp[i] <- ind # the term that smooth relates to
}
if (length(tp)>0) for (i in 1:length(tp)) {
ind <- (1:nt)[as.logical(vtab[tp[i],])]
tp[i] <- ind # the term that smooth relates to
}
if (length(tip)>0) for (i in 1:length(tip)) {
ind <- (1:nt)[as.logical(vtab[tip[i],])]
tip[i] <- ind # the term that smooth relates to
}
if (length(t2p)>0) for (i in 1:length(t2p)) {
ind <- (1:nt)[as.logical(vtab[t2p[i],])]
t2p[i] <- ind # the term that smooth relates to
}
if (length(zp)>0) for (i in 1:length(zp)) {
ind <- (1:nt)[as.logical(vtab[zp[i],])]
zp[i] <- ind # the term that smooth relates to
} ## re-referencing is complete
k <- kt <- kti <- kt2 <- ks <- kz <- kp <- 1 # counters for terms in the 2 formulae
len.sp <- length(sp)
len.tp <- length(tp)
len.tip <- length(tip)
len.t2p <- length(t2p)
len.zp <- length(zp)
ns <- len.sp + len.tp + len.tip + len.t2p + len.zp# number of smooths
pav <- av <- rep("",0)
smooth.spec <- list()
#mgcvat <- "package:mgcv" %in% search() ## is mgcv in search path?
mgcvns <- loadNamespace('mgcv')
if (nt) for (i in 1:nt) { # work through all terms
if (k <= ns&&((ks<=len.sp&&sp[ks]==i)||(kt<=len.tp&&tp[kt]==i)||(kz<=len.zp&&zp[kz]==i)||
(kti<=len.tip&&tip[kti]==i)||(kt2<=len.t2p&&t2p[kt2]==i))) { # it's a smooth
## have to evaluate in the environment of the formula or you can't find variables
## supplied as smooth arguments, e.g. k <- 5;gam(y~s(x,k=k)), fails,
## but if you don't specify namespace of mgcv then stuff like
## loadNamespace('mgcv'); k <- 10; mgcv::interpret.gam(y~s(x,k=k)) fails (can't find s)
## eval(parse(text=terms[i]),envir=p.env,enclos=loadNamespace('mgcv')) fails??
## following may supply namespace of mgcv explicitly if not on search path...
## If 's' etc are masked then we can fail even if mgcv on search path, hence paste
## of explicit mgcv reference into first attempt...
st <- try(eval(parse(text=paste("mgcv::",terms[i],sep="")),envir=p.env),silent=TRUE)
if (inherits(st,"try-error")) st <-
eval(parse(text=terms[i]),enclos=p.env,envir=mgcvns)
if (!is.null(textra)) { ## modify the labels on smooths with textra
pos <- regexpr("(",st$lab,fixed=TRUE)[1]
st$label <- paste(substr(st$label,start=1,stop=pos-1),textra,
substr(st$label,start=pos,stop=nchar(st$label)),sep="")
}
smooth.spec[[k]] <- st
if (ks<=len.sp&&sp[ks]==i) ks <- ks + 1 else # counts s() terms
if (kt<=len.tp&&tp[kt]==i) kt <- kt + 1 else # counts te() terms
if (kti<=len.tip&&tip[kti]==i) kti <- kti + 1 else # counts ti() terms
if (kt2<=len.t2p&&t2p[kt2]==i) kt2 <- kt2 + 1 # counts t2() terms
else kz <- kz + 1
k <- k + 1 # counts smooth terms
} else { # parametric
av[kp] <- terms[i] ## element kp on rhs of parametric
kp <- kp+1 # counts parametric terms
}
}
if (!is.null(off)) { ## deal with offset
av[kp] <- as.character(attr(tf,"variables")[1+off])
kp <- kp+1
}
pf <- paste(response,"~",paste(av,collapse=" + "))
if (attr(tf,"intercept")==0) {
pf <- paste(pf,"-1",sep="")
if (kp>1) pfok <- 1 else pfok <- 0
} else {
pfok <- 1;if (kp==1) {
pf <- paste(pf,"1");
}
}
fake.formula <- pf
if (length(smooth.spec)>0)
for (i in 1:length(smooth.spec)) {
nt <- length(smooth.spec[[i]]$term)
ff1 <- paste(smooth.spec[[i]]$term[1:nt],collapse="+")
fake.formula <- paste(fake.formula,"+",ff1)
if (smooth.spec[[i]]$by!="NA") {
fake.formula <- paste(fake.formula,"+",smooth.spec[[i]]$by)
av <- c(av,smooth.spec[[i]]$term,smooth.spec[[i]]$by)
} else av <- c(av,smooth.spec[[i]]$term)
}
fake.formula <- as.formula(fake.formula,p.env)
if (length(av)) {
pred.formula <- as.formula(paste("~",paste(av,collapse="+")))
pav <- all.vars(pred.formula) ## trick to strip out 'offset(x)' etc...
pred.formula <- reformulate(pav)
environment(pred.formula) <- environment(gf)
} else pred.formula <- ~1
ret <- list(pf=as.formula(pf,p.env),pfok=pfok,smooth.spec=smooth.spec,
fake.formula=fake.formula,response=response,fake.names=av,
pred.names=pav,pred.formula=pred.formula)
class(ret) <- "split.gam.formula"
ret
} ## interpret.gam0
interpret.gam <- function(gf,extra.special=NULL) {
## wrapper to allow gf to be a list of formulae or
## a single formula. This facilitates general penalized
## likelihood models in which several linear predictors
## may be involved...
##
## The list syntax is as follows. The first formula must have a response on
## the lhs, rather than labels. For m linear predictors, there
## must be m 'base formulae' in linear predictor order. lhs labels will
## be ignored in a base formula. Empty base formulae have '-1' on rhs.
## Further formulae have labels up to m labels 1,...,m on the lhs, in a
## syntax like this: 3 + 5 ~ s(x), which indicates that the same s(x)
## should be added to both linear predictors 3 and 5.
## e.g. A bivariate normal model with common expected values might be
## list(y1~-1,y2~-1,1+2~s(x)), whereas if the second component was contaminated
## by something else we might have list(y1~-1,y2~s(v)-1,1+2~s(x))
##
## For a list argument, this routine returns a list of split.formula objects
## with an extra field "lpi" indicating the linear predictors to which each
## contributes...
if (is.list(gf)) {
d <- length(gf)
## make sure all formulae have a response, to avoid
## problems with parametric sub formulae of the form ~1
#if (length(gf[[1]])<3) stop("first formula must specify a response")
resp <- gf[[1]][2]
ret <- list()
pav <- av <- rep("",0)
nlp <- 0 ## count number of linear predictors (may be different from number of formulae)
for (i in 1:d) {
textra <- if (i==1) NULL else paste(".",i-1,sep="") ## modify smooth labels to identify to predictor
lpi <- getNumericResponse(gf[[i]]) ## get linear predictors to which this applies, if explicit
if (length(lpi)==1) warning("single linear predictor indices are ignored")
if (length(lpi)>0) gf[[i]][[2]] <- NULL else { ## delete l.p. labels from formula response
nlp <- nlp + 1;lpi <- nlp ## this is base formula for l.p. number nlp
}
ret[[i]] <- interpret.gam0(gf[[i]],textra,extra.special=extra.special)
ret[[i]]$lpi <- lpi ## record of the linear predictors to which this applies
## make sure all parametric formulae have a response, to avoid
## problems with parametric sub formulae of the form ~1
respi <- rep("",0) ## no extra response terms
if (length(ret[[i]]$pf)==2) {
ret[[i]]$pf[3] <- ret[[i]]$pf[2];ret[[i]]$pf[2] <- resp
respi <- rep("",0)
} else if (i>1) respi <- ret[[i]]$response ## extra response terms
av <- c(av,ret[[i]]$fake.names,respi) ## accumulate all required variable names
pav <- c(pav,ret[[i]]$pred.names) ## predictors only
}
av <- unique(av) ## strip out duplicate variable names
pav <- unique(pav)
if (length(av)>0) {
## work around - reformulate with response = "log(x)" will treat log(x) as a name,
## not the call it should be...
fff <- formula(paste(ret[[1]]$response,"~ ."))
ret$fake.formula <- reformulate(av,response=ret[[1]]$response)
environment(ret$fake.formula) <- environment(gf[[1]])
ret$fake.formula[[2]] <- fff[[2]] ## fix messed up response
} else ret$fake.formula <- ret[[1]]$fake.formula ## create fake formula containing all variables
ret$pred.formula <- if (length(pav)>0) reformulate(pav) else ~1 ## predictor only formula
environment(ret$pred.formula) <- environment(gf[[1]])
ret$response <- ret[[1]]$response
ret$nlp <- nlp ## number of linear predictors
for (i in 1:d) if (max(ret[[i]]$lpi)>nlp||min(ret[[i]]$lpi)<1) stop("linear predictor labels out of range")
class(ret) <- "split.gam.formula"
return(ret)
} else interpret.gam0(gf,extra.special=extra.special)
} ## interpret.gam
## Now some test code.
environment(interpret.gam) <- environment(mgcv::interpret.gam)
environment(interpret.gam0) <- environment(mgcv:::interpret.gam0)
assignInNamespace("interpret.gam", interpret.gam, "mgcv")
assignInNamespace("interpret.gam0", interpret.gam0, "mgcv")
set.seed(123)
mod.gam.1 <- local({
  var.fun <- function(x){x^2}
  x <- runif(100)
  y <- x + rnorm(100, 0, var.fun(x))
  gam(formula = list(y ~ x,
                     ~ var.fun(x)),
      family = gaulss(link = list("log", "logb")))
})

pred.fun.1 <- function(x){
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}
pred.fun.1(1)