R, brms: saving models to file inside a function call saves the entire local environment

Question

I'm fitting some models in R using brms. The data are from an experiment with per-word reading times, and I want to fit the same kinds of models on data from different words, so I put the code to fit the models into a function that accepts the data to run the models on as an argument. I am saving the models to files so that I don't need to re-fit them for certain evaluations I'll be doing.

However, I've noticed that when I call the function that fits the models, the RDS files that brm saves grow larger and larger, even when the models should have the same number of parameters. I realize there will be a little variation due to the random nature of MCMC sampling, but what appears to be happening is that all of the data in the function environment at the point when the model is saved somehow ends up in the RDS file along with the model object. For instance, the first model has 11 parameters (3 fixed effects, and 1 intercept + 3 fixed effects for each of 2 random effects). This model takes up ~141 MB on disk. The second model has a different specification, but exactly the same number of parameters, and it takes up ~282 MB (2 x 141 MB) on disk. The third model has, again, the same number of parameters, and it takes up ~423 MB (3 x 141 MB) on disk, and so on.

Since these models take a long time to fit, I've made a MWE that shows the same behavior on a smaller dataset with fewer samples drawn (brms will complain about the ESS, but the point is that the models finish quickly so that the sizes of the saved files can be inspected).

library(brms)

fit.models <- function() {
	set.seed(0)
	m1 <- brm(
		formula = Sepal.Length ~ Sepal.Width,
		data = iris,
		cores = 4,
		chains = 4,
		iter = 100,
		file = 'm1-function.rds'
	)
	
	set.seed(0)
	m2 <- brm(
		formula = Sepal.Length ~ Petal.Length,
		data = iris,
		cores = 4,
		chains = 4,
		iter = 100,
		file = 'm2-function.rds'
	)
}

fit.models()

set.seed(0)
m1 <- brm(
	formula = Sepal.Length ~ Sepal.Width,
	data = iris,
	cores = 4,
	chains = 4,
	iter = 100,
	file = 'm1-global.rds'
)

set.seed(0)
m2 <- brm(
	formula = Sepal.Length ~ Petal.Length,
	data = iris,
	cores = 4,
	chains = 4,
	iter = 100,
	file = 'm2-global.rds'
)

Here's the result of running this on my computer:
[Image: listing of the saved .rds files and their sizes]

Note that m2-function.rds is roughly twice as large as m1-function.rds, while m1-global.rds is about the same size as m2-global.rds.

I'm not sure if this is unique to brms. However, I ran a test using some simple vectors and lists of random numbers, and all the file sizes come out exactly the same (5202 KB), regardless of whether the objects were saved from within a function.

test <- function() {
	x <- list(runif(1e6))
	saveRDS(x, 'x-function.rds')
	
	y <- list(runif(1e6))
	saveRDS(y, 'y-function.rds')
}

test()

x <- list(runif(1e6))
saveRDS(x, 'x-global.rds')

y <- list(runif(1e6))
saveRDS(y, 'y-global.rds')

So this doesn't seem to be default behavior in R for every object saved to RDS. Whatever it is, something about how brms saves files seems to be responsible. My guess is that it has to do with how it decides what to include from the calling environment, but I don't know how to control that.
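For what it's worth, the effect can be reproduced in base R without brms. The sketch below (the helper name `make_formula` and the variable names are made up for illustration) relies on two facts about R: a formula remembers the environment it was created in, and serialization writes out a non-global environment together with everything in it, whereas the global environment is only stored as a reference. Whether this is the whole story for brms objects is an assumption, but it matches the pattern in the question, where `m1` is still sitting in the function's environment when `m2` is saved.

```r
# Hypothetical helper: builds a formula inside a function that also holds a large object.
make_formula <- function() {
  big <- runif(1e6)  # about 8 MB of doubles living in the function's environment
  y ~ x              # the formula records this environment as its enclosure
}

f_local <- make_formula()

# Same formula, but pointing at the global environment,
# which is serialized as a reference rather than by content.
f_global <- f_local
environment(f_global) <- globalenv()

# Serialized size in bytes (uncompressed), without touching the disk.
size_local  <- length(serialize(f_local, NULL))
size_global <- length(serialize(f_global, NULL))

ls(environment(f_local))  # "big"
c(local = size_local, global = size_global)
```

The plain vectors and lists in the `test()` example carry no environment at all, which would explain why their file sizes don't depend on where `saveRDS` was called.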

In case it's not obvious, my question is the following: how can I stop this happening so the files don't take up gobs of unnecessary space? In my case, the fitted models can take up to 1 GB already in some cases, so including that in every subsequent saved model is quickly going to get out of hand.

Answer 1

Score: 2

I don't know if this will help or not, but I made some hacky functions for chopping out environment bits so that functions could be stored more compactly. I haven't experimented with these lately.

This is the kind of task that the butcher package is supposed to do, but at present it doesn't have any brms methods (but the functions below might be suitable for integration there ...)

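# S3 generic: strip bulky components from a fitted model before saving it
hack_size <- function(x, ...) {
    UseMethod("hack_size")
}

hack_size.stanfit <- function(x, ...) {
    # replace the stored stanmodel with an empty placeholder of the same class
    x@stanmodel <- structure(numeric(0), class="stanmodel")
    # swap the .MISC environment for a fresh one
    x@.MISC <- new.env()
    return(x)
}

hack_size.brmsfit <- function(x, ...) {
    # a brmsfit keeps its stanfit in the $fit element
    x$fit <- hack_size(x$fit)
    return(x)
}

hack_size.stanreg <- function(x, ...) {
    # a stanreg (rstanarm) keeps its stanfit in the $stanfit element
    x$stanfit <- hack_size(x$stanfit)
    return(x)
}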

After running

saveRDS(hack_size(m1), "m1-hack.rds")
saveRDS(hack_size(m2), "m2-hack.rds")

I get

 32M Apr  3 18:43 m2-function.rds
 22M Apr  3 18:43 m1-function.rds
 11M Apr  3 18:43 m1-global.rds
 11M Apr  3 18:43 m2-global.rds
 79K Apr  3 18:46 m1-hack.rds
 77K Apr  3 18:46 m2-hack.rds

I don't know exactly what functionality the hacked version is capable of, but I use this in the examples for broom.mixed, so they're not completely crippled ...
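If you want to see which parts of an object carry the weight before deciding what to trim, something like the following might help. It is only a sketch: `component_sizes` is a made-up helper, not part of brms or butcher, and the toy list stands in for a fitted model. It measures the serialized size of each top-level element of a list, so an element that drags an environment along shows up at the top.

```r
# Hypothetical helper: serialized size (in bytes) of each top-level element of a list.
component_sizes <- function(obj) {
  sizes <- vapply(
    obj,
    function(el) as.numeric(length(serialize(el, NULL))),
    numeric(1)
  )
  sort(sizes, decreasing = TRUE)
}

# Toy stand-in for a fitted model: the formula was created in an
# environment that also holds a large vector.
toy <- list(
  formula = local({ big <- runif(1e5); y ~ x }),
  data    = data.frame(x = 1:10, y = rnorm(10)),
  note    = "small"
)

component_sizes(toy)
```

Since a brmsfit is a list underneath, `component_sizes(unclass(m1))` should give a similar breakdown for a real model, though I haven't checked that against every brms version.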

Answer 2

Score: 1

This is an extension of Ben Bolker's answer. It didn't quite work for me because the environment was also being stored as part of the formula and the data. I also had to use new.env(parent = baseenv()) instead of just new.env(), since the latter didn't seem to work on its own for me. I also removed the line that replaced the stanmodel, since the Bayes factor analyses I'm doing require it to be there. I'm adding this as an additional answer rather than a comment since there were enough changes to the code that it wouldn't fit in a comment, and the formatting would be hard to follow.

So, in addition to Ben's code, I added these functions:

hack_size.brmsformula <- function(x) {
    environment(x$formula) <- new.env(parent = baseenv())
    return(x)
}

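hack_size.data.frame <- function(x) {
    # only touch the "terms" attribute when the data frame actually has one
    if (!is.null(attr(x, "terms"))) {
        environment(attr(x, "terms")) <- new.env(parent = baseenv())
    }
    return(x)
}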

And I modified the brmsfit and stanreg hack_size functions to call these:

hack_size.brmsfit <- function(x) {
    x$formula <- hack_size(x$formula)
    x$data <- hack_size(x$data)
    x$fit <- hack_size(x$fit)
    return(x)
}

hack_size.stanreg <- function(x) {
    x$formula <- hack_size(x$formula)
    x$data <- hack_size(x$data)
    x$stanfit <- hack_size(x$stanfit)
    return(x)
}

I also slightly modified the hack_size.stanfit function:

hack_size.stanfit <- function(x) {
    x@.MISC <- new.env(parent = baseenv())
    return(x)
}

This worked when fitting the models inside of the function. The reduction in file size isn't quite as dramatic as when removing the stanmodel, so that could be removed as well if you won't need it again. (Time will tell if there are side effects of this for my analysis pipeline, but everything seems to work for now.)
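A possible explanation for why plain new.env() wasn't enough: its `parent` argument defaults to `parent.frame()`, so an environment created inside a function has that function's own frame as its parent, and serialization follows the parent chain. The sketch below uses made-up stand-ins (`reset_default`, `reset_base`, and `payload` in place of the model object) to show the difference in base R. That this is what happens inside the hack_size methods is my assumption.

```r
# Hypothetical stand-ins for a trimming function;
# `payload` plays the role of the large model object held in the function's frame.
reset_default <- function(f) {
  payload <- runif(1e6)
  environment(f) <- new.env()  # parent defaults to this function's frame
  f
}

reset_base <- function(f) {
  payload <- runif(1e6)
  environment(f) <- new.env(parent = baseenv())  # parent chain stops at base
  f
}

# Serialized sizes in bytes (uncompressed).
size_default <- length(serialize(reset_default(y ~ x), NULL))
size_base    <- length(serialize(reset_base(y ~ x), NULL))

c(default = size_default, base = size_base)
```

With the default parent, the "fresh" environment still drags the function's frame (and `payload`) into the serialized object; with `baseenv()` as the parent, only the empty environment is written.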

huangapple
  • Posted on April 4, 2023, 06:11:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75924112.html