2023-02-17 23:48:45

Missing value handling with imputation in a nested resampling procedure such that there is no information bleed from train to test

Question

I am looking through the documentation for the nested resampling procedure in the mlr3tuning package, and I do not see any way for the package to handle NA values without information bleeding between the training and hold-out sets, which would result in overly optimistic performance estimates. Ideally, I would like to split my data in a nested resampling procedure such that:

full_data = N

train = N - holdout

test = holdout

Then I could impute the train and test datasets separately, fit the model on train, and predict on test. After that I would select a new holdout and train set from the full dataset, impute them separately again, train, predict, and repeat for the number of outer_loops.

Is there a way of doing this? Am I missing something obvious?

Answer 1

Score: 3

mlr3 handles all of this for you if you use pipelines (see the relevant part of the mlr3 book). If you make imputation part of such a pipeline, the imputation is fitted on the training data of each resampling split and then applied to the corresponding test data, just like the model itself.

As a brief explanation: just as with the machine learning model, you don't want to make any adjustments based on the test set; in particular, you shouldn't impute based on test data. Doing so causes the same kind of problem as fitting a model on test data, i.e. biased evaluation results that may not be representative of the true generalization error.
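To make the idea concrete, here is a minimal, language-agnostic sketch in plain Python of what "imputation as part of the pipeline" amounts to inside a nested resampling loop. It is not mlr3 code: every function name below (`fit_imputer`, `apply_imputer`, `fit_and_score`, `nested_cv`), the toy nearest-centroid model, and the synthetic data are assumptions made purely for illustration. The point it shows is that the fill values are learned from the training rows of each split and then reused unchanged on the test rows.

```python
import random
import statistics


def fit_imputer(rows, strategy="median"):
    """Learn one fill value per column from the given rows only."""
    stat = statistics.median if strategy == "median" else statistics.fmean
    fill = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        fill.append(stat(observed) if observed else 0.0)  # fallback if a column is all-missing
    return fill


def apply_imputer(rows, fill):
    """Replace missing entries (None) with previously learned fill values."""
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def fit_centroids(X, y):
    """Toy model (illustrative only): one mean vector per class."""
    return {
        label: [statistics.fmean(col) for col in zip(*[x for x, t in zip(X, y) if t == label])]
        for label in set(y)
    }


def predict(centroids, X):
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist(centroids[c], x)) for x in X]


def kfold(indices, k):
    """Yield (train, test) index lists for k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [ix for j, f in enumerate(folds) if j != i for ix in f]
        yield train, folds[i]


def fit_and_score(X, y, train, test, strategy):
    fill = fit_imputer([X[i] for i in train], strategy)   # learned from train rows only
    X_tr = apply_imputer([X[i] for i in train], fill)
    X_te = apply_imputer([X[i] for i in test], fill)      # test rows reuse the train fill values
    model = fit_centroids(X_tr, [y[i] for i in train])
    pred = predict(model, X_te)
    return sum(p == y[i] for p, i in zip(pred, test)) / len(test)


def nested_cv(X, y, outer_k=3, inner_k=3, strategies=("mean", "median"), seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    outer_scores = []
    for outer_train, outer_test in kfold(idx, outer_k):
        # Inner loop: pick the imputation strategy using outer-train rows only.
        def inner_score(s):
            return statistics.fmean(
                fit_and_score(X, y, tr, te, s) for tr, te in kfold(outer_train, inner_k)
            )
        best = max(strategies, key=inner_score)
        # Outer loop: refit on all outer-train rows, evaluate once on the held-out rows.
        outer_scores.append(fit_and_score(X, y, outer_train, outer_test, best))
    return outer_scores


# Synthetic two-class data with roughly 20% missing values (illustrative only).
rng = random.Random(1)
X, y = [], []
for i in range(60):
    label = i % 2
    row = [rng.gauss(5.0 * label, 1.0) for _ in range(2)]
    X.append([None if rng.random() < 0.2 else v for v in row])
    y.append(label)

scores = nested_cv(X, y)
print(scores)  # one accuracy value per outer fold
```

Note that the test rows are never imputed "separately" from their own statistics, which is what the question proposes. They are filled with values learned on the training rows, so nothing about the hold-out distribution leaks into fitting, and the same fitted imputer could later be applied to a single new observation.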
