Handling missing values with imputation in nested resampling without leaking information from train to test
Question
I am looking through the documentation for nested resampling in the mlr3tuning package, and I do not see how the package can handle NA values in a way that avoids information bleed between the training and hold-out sets. Such leakage would produce overly optimistic performance estimates. Ideally, I would like to split my data in a nested resampling procedure such that:
full_data = N
train = N - holdout
test = holdout
I could then impute the train and test datasets separately, fit the model on train, and predict on test. After that I would draw a new train/holdout split from the full dataset, impute each part separately again, and repeat the train/predict cycle for the number of outer_loops.
Is there a way of doing this? Am I missing something obvious?
Answer 1
Score: 3
mlr3 handles all of this for you if you use pipelines (see the relevant part of the mlr3 book). If you make imputation part of such a pipeline, mlr3 fits the imputation on the training set and applies it to the test set in each resampling iteration, just as it does for the model itself.
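A minimal sketch of such a pipeline, assuming the mlr3, mlr3pipelines, mlr3tuning and paradox packages are installed. The `pima` task, median imputation, the rpart learner, the tuned `cp` range and the fold counts are illustrative choices that do not come from the question, and the `auto_tuner()` argument names follow recent mlr3tuning releases (older versions may differ):

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
library(paradox)

set.seed(1)

# Example task that contains missing values (illustrative choice)
task = tsk("pima")

# Imputation is part of the learner, so it is re-fit on every training set
graph = po("imputemedian") %>>%
  lrn("classif.rpart", cp = to_tune(1e-4, 1e-1, logscale = TRUE))
learner = as_learner(graph)

# Inner loop: tune cp with 3-fold CV, using only the outer training data
at = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 10
)

# Outer loop: 5-fold CV for the performance estimate
rr = resample(task, at, rsmp("cv", folds = 5))
score = rr$aggregate(msr("classif.ce"))
print(score)
```

Because the imputation step lives inside the learner that is passed to `resample()`, there is no need to split and impute the data by hand for each outer iteration.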
As a brief explanation: just as with the machine learning model itself, you don't want to make any adjustments based on the test set. In particular, you shouldn't impute based on test data. Doing so causes the same kind of problem as fitting a model on test data, namely biased evaluation results that may not be representative of the true generalization error.
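To see the principle in isolation, here is a small sketch that fits a median imputer on the training rows only and then applies it to the held-out rows. It assumes a reasonably recent mlr3 (for `partition()`) and mlr3pipelines. The `pima` task and the 80/20 split are illustrative assumptions:

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)

task = tsk("pima")
split = partition(task, ratio = 0.8)

# Separate train and test tasks (clone first, since filter() modifies in place)
train_task = task$clone()$filter(split$train)
test_task = task$clone()$filter(split$test)

imp = po("imputemedian")

# Medians are computed from the training rows only
train_imputed = imp$train(list(train_task))[[1]]

# The stored training medians are reused to fill NAs in the test rows
test_imputed = imp$predict(list(test_task))[[1]]

# Number of missing values remaining in the imputed test task
remaining = sum(test_imputed$missings())
print(remaining)
```

The test rows never influence the imputation values here, and a pipeline inside `resample()` does the same thing automatically for every split.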