Training on entire dataset in AutoML function of h2o
Question
I am using the h2o.automl function in R; the call is shown below:
h2o.automl(
x = x_name,
y = y_name,
training_frame = as.h2o(train),
leaderboard_frame = as.h2o(test),
max_runtime_secs = 20*60,
exclude_algos = c("XGBoost")
)
So I'm confused about the final fit on the entire dataset after getting the leader model from this function. In this case, cross-validation is applied to the training data to find the best models, and leaderboard_frame is used only for scoring. So the test subset is not used in any part of training? After finding the best model with the cross-validation folds, does h2o.automl fit a model on the entire dataset?
I ask because I would like to use this model operationally, trained on the entire dataset, since I do not want the operational model to lose any information. What if I don't provide a leaderboard_frame? I know that performance on the cross-validation folds is shown in that case, but will h2o.automl fit a final model to the entire dataset after finding the best hyperparameters and models with the cross-validation folds?
In other words, in a Kaggle competition, how can I use h2o.automl so that the entire dataset is used when predicting unseen data? By the way, it is a time-series forecasting competition, and the time of year has a crucial effect on the model. The hosts have provided 10 years of hourly time-series data, and June is the month they want predicted. I would like my model to perform better in June using h2o.automl; what do you suggest in this case?
One last question: to build a July-specific model, would you train it by filtering the July months out of the training dataset and finding the hyperparameters that perform well on those July months? Or would you include the July months in the training data? In that case, what would your train/test/validation and cross-validation subsets look like? Since I would like to use the h2o.automl function, can you please frame your answer in terms of h2o.automl?
Answer 1
Score: 1
If you want to use the whole dataset for training, pass it as the training_frame only. To make sure cross-validation is used, set nfolds to a number greater than 1 or specify fold_column.
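A minimal sketch of that setup, assuming a running local H2O cluster. The toy `train` and `test` data frames, the `make_rows` helper, and the column names are hypothetical stand-ins for the data in the question; only `training_frame`, `nfolds`, `max_runtime_secs` and `exclude_algos` come from the thread.

```r
library(h2o)
h2o.init()

# Hypothetical toy data standing in for the `train` and `test` data frames
# from the question (columns and sizes are made up for illustration).
set.seed(1)
make_rows <- function(n) {
  d <- data.frame(hour  = sample(0:23, n, replace = TRUE),
                  month = sample(1:12, n, replace = TRUE))
  d$target <- sin(2 * pi * d$hour / 24) + 0.1 * d$month + rnorm(n, sd = 0.2)
  d
}
train <- make_rows(2000)
test  <- make_rows(500)

# Stack everything so that no rows are held back from training.
full <- rbind(train, test)

aml <- h2o.automl(
  x = c("hour", "month"),
  y = "target",
  training_frame = as.h2o(full),  # the only frame passed in
  nfolds = 5,                     # explicit k-fold cross-validation
  max_runtime_secs = 60,
  exclude_algos = c("XGBoost"),
  seed = 1
)

# With no leaderboard_frame, the leaderboard is ranked on cross-validation metrics.
print(aml@leaderboard)

# Predict on new rows with the leader model.
preds <- h2o.predict(aml@leader, as.h2o(make_rows(10)))
```

For hourly time-series data, random folds can mix neighbouring hours across folds, so a fold_column (for example, one built from the year) may be a better fit than a plain nfolds value; that choice is a suggestion here, not something the answer prescribes.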
If the data is big enough (relative to the computational cluster), AutoML can decide to use "blending mode" instead of cross-validation: it internally splits the data into training and validation sets, then uses the validation metrics to sort the leaderboard and to train the Stacked Ensembles. This happens only when nfolds=0.
To answer your last question: if you have that much data, I would try using July as the leaderboard frame. Once AutoML has selected the best model on unseen July data, I would use that model's parameters to train a new model on the whole dataset, including July. But that's just how I'd do it, not necessarily the best approach.
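The first half of that workflow could look roughly like the sketch below, again assuming a running local H2O cluster. The toy hourly data frame and its column names are hypothetical, and holding out only the most recent July (rather than every July) is an assumption made for this example, not something the answer specifies.

```r
library(h2o)
h2o.init()

# Hypothetical toy hourly series standing in for the competition data.
set.seed(1)
ts <- seq(as.POSIXct("2019-01-01 00:00", tz = "UTC"),
          as.POSIXct("2020-12-31 23:00", tz = "UTC"), by = "hour")
df <- data.frame(
  year  = as.integer(format(ts, "%Y")),
  month = as.integer(format(ts, "%m")),
  hour  = as.integer(format(ts, "%H"))
)
df$target <- sin(2 * pi * df$hour / 24) + 0.1 * df$month +
  rnorm(nrow(df), sd = 0.2)

# Assumption: hold out the most recent July as unseen data for ranking models.
is_holdout <- df$month == 7 & df$year == max(df$year)
train_df <- df[!is_holdout, ]
july_df  <- df[is_holdout, ]

aml <- h2o.automl(
  x = c("month", "hour"),
  y = "target",
  training_frame    = as.h2o(train_df),
  leaderboard_frame = as.h2o(july_df),  # models are ranked on July only
  max_runtime_secs  = 60,
  # Ensembles are excluded so the leader is a single model with its own parameters.
  exclude_algos     = c("XGBoost", "StackedEnsemble"),
  seed = 1
)

print(aml@leaderboard)

# Inspect which algorithm won and the parameters it was trained with.
print(aml@leader@algorithm)
str(aml@leader@parameters)
```

The second half, refitting on all rows including July, would then pass the printed settings by hand to the matching algorithm function (for example h2o.gbm if the leader is a GBM) with the full data as training_frame.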