2023年6月8日 13:24:36go评论99阅读模式

英文:

Time issues with training a H2o Autoencoder

问题

我在想是否有人知道我的H2o自动编码器在训练过程中是否存在任何明显的问题，可能会导致它训练时间如此之长？或者是否有人知道我如何可以减少训练这个模型所需的时间，无论是在数据集还是模型构建方面。任何帮助将不胜感激！非常感谢！

我一直在对一个只包含独热编码分类列的数据集上训练H2o自动编码器。数据集的形状为（7762,2232），模型训练大约需要5小时。构建模型的代码如下：

model = H2ODeepLearningEstimator(
    autoencoder = True,
    seed = -1
    hidden = [2000,1000,500,250,125,50],
    epochs = 30,
    activation = &quot;Tanh&quot;
)

英文:

I was wondering if anyone knows if there are any glaring problems with the way my H2o autoencoder is being trained that could cause it to take so long? Or if anyone knows of any way I can reduce the time it takes to train this model, both with the dataset and the model construction. Any help would be greatly appreciated! Thank you very much!

I have been training a H2o autoencoder on a dataset consisting of just one-hot-encoded categorical columns. The dataset is of shape (7762,2232) and the model took about 5 hours to train. The code for building the model is as follows:

model = H2ODeepLearningEstimator(
    autoencoder = True,
    seed = -1
    hidden = [2000,1000,500,250,125,50],
    epochs = 30,
    activation = &quot;Tanh&quot;
)

答案1

得分: 1

问题在于列的数量。虽然行数控制整体训练时间，但列的数量控制每行的训练时间。有2232列相当多。如果您可以进行一些数据处理并减少使用的预测变量数量，肯定会加快训练速度。

您还可以尝试以下方法：

将stopping_tolerance设置为更高的数字：0.1或更高。这将启用提前停止，如果某些指标的平均改善与上次相比没有提高0.1，则停止训练；
如果要在120秒后停止模型构建，可以将max_runtime_secs设置为120；
将score_training_samples从默认值10000减少到5000。这将在较少数量的样本上执行评分，从而可以减少训练时间。

请注意，像1、2中提前停止模型可能会减少模型的训练时间，但可能不适合您的数据。

英文:

The problem here is the number of columns. While the number of rows control the overall training time, the number of columns control the training time per row. Having 2232 is quite a lot. If you can do some data munging and reduce the number of predictors you use, it will definitely speed up training.

You can also try the following:

set stopping_tolerance to a higher number: 0.1 or higher. This will enable early stopping to stop training if the average improvement in some metrics does not improve by 0.1 compared to the last one;
set max_runtime_secs=120 if you want to stop the model building after 120 seconds
reduce score_training_samples from default of 10000 to say 5000. This will perform scoring on a smaller number of samples and hence can reduce training time.

Note that stopping the model early as in 1, 2 may reduce the model training time but will get you a model that may not be a good fit for your data.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

关于训练H2o自编码器的时间问题

问题

答案1

最好的方法是如何迭代每一行，针对以下情况？

无预训练权重的迁移学习

无法使用Pandas在多个列上进行连接。

pandas 频率表与缺失值

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。