2023年6月13日 07:08:19go评论100阅读模式

英文:

How much data do I need to properly train my ML model?

问题

我收集了总共8800个样本，但在数据清理和异常值检测后，我剩下了3507个样本。

这个数量足够用于机器学习模型吗？（套索回归、线性回归、决策树）
我应该再爬取更多数据吗？

我预计需要更多数据，但我想在浪费时间之前与其他人核实。

此外，我应该使用多少数据进行训练和测试？

英文:

I collected 8800 samples total, but after data cleaning and outlier detection I was left with 3507 samples.

Is this enough to put through machine learning models? (lasso, linear regression, decision tree)
Should I scrape more?

I expect more data is needed, but I want to check with others before wasting time.

Also, how much data should I use for training and testing?

答案1

得分: 1

关于机器学习，拥有更多数据总是更好的。

一般来说，随着模型变得更复杂，您需要更多数据来防止过拟合。

例如，单变量线性回归需要的训练数据比卷积神经网络要少。这是因为神经网络有更多的权重，而单变量模型较少。

不幸的是，简单模型的预测能力较差，而复杂模型更强大。 在我们的示例中，这意味着线性回归在建模依赖多个输入的变量时，其预测结果会比神经网络更远离实际值。

至于训练/测试拆分，我建议随机排序所有数据，然后使用80%的数据进行训练，20%的数据进行测试。多次重复此过程以检查您的模型是否适合，无论选择哪些训练数据，这被称为K折交叉验证。

英文:

When it comes to Machine Learning, more data is always better

In general, as your model gets more complex, you'll need more data to prevent overfitting.

For example, a single-variable linear regression requires less data to train than a convolutional neural network. This is because the neural network has more weights than the single-variable model.

Unfortunately, a simple model has less predictive power than a complex one. In our example, this means the linear regression will yield a prediction farther from the actual value than a neural network when trying to model a variable that depends on more than the single input.

As for train/test split, I recommend randomly ordering all the data, and then using 80% for training and 20% for testing. Repeating this process multiple times to check if your model is a good fit regardless of training data selected is called K-Fold Cross Validation

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

需要多少数据才能有效地训练我的机器学习模型？

问题

答案1

使用psd-tools更改文本图层

如何在PyQt5中使表格小部件中的进度条跨足两个单元格？

Pandas “Consecutive”/Rolling Percent Rank

如何在OpenCV-Python 4.8.0中估计单个标记的姿态？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。