How much data do I need to properly train my ML model?

Question

I collected 8800 samples total, but after data cleaning and outlier detection I was left with 3507 samples.

Is this enough to put through machine learning models? (lasso, linear regression, decision tree)
Should I scrape more?

I expect more data is needed, but I want to check with others before wasting time.

Also, how much data should I use for training and testing?

Answer 1

Score: 1

When it comes to machine learning, more data is almost always better.

In general, as your model gets more complex, you'll need more data to prevent overfitting.

For example, a single-variable linear regression requires less data to train than a convolutional neural network. This is because the neural network has more weights than the single-variable model.

The trade-off is that a simple model usually has less predictive power than a complex one. In this example, if the target variable depends on more than the single input, the linear regression's predictions will typically be farther from the actual values than a neural network's.
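One way to check whether more data would help a given model is to train it on growing subsets and watch the error on held-out data. Below is a minimal sketch of that idea for a single-variable linear regression. The synthetic data (`y = 3x + 2` plus noise), the helper names, and the chosen training sizes are assumptions for illustration only, not part of the original question.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (single input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(xs, ys, train_sizes, n_test):
    """Hold out the last n_test points, train on growing prefixes of the rest."""
    test_x, test_y = xs[-n_test:], ys[-n_test:]
    curve = []
    for size in train_sizes:
        a, b = fit_line(xs[:size], ys[:size])
        curve.append((size, mse(a, b, test_x, test_y)))
    return curve

# Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(3507)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

curve = learning_curve(xs, ys, train_sizes=[10, 100, 1000, 2800], n_test=700)
for size, err in curve:
    print(f"train size {size:>5}: held-out MSE {err:.3f}")
```

If the held-out error has stopped improving well before the largest training size, collecting more samples is unlikely to help that model much. If it is still dropping, more data may be worth the effort.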

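As for the train/test split, I recommend shuffling all the data randomly, then using 80% for training and 20% for testing. To check that your model is a good fit regardless of which samples end up in the training set, repeat this with different splits. The systematic way to do that is K-fold cross-validation: divide the shuffled data into K equal parts (K = 5 gives an 80/20 split each time), test on each part in turn while training on the rest, and average the results.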
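As a rough illustration, here is a minimal sketch of a 5-fold split written with only the Python standard library, in which every sample is used for testing exactly once and each fold is roughly an 80/20 split. The function name `k_fold_indices` and the seed are hypothetical choices for this example. The sample count 3507 comes from the question.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random ordering before splitting
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(3507, k=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: {len(train_idx)} train / {len(test_idx)} test")
```

In practice you would fit the model on `train_idx`, score it on `test_idx`, and average the scores over the folds. Libraries such as scikit-learn provide this out of the box (for example `KFold` and `cross_val_score` in `sklearn.model_selection`), so a hand-rolled version like this is mainly useful for seeing what happens under the hood.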

huangapple
  • Posted on June 13, 2023, 07:08:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76460804.html