How much data do I need to properly train my ML model?
Question
I collected 8800 samples total, but after data cleaning and outlier detection I was left with 3507 samples.
Is this enough to put through machine learning models? (lasso, linear regression, decision tree)
Should I scrape more?
I expect more data is needed, but I want to check with others before wasting time.
Also, how much data should I use for training and testing?
Answer 1
Score: 1
When it comes to machine learning, more data is generally better.
In general, as your model gets more complex, you'll need more data to prevent overfitting.
For example, a single-variable linear regression requires less data to train than a convolutional neural network. This is because the neural network has more weights than the single-variable model.
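To make the difference in weight counts concrete, here is a rough sketch in Python. The tiny CNN is a hypothetical architecture chosen only for illustration (one 3x3 convolution with 8 filters on a 28x28 grayscale image, followed by a dense layer with 10 outputs); the helper names are made up too.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output for a fully connected layer."""
    return (n_inputs + 1) * n_outputs


def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per filter for a 2D convolution layer."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


# Single-variable linear regression: one slope and one intercept.
linear_regression = dense_params(1, 1)

# Hypothetical tiny CNN (assumption, not from the question):
# 3x3 conv, 1 input channel, 8 filters, no padding -> 26x26x8 feature map,
# flattened into a dense layer with 10 outputs.
conv_layer = conv2d_params(3, 3, 1, 8)
dense_layer = dense_params(26 * 26 * 8, 10)
tiny_cnn = conv_layer + dense_layer

print("linear regression parameters:", linear_regression)  # 2
print("tiny CNN parameters:", tiny_cnn)                    # 54170
```

Even this very small network has tens of thousands of parameters to fit, compared with two for the regression.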
Unfortunately, a simpler model usually has less predictive power than a more complex one. In our example, a single-variable linear regression will tend to give predictions farther from the actual values than a neural network when the target depends on more than that single input.
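As for the train/test split, I recommend randomly shuffling all the data, then using 80% for training and 20% for testing. To check that your model is a good fit regardless of which samples end up in the training set, repeat this with different splits. K-fold cross-validation does this systematically: the data is divided into K roughly equal folds, and each fold serves once as the test set while the remaining folds are used for training (K = 5 gives an 80/20 split each time).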
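A minimal sketch of how the 5-fold split could look for 3507 samples, using only the Python standard library. The function name `k_fold_indices` and the seed are my own choices for illustration; in practice you would more likely use something like scikit-learn's `KFold` or `cross_val_score`.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random order, reproducible via seed

    # Spread the remainder so fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(3507, k=5))
for i, (train_idx, test_idx) in enumerate(folds):
    # You would fit the model on train_idx and score it on test_idx here.
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

Each sample lands in a test fold exactly once, so every split trains on about 2805 samples and tests on about 702. If the scores vary a lot between folds, that is a hint the model is sensitive to which data it sees.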