June 22, 2023, 18:20:18

Python ANN - Getting High Accuracy Results Even with over-complex model

Question

I am new to machine learning in Python and have an ANN task at hand. I have developed a model using the Keras library and tried different sets of layers, varying the activation function and the number of neurons.

The task is to classify a dataset that has 11 features; the output column is categorical and consists of three classes (multilabel classification, I think).

The problem is that the model's performance exceeds my expectations. It yields the following metrics:

F1 score : 0.918
Accuracy : 0.924
Precision : 0.897
Recall : 0.924

How can I prove the model is not overfitting?
Should I encode all categorical data such as "Gender", or should I just remove the feature?
When scaling the dataset, should I scale the label-encoded categorical data too?
With the dataset being imbalanced, does that affect how I should implement the classification? Should I balance it?

I have tried different options through trial and error, but I don't know the correct way to implement this multilabel classification.

I have also tried implementing an over-complex model to see how overfitting happens, but strangely I got better results.

I was wondering if anyone could help me here.

Answer 1

Score: 1

To answer your questions:

To check for overfitting, you should use a train/validation/test split.

The train split is used for training: it is the data the loss is calculated on, and your model is optimized to get the best score on it.

The validation split is used to check, during training, how your model performs on data it has not seen. It is the split you use to check whether your model is underfitting or overfitting. You adjust your model (tuning hyperparameters such as learning rate, number of layers, etc.) while looking mostly at the scores on the validation split, trying to build the model that scores best on it.

Validation metrics are calculated at the end of each training epoch, the same as training metrics. This lets you see when the validation metrics hit a plateau or start getting worse (the model is overfitting). On the training data you should generally see improvement every epoch, because that is what your model is supposed to do (optimize the loss on the training data).

[Figure: overfitting example]

As you can see in the picture above, the orange line (validation) starts showing worse loss while the training loss is still improving. This is an example of overfitting. In most settings, you will see overfitting quite quickly with small-ish data or easy tasks.

You can easily set up training with validation in Keras and track both training and validation metrics. You can use the validation_split argument of model.fit() directly to specify how much of your data should be held out for validation (e.g. validation_split=0.2 will use 20% of your data for validation).
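The check described above can also be done programmatically once you have the per-epoch losses. The following is only a minimal sketch: the loss values are made up for illustration, and the helper names `best_epoch` and `stopped_epoch` are hypothetical. In Keras, the two lists would typically come from the `History` object returned by `model.fit()`, as `history.history["loss"]` and `history.history["val_loss"]`.

```python
def best_epoch(val_loss):
    """Return the 0-based epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)


def stopped_epoch(val_loss, patience=3):
    """Return the epoch at which patience-based early stopping would halt.

    Training stops once validation loss has failed to improve on its best
    value for `patience` consecutive epochs. Returns None if that never happens.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_loss):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up curves: training loss keeps falling, validation loss turns around.
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23]
val_loss = [0.95, 0.78, 0.66, 0.60, 0.62, 0.65, 0.70, 0.76]

print(best_epoch(val_loss))     # 3
print(stopped_epoch(val_loss))  # 6
```

Keras ships a ready-made version of this idea as the `EarlyStopping` callback (monitoring `val_loss` with a `patience` setting), so in practice you would probably use that rather than hand-rolled code.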

The test dataset is used for testing the model on a part of the data that you do not use during model development. The model is not trained on it, and you do not tune your parameters for the highest score on it. It is supposed to represent a real-life scenario of receiving new data once your model is deployed. You should only use it when you are done developing your model, that is, after you have tuned your parameters to get the best scores on the validation dataset. Test loss is useful for comparing different models.
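To make the three-way split concrete, here is a minimal sketch in plain Python. The function name `train_val_test_split`, the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration, not something prescribed above.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle rows once and cut them into train, validation and test lists."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical data: 100 sample ids standing in for real feature rows.
samples = list(range(100))
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))  # 60 20 20
```

Note that this sketch shuffles without regard to class labels. With an imbalanced dataset you may prefer a stratified split, so that each part keeps roughly the same class proportions.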

On encoding categorical features: it depends. To make any use of this data you have to encode it. However, it might turn out that the feature is not useful at all for your model's performance. There are techniques for checking feature importance. In your case you could check the correlation of those features with your target or with other features. It is a very popular topic in machine learning and you can easily find some great materials.
On scaling: by scaling, do you mean normalizing/standardizing? If so, you can simply test it, but it shouldn't be much of an issue with a neural network and 0/1 features like gender.
On imbalance: dataset imbalance is a very serious issue for classification tasks. It is very important to use proper metrics to score your models in such a setup. The classic example is cancer detection. If your dataset consists of pictures where 99% show healthy people and only 1% show cancer, your model could always predict that someone is healthy and still get 99% accuracy, even though it never detected a single cancer case. You are already using precision and recall and your scores are not bad, so this might not be the case for you. Sometimes your classifier will learn more easily if you provide it with more balanced data; you can learn how to train such a classifier in this tutorial from the TensorFlow team.
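
The 99%/1% example can be reproduced in a few lines. This is only a sketch: the labels are synthetic, and the helper names `per_class_recall` and `balanced_class_weights` are hypothetical. The weighting formula, n / (k * count), is a common "balanced" heuristic (it is what scikit-learn's `class_weight="balanced"` computes) and is an addition here, not something taken from the answer.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / actual members of the class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_class_weights(y_true):
    """Weights inversely proportional to class frequency: n / (k * count)."""
    counts = Counter(y_true)
    n, k = len(y_true), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Synthetic labels mirroring the example: 990 healthy (0), 10 cancer (1).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a "model" that always predicts healthy

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.99
print(per_class_recall(y_true, y_pred))  # {0: 1.0, 1: 0.0}
print(balanced_class_weights(y_true))    # rare class gets the larger weight
```

Accuracy looks excellent while recall on the rare class is zero, which is why per-class metrics matter here. A weight dictionary like the one above can be passed to Keras through the `class_weight` argument of `model.fit()`, so that errors on the rare class count more in the loss.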

It's a bit outside the scope of your question, but the truth is that neural networks sometimes aren't the best tool. For tabular data many different methods are used, and among the most popular are XGBoost and LightGBM. You could even start with a simple Random Forest or Decision Tree from sklearn. You might want to try them out for your problem. However, most of the information in my answer is still relevant for training them.
