SKlearn classifier's predict_proba doesn't sum to 1
Question
I have a classifier (in this case sklearn.neural_network.MLPClassifier) with which I'm trying to perform classification into one of 18 classes.
The task is thus multi-class, not multi-label: I'm trying to predict only a single class.
I have my training data X with X.shape = (103393, 300) and targets Y with Y.shape = (103393, 18), where each row of Y is a one-hot encoded vector denoting the target class.
> EDIT in response to @Dr. Snoopy: I do not supply any class labels -- I simply pass the 18-dimensional vector in which the correct class's index holds a 1 and all other entries are 0 (a one-hot encoded vector).
> To prove that the vectors are correctly one-hot encoded, I can run

```python
import pandas as pd
pd.DataFrame(Y.sum(axis=1)).value_counts()
```

> This returns 103393 counts of 1. The vectors are correctly one-hot encoded, even upon examination.
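Equivalently, here is a quick numpy sketch of the same check (assuming the Y matrix above):

```python
import numpy as np

# Every entry should be 0 or 1, and every row should contain exactly one 1
assert set(np.unique(Y)) == {0, 1}
assert (Y.sum(axis=1) == 1).all()
```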
When I fit the model and return the class probabilities for all classes, the resulting probability vector does not sum to 1. Why might that be?
Here is an example of how I run the fitting:

```python
from sklearn.neural_network import MLPClassifier

X_train, Y_train, X_test, Y_test = get_data()
model = MLPClassifier(max_iter=10000)
model.fit(X_train, Y_train)
probability_vector = model.predict_proba(X_test[0, :].reshape(1, -1))
```
Some of the time, the outputs sum to something pretty close to 1, and I suspect that small error is just rounding. In other cases, though, the outputs sum to ~0.5 or less. Example output:
```python
probability_vector = list(model.predict_proba(X_test[301, :].reshape(1, -1))[0])
print(probability_vector)
>>> [1.7591416e-06,
 3.148203e-05,
 3.9732524e-05,
 0.3810972,
 0.059248358,
 0.00032832936,
 8.5996935e-06,
 9.0914684e-05,
 9.377927e-07,
 0.0007674346,
 1.5543707e-06,
 0.0008467222,
 0.009655427,
 2.5728454e-05,
 1.07812774e-07,
 0.00022920035,
 0.00050288404,
 0.013878004]
len(probability_vector)
>>> 18
sum(probability_vector)
>>> 0.46675437349917814
```
Why might this be happening? Is my model initialized incorrectly?
> Note: a couple of possible reasons for the error, and my comments on them:
>
> - Class imbalance: the classes in the dataset are indeed imbalanced. However, the non-1 summation problem happens for well-represented classes too, not just the underrepresented ones. Could this be a consequence of a model that is not expressive enough?
> - Model uncertainty: "The model may not have a high level of confidence in its predictions for every input." Is that all it is?
Answer 1
Score: 4
Do not one-hot encode your Y. MLPClassifier will do that for you internally using LabelBinarizer, and will then apply the softmax function correctly.
With a multi-column Y it will instead perform multi-label classification.
Some explanation is given in the docs.
You can check this, for example, by accessing model.out_activation_ or LabelBinarizer().fit(Y).y_type_. For a proper multi-class setup these should be softmax and multiclass respectively, but here they will be logistic and multilabel-indicator.
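A minimal sketch of that check, assuming the Y_train one-hot matrix and the fitted model from the question:

```python
from sklearn.preprocessing import LabelBinarizer

# A (n_samples, 18) indicator matrix is inferred as a multi-label target
print(LabelBinarizer().fit(Y_train).y_type_)  # 'multilabel-indicator'

# ...so the fitted network uses per-class sigmoids at the output layer
print(model.out_activation_)                  # 'logistic'
```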
What you get at the moment are the independent logistic (sigmoid) outputs of the individual classes, which are under no constraint to sum to 1.
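A minimal sketch of the fix, assuming the get_data() split from the question: collapse each one-hot row back to an integer label with argmax before fitting.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train, Y_train, X_test, Y_test = get_data()

# Collapse the (n_samples, 18) one-hot matrix to integer class labels
y_train = np.argmax(Y_train, axis=1)

model = MLPClassifier(max_iter=10000)
model.fit(X_train, y_train)

print(model.out_activation_)  # now 'softmax'
proba = model.predict_proba(X_test[0, :].reshape(1, -1))
print(proba.sum())            # ~1.0, up to floating-point rounding
```

With integer labels the output layer is a softmax, so each row of predict_proba sums to 1 by construction.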