SKlearn classifier's predict_proba doesn't sum to 1

Question

I have a classifier (in this case, it is the sklearn.MLPClassifier), with which I'm trying to perform classification into one of 18 classes.

The class is thus multi-class, not multi-label. I'm trying to predict only a single class.

I have my training data: X with X.shape = (103393, 300) and Y with Y.shape = (103393, 18), where the target Y is a one-hot encoded vector denoting the target class.

> EDIT in response to @Dr. Snoopy: I do not supply any labels -- I simply pass the 18-dimensional vector with the correct class's index corresponding to the 1 in the vector, and all others being 0 (a one-hot encoded vector).
> To prove that the vectors are correctly one-hot encoded, I can run

import pandas as pd
pd.DataFrame(Y.sum(axis=1)).value_counts()

> This returns 103393 counts of 1. The vectors are correctly one-hot encoded, even upon examination.
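
The same sanity check can be written without pandas; a minimal equivalent sketch, assuming Y is a NumPy array:

import numpy as np
# Every row should contain exactly one 1 and nothing else.
assert (Y.sum(axis=1) == 1).all()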

When I fit the model and return the class probabilities for all classes, the probability vector does not sum to 1. Why might that be?

Here is an example of how I run the fitting:

from sklearn.neural_network import MLPClassifier

X_train, Y_train, X_test, Y_test = get_data()

model = MLPClassifier(max_iter=10000)
model.fit(X_train, Y_train)
probability_vector = model.predict_proba(X_test[0, :].reshape(1, -1))

Some of the time, the output sums are pretty close to 1, and there I suspect the error is probably just due to rounding.

In other cases, the outputs sum to ~0.5 or less. Example output:

probability_vector = list(model.predict_proba(X_test[301, :].reshape(1, -1))[0])
print(probability_vector)
>>> [1.7591416e-06,
 3.148203e-05,
 3.9732524e-05,
 0.3810972,
 0.059248358,
 0.00032832936,
 8.5996935e-06,
 9.0914684e-05,
 9.377927e-07,
 0.0007674346,
 1.5543707e-06,
 0.0008467222,
 0.009655427,
 2.5728454e-05,
 1.07812774e-07,
 0.00022920035,
 0.00050288404,
 0.013878004]

len(probability_vector)

>>> 18

sum(probability_vector)
>>> 0.46675437349917814


Why might this be happening? Is my model initialized incorrectly?

> Note: A couple of possible reasons for the error & my comments on them:
>
> - Class imbalance: The classes in the dataset are indeed imbalanced. However, the non-1 summation problem is happening in well-represented classes too, not just the underrepresented ones. Could this be a consequence of a model that is not expressive enough?
>
> - Model uncertainty: "The model may not have a high level of confidence in its predictions for every input." Is that all it is?

Answer 1

Score: 4

Do not one-hot encode your Y. The MLP classifier will do that for you using LabelBinarizer, and it will then apply the softmax function correctly.
If you pass Y with multiple dimensions, it will do multi-label classification instead.
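
As a minimal sketch of the fix (assuming X_train, Y_train, and X_test from the question's get_data(); np.argmax is one way to collapse the one-hot matrix back to integer labels):

import numpy as np
from sklearn.neural_network import MLPClassifier

# Collapse the (n_samples, 18) one-hot matrix back to integer class labels.
y_train = np.argmax(Y_train, axis=1)

model = MLPClassifier(max_iter=10000)
model.fit(X_train, y_train)  # 1-D labels -> multiclass, softmax output

proba = model.predict_proba(X_test)
# Each row is now a proper distribution over the 18 classes.
assert np.allclose(proba.sum(axis=1), 1.0)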

Some explanation is given in the docs.


You can check this, for example, by accessing model.out_activation_ or LabelBinarizer().fit(Y).y_type_, which should be softmax / multiclass, but here will be logistic / multilabel-indicator.
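
Concretely, such a check might look like this (assuming model was fitted on the one-hot Y, as in the question):

from sklearn.preprocessing import LabelBinarizer

print(model.out_activation_)            # 'logistic' here; 'softmax' with 1-D labels
print(LabelBinarizer().fit(Y).y_type_)  # 'multilabel-indicator' here; 'multiclass' with 1-D labels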

What you get at the moment are the independent logistic (sigmoid) outputs of the individual classes, which have no reason to sum to 1.
