SKlearn classifier's predict_proba doesn't sum to 1
Question
I have a classifier (in this case sklearn.neural_network.MLPClassifier) with which I'm trying to perform classification into one of 18 classes.
The task is thus multi-class, not multi-label: I'm trying to predict only a single class.
I have my training data X with X.shape = (103393, 300) and targets Y with Y.shape = (103393, 18), where each row of Y is a one-hot encoded vector denoting the target class.
> EDIT in response to @Dr. Snoopy: I do not supply any class labels -- I simply pass the 18-dimensional vector in which the correct class's index holds a 1 and all other entries are 0 (a one-hot encoded vector).
> To prove that the vectors are correctly one-hot encoded, I can run

```python
import pandas as pd
pd.DataFrame(Y.sum(axis=1)).value_counts()
```

> This returns 103393 counts of 1. The vectors are correctly one-hot encoded, even upon examination.
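Equivalently, here is a quick numpy sketch of the same check (assuming the Y matrix above):

```python
import numpy as np

# Every entry should be 0 or 1, and every row should contain exactly one 1
assert set(np.unique(Y)) == {0, 1}
assert (Y.sum(axis=1) == 1).all()
```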
When I fit the model and return the class probabilities for all classes, the resulting probability vector does not sum to 1. Why might that be?
Here is an example of how I run the fitting:

```python
from sklearn.neural_network import MLPClassifier

X_train, Y_train, X_test, Y_test = get_data()
model = MLPClassifier(max_iter=10000)
model.fit(X_train, Y_train)
probability_vector = model.predict_proba(X_test[0, :].reshape(1, -1))
```
Some of the time, the outputs sum to something pretty close to 1, and I suspect that small error is just rounding. In other cases, though, the outputs sum to ~0.5 or less. Example output:
```python
probability_vector = list(model.predict_proba(X_test[301, :].reshape(1, -1))[0])
print(probability_vector)
>>> [1.7591416e-06,
 3.148203e-05,
 3.9732524e-05,
 0.3810972,
 0.059248358,
 0.00032832936,
 8.5996935e-06,
 9.0914684e-05,
 9.377927e-07,
 0.0007674346,
 1.5543707e-06,
 0.0008467222,
 0.009655427,
 2.5728454e-05,
 1.07812774e-07,
 0.00022920035,
 0.00050288404,
 0.013878004]
len(probability_vector)
>>> 18
sum(probability_vector)
>>> 0.46675437349917814
```
Why might this be happening? Is my model initialized incorrectly?
> Note: a couple of possible reasons for the error, and my comments on them:
>
> - Class imbalance: the classes in the dataset are indeed imbalanced. However, the non-1 summation problem happens for well-represented classes too, not just the underrepresented ones. Could this be a consequence of a model that is not expressive enough?
> - Model uncertainty: "The model may not have a high level of confidence in its predictions for every input." Is that all it is?
Answer 1
Score: 4
Do not one-hot encode your Y. MLPClassifier will do that for you internally using LabelBinarizer, and will then apply the softmax function correctly.
With a multi-column Y it will instead perform multi-label classification.
Some explanation is given in the docs.
You can check this, for example, by accessing model.out_activation_ or LabelBinarizer().fit(Y).y_type_. For a proper multi-class setup these should be softmax and multiclass respectively, but here they will be logistic and multilabel-indicator.
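A minimal sketch of that check, assuming the Y_train one-hot matrix and the fitted model from the question:

```python
from sklearn.preprocessing import LabelBinarizer

# A (n_samples, 18) indicator matrix is inferred as a multi-label target
print(LabelBinarizer().fit(Y_train).y_type_)  # 'multilabel-indicator'

# ...so the fitted network uses per-class sigmoids at the output layer
print(model.out_activation_)                  # 'logistic'
```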
What you get at the moment are the independent logistic (sigmoid) outputs of the individual classes, which are under no constraint to sum to 1.
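A minimal sketch of the fix, assuming the get_data() split from the question: collapse each one-hot row back to an integer label with argmax before fitting.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train, Y_train, X_test, Y_test = get_data()

# Collapse the (n_samples, 18) one-hot matrix to integer class labels
y_train = np.argmax(Y_train, axis=1)

model = MLPClassifier(max_iter=10000)
model.fit(X_train, y_train)

print(model.out_activation_)  # now 'softmax'
proba = model.predict_proba(X_test[0, :].reshape(1, -1))
print(proba.sum())            # ~1.0, up to floating-point rounding
```

With integer labels the output layer is a softmax, so each row of predict_proba sums to 1 by construction.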