Random forests in TensorFlow and Keras: why are categorical variables treated differently when there are just two values?
Question
I would like to make predictions for some categorical variables using random forests in TensorFlow / Keras. I would expect the output to be a vector of probabilities, and that is the case if there are at least three possible output values. Surprisingly, the answer seems to be a single number if the set of possible values consists of just two elements.
My code would be simpler if I did not have to treat such special cases separately, so here is my question: why does TensorFlow treat this case in a special way, and is there some way to handle both cases uniformly?
Below you can find a minimal example.
import tensorflow as tf
import keras
import tensorflow_decision_forests as tfdf
import pandas as pd


def train_and_predict(multi, values):
    # Training set: the feature "key" is identical to the label "value",
    # and every value is repeated `multi` times.
    train_pd = pd.DataFrame([{"key": v, "value": v} for _ in range(multi) for v in values])
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")
    rf = tfdf.keras.RandomForestModel()
    rf.fit(x=train_tf)
    # Predict once for each distinct value.
    to_guess = pd.DataFrame([{"key": v} for v in values])
    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(to_guess)
    return rf.predict(guess_tf)


print(train_and_predict(500, [3, 5]))
print(train_and_predict(500, ["a", "b"]))
print(train_and_predict(500, ["a", "b", "c"]))
For the input
print(train_and_predict(500, [3, 5]))
we get, as expected, a vector of probabilities:
[[0. 0. 0. 0.99999917 0. 0. ]
[0. 0. 0. 0. 0. 0.99999917]]
Unfortunately, for categorical variables with just two values
print(train_and_predict(500, ["a", "b"]))
we get a single number per row as the answer:
[[0. ]
[0.99999917]]
while if there are at least three possible values:
print(train_and_predict(500, ["a", "b", "c"]))
we get a nice list of probabilities:
[[0.99999917 0. 0. ]
[0. 0.99999917 0. ]
[0. 0. 0.99999917]]
I use TensorFlow and Keras version '2.11.0', on Kaggle.
Answer 1
Score: 2
Short answer: If your classification problem (with string labels) has only two label values (i.e. binary classification), TF-DF outputs only the probability p of the positive label, i.e. the one that comes later in lexicographical order. The probability of the other label is 1-p.
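To handle both cases uniformly, one option is to post-process the prediction array yourself. The helper below is only a sketch: `to_probability_matrix` is a hypothetical name, not part of TF-DF, and it assumes the predictions arrive as a NumPy array of shape (rows, 1) in the binary case and (rows, n) otherwise, as in the outputs shown in the question.

```python
import numpy as np


def to_probability_matrix(pred):
    """Return one probability column per class, also for binary outputs.

    Hypothetical helper (not a TF-DF function). A single-column input is
    interpreted as the probability p of the positive class and expanded
    to the two columns [1 - p, p]; wider inputs are returned unchanged.
    """
    pred = np.asarray(pred, dtype=float)
    if pred.ndim == 1:
        # Accept a flat vector of positive-class probabilities as well.
        pred = pred.reshape(-1, 1)
    if pred.shape[1] == 1:
        return np.hstack([1.0 - pred, pred])
    return pred


binary = np.array([[0.0], [0.99999917]])
print(to_probability_matrix(binary))
```

With this, `to_probability_matrix(rf.predict(guess_tf))` should give a two-column matrix for the ["a", "b"] case and leave the three-class output as it is.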
Details:
String labels: Keras does not support string labels natively; for Keras, labels have to be (non-negative) integers. Since TF-DF is used through the Keras API, the function tfdf.keras.pd_dataframe_to_tf_dataset() converts the strings in the label column to integers by sorting them and assigning the labels 0, 1, ..., n-1, where n is the number of unique values in the label column.
For n=2, the problem is recognized as a binary classification problem and TF-DF outputs only the probability of the positive class (the one mapped to 1), that is, the second string according to Python's sorting of the string values.
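The mapping described above can be illustrated in plain Python. This is only a sketch of the described behaviour, not TF-DF's actual implementation; the variable names are made up for the example.

```python
# Label column as it might appear in the training DataFrame.
labels = ["b", "a", "b", "a"]

# Sort the unique strings and number them 0, 1, ..., n-1.
classes = sorted(set(labels))
label_to_index = {name: index for index, name in enumerate(classes)}

print(label_to_index)  # {'a': 0, 'b': 1}

# With n == 2, the class mapped to 1 is the "positive" class whose
# probability is returned.
positive_class = classes[1]
print(positive_class)  # b
```

This matches the question's output for ["a", "b"]: the single column is the probability of "b", so the row for key "a" is 0 and the row for key "b" is close to 1.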
Integer labels: If your labels are already integers, tfdf.keras.pd_dataframe_to_tf_dataset() does not modify them. TF-DF also recognizes integer labels as "already integerized" and does not apply any mapping, which saves space and complexity. Instead, it assumes that the possible labels are 0, 1, ..., max_label, where max_label is the value of the largest label in the label column. The output vector therefore has max_label+1 dimensions, which is why the labels [3, 5] produce six columns. If your labels are [0, 1], you will also see that only the probability of the label 1 is returned.
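As a quick check of the rule above, the expected number of output columns for integer labels can be computed as follows. `expected_output_width` is a hypothetical helper written for this explanation, not a TF-DF function, and it only restates the behaviour described in this answer.

```python
def expected_output_width(int_labels):
    """Number of prediction columns expected for integer labels.

    Sketch of the rule described above: the classes are assumed to be
    0, 1, ..., max_label, and the two-class case collapses to a single
    column holding the probability of label 1.
    """
    num_classes = max(int_labels) + 1
    return 1 if num_classes == 2 else num_classes


print(expected_output_width([3, 5]))  # 6
print(expected_output_width([0, 1]))  # 1
```

The six columns for [3, 5] agree with the first output in the question, where only columns 3 and 5 are non-zero.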
Full disclosure: I'm one of the authors of TensorFlow Decision Forests.