Random forests in TensorFlow and Keras: why are categorical variables treated differently if there are just two values?

Question

I would like to make predictions for some categorical variables using random forests in TensorFlow / Keras. I would expect the output to be a vector of probabilities, and that is indeed the case if there are at least three possible output values. Surprisingly, the answer seems to be a single number if the set of possible values consists of just two elements.

My code would be simpler if I did not have to treat this special case separately, so here is my question: why does TensorFlow treat it in a special way, and is there a way to handle both cases uniformly?

Below you can find a minimal example.

import tensorflow as tf
import keras
import tensorflow_decision_forests as tfdf
import pandas as pd


def train_and_predict(multi, values):
    # Training data: the label ("value") is just a copy of the feature ("key"),
    # repeated `multi` times so each class has plenty of examples.
    train_pd = pd.DataFrame([{"key": v, "value": v} for _ in range(multi) for v in values])
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")

    rf = tfdf.keras.RandomForestModel()
    rf.fit(x=train_tf)

    # Predict on one example per possible value.
    to_guess = pd.DataFrame([{"key": v} for v in values])
    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(to_guess)
    return rf.predict(guess_tf)


print(train_and_predict(500, [3, 5]))
print(train_and_predict(500, ["a", "b"]))
print(train_and_predict(500, ["a", "b", "c"]))

For the input

print(train_and_predict(500, [3, 5]))

we get, as expected, a vector of probabilities:

[[0.         0.         0.         0.99999917 0.         0.        ]
 [0.         0.         0.         0.         0.         0.99999917]]

Unfortunately, for categorical variables

print(train_and_predict(500, ["a", "b"]))

we get a single number as the answer:

[[0.        ]
 [0.99999917]]

while if there are at least three possible values:

print(train_and_predict(500, ["a", "b", "c"]))

we get a nice list of probabilities:

[[0.99999917 0.         0.        ]
 [0.         0.99999917 0.        ]
 [0.         0.         0.99999917]]

I use TensorFlow and Keras version '2.11.0' on Kaggle.

Answer 1

Score: 2


Short answer: If your classification problem (with string labels) has just two values in the label (i.e. binary classification), TF-DF outputs only the probability p of the positive label, i.e. the one with the larger lexicographical order. The probability of the other label can be computed as 1 - p.
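
If you want to handle both cases uniformly on the caller side, one option is to expand the single-column output back into one column per class. Below is a minimal sketch, assuming a trained model and dataset as in the question; the helper name predict_proba_uniform is hypothetical and not part of TF-DF:

import numpy as np

def predict_proba_uniform(model, dataset):
    # Hypothetical helper: return one column per class even in the binary case,
    # where TF-DF gives a single column with the probability p of the positive class.
    pred = model.predict(dataset)
    if pred.shape[1] == 1:
        # Column 0: negative class (1 - p), column 1: positive class (p).
        pred = np.hstack([1.0 - pred, pred])
    return pred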

Details:

String labels: Keras does not support string labels natively - for Keras, labels have to be (non-negative) integers. Since TF-DF is used through the Keras API, the function tfdf.keras.pd_dataframe_to_tf_dataset() converts the strings in the label column to integers by sorting them and assigning labels 0, 1, ..., n-1, where n is the number of unique values in the label column.
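
A conceptual sketch of that mapping (this is not the actual implementation of tfdf.keras.pd_dataframe_to_tf_dataset(), just the idea):

import pandas as pd

labels = pd.Series(["b", "a", "c", "a"])
# Sort the unique string values and assign consecutive integers starting at 0.
mapping = {v: i for i, v in enumerate(sorted(labels.unique()))}
print(mapping)                        # {'a': 0, 'b': 1, 'c': 2}
print(labels.map(mapping).tolist())   # [1, 0, 2, 0]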

For n=2, the problem is recognized as a binary classification problem and TF-DF outputs only the probability of the positive class (the one mapped to 1), that is, the second string according to Python's sorting of the string values.
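
Applied to the ["a", "b"] example from the question: "b" sorts after "a", so it becomes the positive class, and the single returned column is the probability of "b". As a quick check against the output shown above:

pred = train_and_predict(500, ["a", "b"])
# pred[0, 0] is near 0.0 for the "a" example (probability of "b" is near zero),
# pred[1, 0] is near 1.0 for the "b" example (probability of "b" is near one).
print(pred[:, 0])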

Integer labels: If your labels are already integers, tfdf.keras.pd_dataframe_to_tf_dataset() does not modify them. TF-DF also recognizes integer labels as "already integerized" and does not apply any mapping, to save space and complexity. Instead, it assumes that the possible labels are 0, 1, ..., max_label, where max_label is the largest value in the label column. The output vector therefore has max_label + 1 dimensions. If your labels are [0, 1], you will also see that only the probability of label 1 is returned.
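
That also explains the six columns in the first example of the question: with integer labels [3, 5], the output dimension is determined by the largest label, not by the number of distinct labels. A small sketch of the arithmetic:

import numpy as np

labels = np.array([3, 5])
# TF-DF assumes the possible integer labels are 0, 1, ..., max_label,
# so the prediction has max_label + 1 columns.
num_columns = labels.max() + 1
print(num_columns)   # 6 -- only columns 3 and 5 carry probability mass, the rest stay 0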

Full disclosure: I'm one of the authors of TensorFlow Decision Forests.
