2023-05-10 18:30:07

Random forests in TensorFlow and Keras: why are categorical variables treated differently when there are just two values?

Question

I would like to predict categorical variables using random forests in TensorFlow / Keras. I expect the output to be a vector of probabilities, and that is the case when there are at least three possible output values. Surprisingly, when the set of possible values has just two elements, the output seems to be a single number.

My code would be simpler if I did not have to treat this special case separately, so here is my question: why does TensorFlow treat this case specially, and is there a way to handle both cases uniformly?

Below is a minimal example.

import tensorflow as tf
import keras
import tensorflow_decision_forests as tfdf
import pandas as pd


def train_and_predict(multi, values):
    # Build a training set in which the feature "key" equals the label "value".
    train_pd = pd.DataFrame([{"key": v, "value": v} for _ in range(multi) for v in values])
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")

    rf = tfdf.keras.RandomForestModel()
    rf.fit(x=train_tf)

    # Predict once for each distinct value.
    to_guess = pd.DataFrame([{"key": v} for v in values])
    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(to_guess)
    return rf.predict(guess_tf)


print(train_and_predict(500, [3, 5]))
print(train_and_predict(500, ["a", "b"]))
print(train_and_predict(500, ["a", "b", "c"]))

For the input

print(train_and_predict(500, [3, 5]))

we get, as expected, a vector of probabilities:

[[0.         0.         0.         0.99999917 0.         0.        ]
 [0.         0.         0.         0.         0.         0.99999917]]

Unfortunately, for categorical variables

print(train_and_predict(500, ["a", "b"]))

we get a single number as the answer:

[[0.        ]
 [0.99999917]]

while if there are at least three possible values:

print(train_and_predict(500, ["a", "b", "c"]))

we get a nice list of probabilities:

[[0.99999917 0.         0.        ]
 [0.         0.99999917 0.        ]
 [0.         0.         0.99999917]]

I use TensorFlow and Keras version 2.11.0 on Kaggle.

Answer 1

Score: 2

Short answer: If your classification problem (with string labels) has just two label values (i.e. binary classification), TF-DF outputs only the probability p of the positive label, i.e. the one that sorts last lexicographically. The probability of the other label is 1-p.
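To handle both cases uniformly, one option is to expand the single-column output into two columns using the 1-p relation above. The following is a minimal sketch, not part of TF-DF: `as_probability_matrix` is a hypothetical helper, and it assumes the prediction is a NumPy array of shape (n, 1) in the binary case and (n, k) otherwise, with columns in sorted label order.

```python
import numpy as np


def as_probability_matrix(pred):
    """Return an (n, k) class-probability matrix for any number of classes.

    Assumption: `pred` has shape (n, 1) holding p = P(positive class) for a
    binary problem, or shape (n, k) for k >= 3 classes.
    """
    pred = np.asarray(pred, dtype=float)
    if pred.ndim == 2 and pred.shape[1] == 1:
        # Binary case: column 0 becomes 1 - p (negative class), column 1 stays p.
        return np.hstack([1.0 - pred, pred])
    # Multi-class case: already one column per class.
    return pred


# Values taken from the binary output shown in the question.
binary = np.array([[0.0], [0.99999917]])
print(as_probability_matrix(binary))
```

With this wrapper, downstream code can always index columns by class, whether the model was trained on two labels or more.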

Details:

String labels: Keras does not support string labels natively; for Keras, labels have to be non-negative integers. Since TF-DF is used through the Keras API, the function tfdf.keras.pd_dataframe_to_tf_dataset() converts the strings in the label column to integers by sorting them and assigning labels 0, 1, ..., n-1, where n is the number of unique values in the label column.

For n=2, the problem is recognized as a binary classification problem and TF-DF outputs only the probability of the positive class (the one mapped to 1), that is, the second string according to Python's sorting of the string values.
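The mapping described above can be illustrated in plain Python. This is only a sketch of the described behavior, not TF-DF's actual implementation; `label_mapping` is a hypothetical name.

```python
def label_mapping(labels):
    """Map each distinct string label to an integer by sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


mapping = label_mapping(["b", "a", "b", "a"])
print(mapping)  # {'a': 0, 'b': 1}

# In the binary case the positive class is the label mapped to 1.
positive = max(mapping, key=mapping.get)
print(positive)  # b
```

For the labels in the question, "b" is therefore the positive class, which is why the single output column is close to 1 for the row whose key is "b".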

Integer labels: If your labels are already integers, tfdf.keras.pd_dataframe_to_tf_dataset() does not modify them. TF-DF also recognizes integer labels as "already integerized" and, to save space and complexity, does not apply any mapping. Instead, it assumes that the possible labels are 0, 1, ..., max_label, where max_label is the value of the largest label in the label column. The output vector therefore has max_label+1 dimensions. If your labels are [0, 1], you will likewise see that only the probability of label 1 is returned.
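For sparse integer labels such as [3, 5], the unused columns can be dropped after prediction by keeping only the columns of labels that occur in the training data. A minimal sketch, assuming the prediction has max_label+1 columns as described above (so not the [0, 1] case, which returns a single column); `select_observed_columns` is a hypothetical helper.

```python
import numpy as np


def select_observed_columns(pred, observed_labels):
    """Keep only the probability columns of labels seen in training.

    Assumption: `pred` has shape (n, max_label + 1), one column per integer
    label from 0 to max_label.
    """
    classes = sorted(set(observed_labels))
    return classes, np.asarray(pred, dtype=float)[:, classes]


# Values taken from the [3, 5] output shown in the question.
pred = np.array([
    [0.0, 0.0, 0.0, 0.99999917, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.99999917],
])
classes, probs = select_observed_columns(pred, [3, 5])
print(classes)      # [3, 5]
print(probs.shape)  # (2, 2)
```

An alternative is to remap the integer labels to 0, 1, ..., n-1 before training, which avoids the empty columns but makes a two-label problem binary and therefore single-column.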

Full disclosure: I'm one of the authors of TensorFlow Decision Forests.
