English:
How to obtain the interval limits from a decision tree with scikit-learn?
Question
You can obtain an array of the split points you need as follows:

```python
import numpy as np

# Text description of the tree's node splits
tree_structure = "|--- feature_0 <= 0.08\n| |--- class: 0\n|--- feature_0 > 0.08\n| |--- feature_0 <= 8.50\n| | |--- feature_0 <= 1.50\n| | | |--- class: 1\n| | |--- feature_0 > 1.50\n| | | |--- class: 1\n| |--- feature_0 > 8.50\n| | |--- feature_0 <= 60.25\n| | | |--- class: 0\n| | |--- feature_0 > 60.25\n| | | |--- class: 0"

# Extract the split limits from the tree structure. Only "<=" lines are
# parsed, so each threshold is collected exactly once (every split also
# produces a matching ">" line with the same value).
limits = set()
for line in tree_structure.split("\n"):
    if "<=" in line:
        limits.add(float(line.split()[-1]))
limits = sorted(limits)

# Add negative and positive infinity as the first and last elements
limits = [-np.inf] + limits + [np.inf]

print(limits)
# [-inf, 0.08, 1.5, 8.5, 60.25, inf]
```

This code extracts the split-point limits from the text description of your decision tree and adds negative and positive infinity as the first and last elements, giving the array you need.
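The same extraction can be done a little more robustly with a regular expression, which avoids depending on the exact whitespace layout of the `export_text` output (a sketch; it still assumes the default `feature_0 <= value` wording):

```python
import re
import numpy as np

# Same tree description as above
tree_structure = "|--- feature_0 <= 0.08\n| |--- class: 0\n|--- feature_0 > 0.08\n| |--- feature_0 <= 8.50\n| | |--- feature_0 <= 1.50\n| | | |--- class: 1\n| | |--- feature_0 > 1.50\n| | | |--- class: 1\n| |--- feature_0 > 8.50\n| | |--- feature_0 <= 60.25\n| | | |--- class: 0\n| | |--- feature_0 > 60.25\n| | | |--- class: 0"

# Capture the number after each "<=" operator; a set removes duplicates
limits = sorted({float(m) for m in re.findall(r"<=\s*([0-9.]+)", tree_structure)})
limits = [-np.inf] + limits + [np.inf]

print(limits)
# [-inf, 0.08, 1.5, 8.5, 60.25, inf]
```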
English:
Say I am using the titanic dataset, with the variable age only:
```python
import numpy as np
import pandas as pd

data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')[["age", "survived"]]
data = data.replace('?', np.nan)
data = data.fillna(0)
print(data)
```
the result:
```
         age  survived
0         29         1
1     0.9167         1
2          2         0
3         30         0
4         25         0
...      ...       ...
1304    14.5         0
1305       0         0
1306    26.5         0
1307      27         0
1308      29         0

[1309 rows x 2 columns]
```
Now I train a decision tree to predict survival from age:
```python
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(max_depth=3)
tree_model.fit(data['age'].to_frame(), data["survived"])
```
And if I print the structure of the tree:
```python
from sklearn import tree

print(tree.export_text(tree_model))
```
I obtain:
```
|--- feature_0 <= 0.08
|   |--- class: 0
|--- feature_0 >  0.08
|   |--- feature_0 <= 8.50
|   |   |--- feature_0 <= 1.50
|   |   |   |--- class: 1
|   |   |--- feature_0 >  1.50
|   |   |   |--- class: 1
|   |--- feature_0 >  8.50
|   |   |--- feature_0 <= 60.25
|   |   |   |--- class: 0
|   |   |--- feature_0 >  60.25
|   |   |   |--- class: 0
```
This means that the final division for every node is:
0-0.08 ; 0.08-1.50; 1.50-8.50 ; 8.50-60; >60
My question is, how can I capture those limits in an array that looks like this:
[-np.inf, 0.08, 1.5, 8.5, 60, np.inf]
Thank you!
Answer 1
Score: 3
The decision tree classifier, in this case `tree_model`, has an attribute called `tree_` that gives access to the low-level tree structure.

```python
print(tree_model.tree_.threshold)
# array([ 0.08335, -2.     ,  8.5    ,  1.5    , -2.     , -2.     ,
#        60.25   , -2.     , -2.     ])

print(tree_model.tree_.feature)
# array([ 0, -2,  0,  0, -2, -2,  0, -2, -2], dtype=int64)
```

The `feature` and `threshold` arrays only apply to split nodes, so the values stored for leaf nodes in these arrays are arbitrary (both use the sentinel -2).

To get the split thresholds of a feature, filter the `threshold` array using the `feature` array:

```python
threshold = tree_model.tree_.threshold
feature = tree_model.tree_.feature
feature_threshold = threshold[feature == 0]

thresholds = sorted(feature_threshold)
print(thresholds)
# [0.08335000276565552, 1.5, 8.5, 60.25]
```

To have `np.inf`, you need to add it yourself:

```python
thresholds = [-np.inf] + thresholds + [np.inf]
print(thresholds)
# [-inf, 0.08335000276565552, 1.5, 8.5, 60.25, inf]
```

Reference: Understanding the decision tree structure.
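For trees fitted on more than one feature, the same filtering generalizes: genuine split nodes can be identified by `children_left != -1` (leaves store the sentinel -1 in `children_left`/`children_right`), and the thresholds can then be selected per feature. A minimal sketch, where the helper name `split_points` and the iris example are my own:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def split_points(model, feature_index):
    """Return the sorted thresholds a fitted tree uses for one feature,
    bracketed by -inf and +inf."""
    tree = model.tree_
    # Leaves store -1 in children_left; keep only genuine split nodes
    is_split = tree.children_left != -1
    mask = is_split & (tree.feature == feature_index)
    return [-np.inf] + sorted(set(tree.threshold[mask])) + [np.inf]

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

pts = split_points(model, 3)  # thresholds used for the 4th feature
print(pts)
```

The explicit split-node mask matters on multi-feature trees: a leaf's `feature` entry is -2, so filtering on `feature == feature_index` alone already excludes leaves, but masking on `children_left` makes the intent clear and is robust if a sentinel ever matched a real index.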
Comments