2023-02-06 15:29:41


How to get original string values of encoded categorical columns in Lime graph

Question

I am trying to work on local explainability using Lime graph. Before building the model, I encode some of the categorical variables.

Sample Data and code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.DataFrame({'customer_id': np.arange(1, 21),
                   'gender': np.random.choice(['male', 'female'], 20),
                   'age': np.random.randint(19, 50, 20),
                   'salary': np.random.randint(20000, 95000, 20),
                   'purchased': np.random.choice([0, 1], 20, p=[.8, .2])})

Preprocessing:

df['gender'] = df['gender'].map({'female': 0, 'male': 1})

df['age'] = df['age'].map(lambda x: 'young' if x <= 35 else 'middle aged')

df['age'] = df['age'].map({'young': 0, 'middle aged': 1})

bins = [0, df['salary'].quantile(q=.33), df['salary'].quantile(q=.66), df['salary'].quantile(q=1) + 1]
labels = ['low salary', 'medium salary', 'high salary']
df['salary'] = pd.cut(df['salary'], bins=bins, labels=labels)

from sklearn import preprocessing
l_encoder = {}
label_encoder = preprocessing.LabelEncoder()
df['salary'] = label_encoder.fit_transform(df['salary'])
df

    customer_id  gender  age  salary  purchased
0             1       0    0       1          0
1             2       0    0       0          0
2             3       0    1       2          0
3             4       1    0       0          0
4             5       1    1       2          0
5             6       0    1       1          0
6             7       1    0       2          0
7             8       1    1       0          0
8             9       1    1       1          0
9            10       1    0       0          0
10           11       0    1       0          0
11           12       0    0       1          0
12           13       1    1       1          0
13           14       1    1       1          0
14           15       1    1       2          1
15           16       1    1       0          0
16           17       1    1       1          0
17           18       0    0       0          0
18           19       0    0       2          0
19           20       0    0       2          0


# input
x = df.iloc[:, :-1]
  
# output
y = df.iloc[:, 4]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)

Separating the customer_id column:

X_train_cust = X_train.pop('customer_id')
X_test_cust = X_test.pop('customer_id')

Fitting a logistic regression model:

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

Building a lime chart:

import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                feature_names=X_train.columns,
                                verbose=True, mode='classification')

exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)

exp.as_pyplot_figure()

The LIME chart displays the encoded values of the features/columns, but I need the original values. For example, if the chart says 0 for gender, I need it to display female.
Could someone please let me know how to fix this?

Answer 1

Score: 1

You can use:

# Your direct mapping dictionary (original value -> encoded value).
# Note: LabelEncoder assigns codes in sorted (alphabetical) order of the labels,
# so salary is encoded as high=0, low=1, medium=2.
dmap = {'gender': {'female': 0, 'male': 1},
        'age': {'young': 0, 'middle aged': 1},
        'salary': {'high salary': 0, 'low salary': 1, 'medium salary': 2}}

# Reverse mapping dictionary (not used here)
rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}

# Categorical names, col0->gender, col1->age, col2->salary
# (the keys of each inner dict must be listed in code order: 0, 1, 2, ...)
cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}


# Now use
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                feature_names=X_train.columns,
                                categorical_features=[0, 1, 2],  # <- first 3 columns
                                categorical_names=cmap,  # <- int to string
                                verbose=True, mode='classification')

exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)

exp.as_pyplot_figure()
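
One thing to watch with this approach: the list given for each column in categorical_names is indexed by the integer code, so the label at position i must be the label that was encoded as i. scikit-learn's LabelEncoder assigns codes in sorted order of the class labels, which may differ from the order in which the labels were written down. The following is a minimal sketch in plain Python, assuming a hypothetical helper named build_categorical_names (it is not part of LIME), that orders labels by code and rejects mappings with gaps:

```python
from typing import Dict, List, Sequence


def build_categorical_names(
    feature_names: Sequence[str],
    value_to_code: Dict[str, Dict[str, int]],
) -> Dict[int, List[str]]:
    """Return {column index: [label for code 0, label for code 1, ...]}.

    Hypothetical helper, not part of LIME. It only prepares the dict that
    LimeTabularExplainer accepts as `categorical_names`.
    """
    names: Dict[int, List[str]] = {}
    for col_idx, col in enumerate(feature_names):
        if col not in value_to_code:
            continue  # not a categorical column
        mapping = value_to_code[col]
        codes = sorted(mapping.values())
        if codes != list(range(len(codes))):
            raise ValueError(f"codes for {col!r} must be 0..n-1, got {codes}")
        # Order the labels by their integer code, not by dict insertion order.
        ordered = sorted(mapping.items(), key=lambda item: item[1])
        names[col_idx] = [label for label, _ in ordered]
    return names


# LabelEncoder sorts its classes, so the codes follow alphabetical order here.
salary_classes = sorted(["low salary", "medium salary", "high salary"])

dmap = {
    "gender": {"female": 0, "male": 1},
    "age": {"young": 0, "middle aged": 1},
    "salary": {label: code for code, label in enumerate(salary_classes)},
}

cmap = build_categorical_names(["gender", "age", "salary"], dmap)
print(cmap)
```

The resulting cmap can then be passed as categorical_names=cmap, with categorical_features=list(cmap) supplying the matching column indices.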

Read this tutorial
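
The reverse mapping rmap from the answer is not needed by LIME itself, but it can be handy for printing the explained row in readable form next to the chart. A small sketch, assuming a hypothetical helper decode_row and the sorted code order that LabelEncoder uses for salary:

```python
from typing import Dict, Mapping, Union

# Same shape as the answer's `dmap`. The salary codes below assume the sorted
# label order that scikit-learn's LabelEncoder uses (high=0, low=1, medium=2).
dmap = {
    "gender": {"female": 0, "male": 1},
    "age": {"young": 0, "middle aged": 1},
    "salary": {"high salary": 0, "low salary": 1, "medium salary": 2},
}

# Reverse mapping: column -> {code: original label}
rmap = {col: {code: label for label, code in m.items()} for col, m in dmap.items()}


def decode_row(
    row: Mapping[str, Union[int, float]],
    reverse_map: Mapping[str, Mapping[int, str]],
) -> Dict[str, object]:
    """Replace integer codes with their original labels (hypothetical helper)."""
    decoded: Dict[str, object] = {}
    for col, value in row.items():
        if col in reverse_map:
            decoded[col] = reverse_map[col][int(value)]
        else:
            decoded[col] = value  # leave non-categorical columns untouched
    return decoded


encoded = {"gender": 0, "age": 1, "salary": 2}
decoded = decode_row(encoded, rmap)
print(decoded)
```

With the data from the question, decode_row(X_test.iloc[0].to_dict(), rmap) would give the readable form of the row being explained.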
