How to get original string values of encoded categorical columns in Lime graph

Question

I am working on local explainability using a Lime graph. Before building the model, I encode some of the categorical variables.

Sample data and code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.DataFrame({'customer_id': np.arange(1, 21),
                   'gender': np.random.choice(['male', 'female'], 20),
                   'age': np.random.randint(19, 50, 20),
                   'salary': np.random.randint(20000, 95000, 20),
                   'purchased': np.random.choice([0, 1], 20, p=[.8, .2])})

Preprocessing:

df['gender'] = df['gender'].map({'female': 0, 'male': 1})

df['age'] = df['age'].map(lambda x: 'young' if x <= 35 else 'middle aged')

df['age'] = df['age'].map({'young': 0, 'middle aged': 1})

bins = [0, df['salary'].quantile(q=.33), df['salary'].quantile(q=.66), df['salary'].quantile(q=1) + 1]
labels = ['low salary', 'medium salary', 'high salary']
df['salary'] = pd.cut(df['salary'], bins=bins, labels=labels)

from sklearn import preprocessing
l_encoder = {}
label_encoder = preprocessing.LabelEncoder()
df['salary'] = label_encoder.fit_transform(df['salary'])
df

    customer_id  gender  age  salary  purchased
0             1       0    0       1          0
1             2       0    0       0          0
2             3       0    1       2          0
3             4       1    0       0          0
4             5       1    1       2          0
5             6       0    1       1          0
6             7       1    0       2          0
7             8       1    1       0          0
8             9       1    1       1          0
9            10       1    0       0          0
10           11       0    1       0          0
11           12       0    0       1          0
12           13       1    1       1          0
13           14       1    1       1          0
14           15       1    1       2          1
15           16       1    1       0          0
16           17       1    1       1          0
17           18       0    0       0          0
18           19       0    0       2          0
19           20       0    0       2          0


# input
x = df.iloc[:, :-1]
  
# output
y = df.iloc[:, 4]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)

Separating the customer_id column:

X_train_cust = X_train.pop('customer_id')
X_test_cust = X_test.pop('customer_id')

Fitting a logistic regression model:

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

Building a lime chart:

import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                feature_names=X_train.columns,
                                verbose=True, mode='classification')

exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)

exp.as_pyplot_figure()

The Lime chart displays the encoded feature values, but I need the original values. For example, where the chart shows 0 for gender, I need it to display female.
Could someone please let me know how to fix this?

Answer 1

Score: 1

You can use:

# Your direct mapping dictionary (string -> integer code).
# LabelEncoder assigns codes in sorted (alphabetical) order, so the salary
# mapping is read from label_encoder.classes_ rather than hard-coded.
dmap = {'gender': {'female': 0, 'male': 1},
        'age': {'young': 0, 'middle aged': 1},
        'salary': {name: code for code, name in enumerate(label_encoder.classes_)}}

# Reverse mapping dictionary (not used here)
rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}

# Categorical names, col0->gender, col1->age, col2->salary
# (each list must be ordered by integer code, as the dicts above are)
cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}

# Now use
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                feature_names=X_train.columns,
                                categorical_features=[0, 1, 2],  # <- 3 first columns
                                categorical_names=cmap,  # <- int to string
                                verbose=True, mode='classification')

# LIME expects the row to explain as a 1-D NumPy array
exp = explainer.explain_instance(X_test.iloc[0].to_numpy(), classifier.predict_proba)

exp.as_pyplot_figure()
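
One detail worth checking is that every list passed as categorical_names is ordered by integer code, because LIME looks the label up by position. The sketch below is plain Python and does not call LIME or scikit-learn. The helpers names_by_code and build_categorical_names are hypothetical names introduced here for illustration, and the salary codes are emulated with sorted() on the assumption that LabelEncoder numbers its classes in sorted order.

```python
def names_by_code(forward_map):
    """Return the labels of one column as a list indexed by integer code."""
    codes = sorted(forward_map.values())
    # Labels are looked up by position, so codes must be exactly 0..n-1.
    if codes != list(range(len(codes))):
        raise ValueError("codes must be 0..n-1 without gaps")
    reverse = {code: label for label, code in forward_map.items()}
    return [reverse[code] for code in codes]


def build_categorical_names(feature_names, forward_maps):
    """Map column index -> list of labels, for columns that have a mapping."""
    return {i: names_by_code(forward_maps[col])
            for i, col in enumerate(feature_names) if col in forward_maps}


# Stand-in for LabelEncoder (assumption): codes follow the sorted label order.
salary_labels = ['low salary', 'medium salary', 'high salary']
salary_map = {label: code for code, label in enumerate(sorted(salary_labels))}

forward_maps = {'gender': {'male': 1, 'female': 0},  # insertion order differs from code order
                'age': {'young': 0, 'middle aged': 1},
                'salary': salary_map}

categorical_names = build_categorical_names(['gender', 'age', 'salary'], forward_maps)
print(categorical_names)
```

Sorting on the code instead of relying on dictionary insertion order keeps the labels aligned even when a mapping is written in a different order, as the gender entry is here. With the real encoder, comparing against label_encoder.classes_ is a quick way to confirm the salary order.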

Read this tutorial
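
LIME itself does not need the reverse mapping, but it can be useful for labelling the explained row elsewhere, for example in a plot title or a log line. The following is a minimal sketch in plain Python. The helper decode_row and the sample row are made up for illustration, and the salary codes again assume the sorted ordering that LabelEncoder applies.

```python
# Forward mappings (label -> code); salary codes assume sorted label order.
forward_maps = {'gender': {'female': 0, 'male': 1},
                'age': {'young': 0, 'middle aged': 1},
                'salary': {'high salary': 0, 'low salary': 1, 'medium salary': 2}}

# Reverse mappings (code -> label), one dictionary per column.
reverse_maps = {col: {code: label for label, code in mapping.items()}
                for col, mapping in forward_maps.items()}


def decode_row(row, reverse_maps):
    """Replace integer codes with labels; columns without a mapping pass through."""
    return {col: reverse_maps.get(col, {}).get(value, value)
            for col, value in row.items()}


encoded_row = {'gender': 0, 'age': 1, 'salary': 2}  # hypothetical encoded row
decoded = decode_row(encoded_row, reverse_maps)
print(decoded)
```

With the DataFrame from the question, a row could be passed in as X_test.iloc[0].to_dict().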

huangapple
  • Published on February 6, 2023, 15:29:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75358443.html