英文:
How to get original string values of encoded categorical columns in Lime graph
问题
我正在尝试使用 Lime 图来进行本地可解释性工作。在构建模型之前,我对一些分类变量进行了编码。
示例数据和代码:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
df = pd.DataFrame({'customer_id': np.arange(1, 21),
'gender': np.random.choice(['male', 'female'], 20),
'age': np.random.randint(19, 50, 20),
'salary': np.random.randint(20000, 95000, 20),
'purchased': np.random.choice([0, 1], 20, p=[.8, .2])})
预处理:
df['gender'] = df['gender'].map({'female': 0, 'male': 1})
df['age'] = df['age'].map(lambda x: 'young' if x <= 35 else 'middle aged')
df['age'] = df['age'].map({'young': 0, 'middle aged': 1})
bins = [0, df['salary'].quantile(q=.33), df['salary'].quantile(q=.66), df['salary'].quantile(q=1) + 1]
labels = ['low salary', 'medium salary', 'high salary']
df['salary'] = pd.cut(df['salary'], bins=bins, labels=labels)
from sklearn import preprocessing
l_encoder = {}
label_encoder = preprocessing.LabelEncoder()
df['salary'] = label_encoder.fit_transform(df['salary'])
接下来是从数据中提取输入和输出,以及将customer_id
列分离出来:
# 输入
x = df.iloc[:, :-1]
# 输出
y = df.iloc[:, 4]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)
X_train_cust = X_train.pop('customer_id')
X_test_cust = X_test.pop('customer_id')
然后,拟合逻辑回归模型:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
最后,构建 Lime 图表并获取原始值:
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
feature_names=X_train.columns,
verbose=True, mode='classification')
exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
exp.as_pyplot_figure()
上述代码中的Lime图表显示了编码后的特征/列值。如果您需要显示原始值,例如,如果Lime图表显示为0,您需要将其显示为female
。要实现这一点,您可以创建一个映射,将编码值映射回原始值。这样,您可以根据需要显示原始值。
英文:
I am trying to work on local explainability using Lime graph. Before building the model, I encode some of the categorical variables.
Sample Data and code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
df = pd.DataFrame({'customer_id' : np.arange(1,21),
'gender' : np.random.choice(['male','female'], 20),
'age' : np.random.randint(19,50, 20),
'salary' : np.random.randint(20000,95000, 20),
'purchased' : np.random.choice([0,1], 20, p = [.8,.2])})
Preprocessing:
df['gender'] = df['gender'].map({'female' : 0, 'male' : 1})
df['age'] = df['age'].map(lambda x : 'young' if x<=35 else 'middle aged')
df['age'] = df['age'].map({'young' : 0, 'middle aged' : 1})
bins = [0, df['salary'].quantile(q=.33),df['salary'].quantile(q=.66),df['salary'].quantile(q=1)+1]
labels = ['low salary', 'medium salary', 'high salary']
df['salary'] = pd.cut(df['salary'], bins = bins, labels=labels)
from sklearn import preprocessing
l_encoder={}
label_encoder = preprocessing.LabelEncoder()
df['salary']= label_encoder.fit_transform(df['salary'])
df
customer_id gender age salary purchased
0 1 0 0 1 0
1 2 0 0 0 0
2 3 0 1 2 0
3 4 1 0 0 0
4 5 1 1 2 0
5 6 0 1 1 0
6 7 1 0 2 0
7 8 1 1 0 0
8 9 1 1 1 0
9 10 1 0 0 0
10 11 0 1 0 0
11 12 0 0 1 0
12 13 1 1 1 0
13 14 1 1 1 0
14 15 1 1 2 1
15 16 1 1 0 0
16 17 1 1 1 0
17 18 0 0 0 0
18 19 0 0 2 0
19 20 0 0 2 0
# input
x = df.iloc[:, :-1]
# output
y = df.iloc[:, 4]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
Separating the customer_id
column:
X_train_cust = X_train.pop('customer_id')
X_test_cust = X_test.pop('customer_id')
Fitting a logistic regression model:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Building a lime
chart:
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
feature_names=X_train.columns,
verbose=True, mode = 'classification')
exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
exp.as_pyplot_figure()
The lime chart displays the encoded features/columns values. But I need the original value. For example, if the lime chart says 0, I need to display it as female
.
Could someone please let me know how fix it.
答案1
得分: 1
你可以使用以下代码:
# 直接映射字典
dmap = {'gender': {'female': 0, 'male': 1},
'age': {'young': 0, 'middle aged': 1},
'salary': {'low salary': 0, 'medium salary': 1, 'high salary': 2}}
# 反向映射字典(在此未使用)
rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}
# 分类名称,col0->gender,col1->age,col2->salary
cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}
# 现在使用
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
feature_names=X_train.columns,
categorical_features=[0, 1, 2], # <- 前3列
categorical_names=cmap, # <- 整数到字符串
verbose=True, mode='classification')
exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
exp.as_pyplot_figure()
阅读此教程
英文:
You can use:
# Your direct mapping dictionary
dmap = {'gender': {'female' : 0, 'male' : 1},
'age': {'young' : 0, 'middle aged' : 1},
'salary': {'low salary': 0, 'medium salary': 1, 'high salary': 2}}
# Reverse mapping dictionary (not used hear)
rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}
# Categorical names, col0->gender, col1->age, col2->salary
cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}
# Now use
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
feature_names=X_train.columns,
categorical_features=[0, 1, 2], # <- 3 first columns
categorical_names=cmap, # <- int to string
verbose=True, mode = 'classification')
exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
exp.as_pyplot_figure()
Reat this tutorial
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论