How to get original string values of encoded categorical columns in Lime graph


Question


I am trying to work on local explainability using a Lime graph. Before building the model, I encode some of the categorical variables.

Sample Data and code:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    df = pd.DataFrame({'customer_id': np.arange(1, 21),
                       'gender': np.random.choice(['male', 'female'], 20),
                       'age': np.random.randint(19, 50, 20),
                       'salary': np.random.randint(20000, 95000, 20),
                       'purchased': np.random.choice([0, 1], 20, p=[.8, .2])})
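Note that no random seed is set, so the exact values (and the dataframe printed under Preprocessing below) will differ between runs. If you want the example to be reproducible, one option, not part of the original post, is to seed NumPy before building the dataframe; the seed value here is arbitrary:

    np.random.seed(0)  # any fixed seed makes the randomly generated columns reproducible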

Preprocessing:

    df['gender'] = df['gender'].map({'female': 0, 'male': 1})
    df['age'] = df['age'].map(lambda x: 'young' if x <= 35 else 'middle aged')
    df['age'] = df['age'].map({'young': 0, 'middle aged': 1})

    bins = [0, df['salary'].quantile(q=.33), df['salary'].quantile(q=.66), df['salary'].quantile(q=1) + 1]
    labels = ['low salary', 'medium salary', 'high salary']
    df['salary'] = pd.cut(df['salary'], bins=bins, labels=labels)

    from sklearn import preprocessing
    l_encoder = {}
    label_encoder = preprocessing.LabelEncoder()
    df['salary'] = label_encoder.fit_transform(df['salary'])

    df
        customer_id  gender  age  salary  purchased
    0             1       0    0       1          0
    1             2       0    0       0          0
    2             3       0    1       2          0
    3             4       1    0       0          0
    4             5       1    1       2          0
    5             6       0    1       1          0
    6             7       1    0       2          0
    7             8       1    1       0          0
    8             9       1    1       1          0
    9            10       1    0       0          0
    10           11       0    1       0          0
    11           12       0    0       1          0
    12           13       1    1       1          0
    13           14       1    1       1          0
    14           15       1    1       2          1
    15           16       1    1       0          0
    16           17       1    1       1          0
    17           18       0    0       0          0
    18           19       0    0       2          0
    19           20       0    0       2          0

    # input
    x = df.iloc[:, :-1]
    # output
    y = df.iloc[:, 4]

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)
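As an aside, l_encoder = {} above is created but never filled. Since the end goal is to translate the encoded integers back to their original strings, it can help to keep the fitted encoder around, for example keyed by column name; this is a minimal sketch of my own, not something the original code does:

    # Keep the fitted encoder so the original labels can be recovered later.
    l_encoder['salary'] = label_encoder

    # classes_ lists the original labels in encoded order: classes_[0] is the label for code 0, etc.
    print(l_encoder['salary'].classes_)

    # inverse_transform maps the encoded integers back to the original strings.
    print(l_encoder['salary'].inverse_transform(df['salary']))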

Separating the customer_id column:

    X_train_cust = X_train.pop('customer_id')
    X_test_cust = X_test.pop('customer_id')

Fitting a logistic regression model:

    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)

Building a lime chart:

    import lime
    import lime.lime_tabular

    explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                                       feature_names=X_train.columns,
                                                       verbose=True, mode='classification')
    exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
    exp.as_pyplot_figure()

[Lime chart for the explained test instance: the categorical features appear with their encoded integer values]

The Lime chart displays the encoded feature/column values, but I need the original values. For example, where the chart shows 0 for gender, I need it displayed as female.
Could someone please let me know how to fix this?

Answer 1

Score: 1


You can use:

    # Your direct mapping dictionary
    dmap = {'gender': {'female': 0, 'male': 1},
            'age': {'young': 0, 'middle aged': 1},
            'salary': {'low salary': 0, 'medium salary': 1, 'high salary': 2}}

    # Reverse mapping dictionary (not used here)
    rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}

    # Categorical names: col0 -> gender, col1 -> age, col2 -> salary
    cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}

    # Now use
    explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                                       feature_names=X_train.columns,
                                                       categorical_features=[0, 1, 2],  # <- first 3 columns
                                                       categorical_names=cmap,          # <- int to string
                                                       verbose=True, mode='classification')
    exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
    exp.as_pyplot_figure()

[Lime chart for the same test instance: the categorical features now appear with their original string values]
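One caveat about the hand-written dmap: the gender and age entries match the explicit .map() calls in the question, but salary was encoded with LabelEncoder, which assigns codes in sorted (alphabetical) label order, so 'low salary'/'medium salary'/'high salary' are not guaranteed to be 0/1/2. A safer sketch, assuming the fitted label_encoder from the preprocessing step is still in scope, builds that entry from the encoder itself:

    # gender and age were mapped by hand, so their encoded order is known;
    # for salary, take the order directly from the fitted LabelEncoder:
    # classes_[i] is the original label that was encoded as i.
    cmap = {0: ['female', 'male'],
            1: ['young', 'middle aged'],
            2: list(label_encoder.classes_)}

The rest of the answer (categorical_features, explain_instance, as_pyplot_figure) stays the same.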

Read this tutorial.
