OneHotEncoder -- keep feature names after encoding categorical variables
### Question
After encoding categorical columns as numbers and pivoting from long to wide into a sparse matrix, I am trying to retrieve the category labels for the column names. I need this information to interpret the model in a later step.
### Solution
Below is my solution, which is really convoluted. Please let me know if you have a better way:
```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Example dataframe
data = {
    'id': [13, 13, 14, 14, 14, 15],
    'name': ['alex', 'mary', 'alex', 'barry', 'john', 'john'],
    'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
    'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']
}
df = pd.DataFrame(data)

# Create dictionaries from original dataframe to save categories
# Part of the convoluted solution
dcts = []
df_cols = ['categ', 'size']
for col in df_cols:
    cats = df[col].astype('category')
    dct = dict(enumerate(cats.cat.categories))
    dcts.append(dct)

# Change into category codes, otherwise sparse matrix cannot be built
for col in ['categ', 'size']:
    df[col] = df[col].astype('category').cat.codes

# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse[int]')

# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)
piv.columns = piv.columns.to_flat_index().str.join('_')

# Encoding gives poor column names
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
    piv_enc, columns=encoder.get_feature_names_out())
```
The column names look like this: `'categ_alex_-'`, `'categ_alex_2.0'`, `'categ_barry_-'`, `'categ_barry_0.0'`, while we need the original category labels, i.e. `'categ_alex_-'`, `'categ_alex_dog'`, `'categ_barry_-'`, `'categ_barry_ant'`.

This is the convoluted part I need advice on:
```python
# Fixing column names
piv_cols = list(piv_fin.columns)
for (dct, df_col) in zip(dcts, df_cols):
    print(df_col, dct)
    for i, piv_col in enumerate(piv_cols):
        if df_col in piv_col:
            if piv_col[-1:] != '-':
                piv_cols[i] = piv_col[:-2] + '_' + dct[int(piv_col[-1:])]
piv_fin.columns = piv_cols
```
I'm sure there's a better way. Perhaps OneHotEncoder can use category labels directly? Thanks for your help!
### Answer 1
Score: 2
You can make things a bit easier by using a dictionary instead of a list to save categories:
```python
# Create dictionaries from original dataframe to save categories
dcts = {}  # instead of []
df_cols = ["categ", "size"]
for col in df_cols:
    cats = df[col].astype("category")
    dct = dict(enumerate(cats.cat.categories))
    dcts[col] = dct  # instead of dcts.append(dct)
```
Then, using `str.replace` from the Python standard library:
```python
# Fixing column names
piv_cols = [
    col.replace(col[-1], dcts[col.split("_")[0]][int(col[-1])])
    if str.isnumeric(col[-1])
    else col
    for col in piv_fin.columns
]
```
So that:
```python
print(piv_cols)
# Output
['categ_alex_-',
 'categ_alex_dog',
 'categ_barry_-',
 'categ_barry_ant',
 'categ_john_-',
 'categ_john_fox',
 'categ_john_seal',
 'categ_mary_-',
 'categ_mary_cat',
 'size_alex_-',
 'size_alex_big',
 'size_barry_-',
 'size_barry_tiny',
 'size_john_-',
 'size_john_big',
 'size_john_medium',
 'size_mary_-',
 'size_mary_small']
```
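One caveat about the list comprehension above: `col.replace(col[-1], ...)` rewrites every occurrence of the last character, and `int(col[-1])` reads a single digit, so names that contain digits, or features with ten or more categories, would come out wrong. A slightly more defensive variant could split off the suffix instead. The sketch below is only an illustration: `restore_labels` is a hypothetical helper name, and the mapping is written out by hand to mirror what `dcts` holds for the example data.

```python
def restore_labels(columns, code_maps):
    """Replace a trailing integer code in each column name with its label.

    columns   -- names shaped like '<feature>_<name>_<code>'
    code_maps -- {feature: {code: label}}, like the `dcts` dictionary above
    """
    renamed = []
    for col in columns:
        prefix, _, suffix = col.rpartition("_")  # split at the last underscore
        feature = col.split("_", 1)[0]           # leading part names the feature
        try:
            code = int(float(suffix))            # accepts '2' as well as '2.0'
        except (ValueError, OverflowError):
            renamed.append(col)                  # e.g. the '-' placeholder: keep as-is
            continue
        renamed.append(f"{prefix}_{code_maps[feature][code]}")
    return renamed


# Hand-written stand-in for `dcts` (categories in sorted order, as pandas stores them)
code_maps = {
    "categ": {0: "ant", 1: "cat", 2: "dog", 3: "fox", 4: "seal"},
    "size": {0: "big", 1: "medium", 2: "small", 3: "tiny"},
}
example = ["categ_alex_-", "categ_alex_2", "categ_r2d2_2.0", "size_barry_3"]
print(restore_labels(example, code_maps))
# ['categ_alex_-', 'categ_alex_dog', 'categ_r2d2_dog', 'size_barry_tiny']
```

The helper still assumes that feature names contain no underscore, as the original `col.split("_")[0]` does.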