2023-04-13 21:23:47

OneHotEncoder -- keep feature names after encoding categorical variables

### Question
After encoding categorical columns as numbers and pivoting LONG to WIDE into a sparse matrix, I am trying to retrieve the category labels for the column names. I need this information to interpret the model in a later step.
### Solution
Below is my solution, which is really convoluted. Please let me know if you have a better way:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example dataframe
data = {
    'id': [13, 13, 14, 14, 14, 15],
    'name': ['alex', 'mary', 'alex', 'barry', 'john', 'john'],
    'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
    'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']
}
df = pd.DataFrame(data)

# Create dictionaries from original dataframe to save categories
# Part of the convoluted solution
dcts = []
df_cols = ['categ', 'size']
for col in df_cols:
    cats = df[col].astype('category')
    dct = dict(enumerate(cats.cat.categories))
    dcts.append(dct)

# Change into category codes, otherwise sparse matrix cannot be built
for col in df_cols:
    df[col] = df[col].astype('category').cat.codes

# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse[int]')

# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)
piv.columns = piv.columns.to_flat_index().str.join('_')

# Encoding gives poor column names
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
    piv_enc, columns=encoder.get_feature_names_out())
```

The column names look like this: 'categ_alex_-', 'categ_alex_2.0', 'categ_barry_-', 'categ_barry_0.0', while we need the original category labels, i.e. 'categ_alex_-', 'categ_alex_dog', 'categ_barry_-', 'categ_barry_ant'.

### Convoluted part I need advice on

```python
# Fixing column names
piv_cols = list(piv_fin.columns)
for (dct, df_col) in zip(dcts, df_cols):
    print(df_col, dct)
    for i, piv_col in enumerate(piv_cols):
        if df_col in piv_col:
            if piv_col[-1:] != '-':
                piv_cols[i] = piv_col[:-2] + '_' + dct[int(piv_col[-1:])]
piv_fin.columns = piv_cols
```

I'm sure there's a better way. Perhaps OneHotEncoder can use category labels directly? Thanks for your help!

### Answer 1

Score: 2

You can make things a bit easier by using a dictionary instead of a list to save categories:

```python
# Create dictionaries from original dataframe to save categories
dcts = {}  # instead of []
df_cols = ["categ", "size"]
for col in df_cols:
    cats = df[col].astype("category")
    dct = dict(enumerate(cats.cat.categories))
    dcts[col] = dct  # instead of dcts.append(dct)
```
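
As a side note, the same mapping can be sketched as a dict comprehension. This is only an illustration: it assumes `df` still holds the original string labels (that is, it runs before the columns are converted to category codes), and it rebuilds just the two relevant columns of the example dataframe so the snippet stands alone.

```python
import pandas as pd

# Only the two categorical columns of the example dataframe are needed here.
df = pd.DataFrame({
    "categ": ["dog", "cat", "dog", "ant", "fox", "seal"],
    "size": ["big", "small", "big", "tiny", "medium", "big"],
})
df_cols = ["categ", "size"]

# Map each column name to its {code: label} dictionary.
dcts = {
    col: dict(enumerate(df[col].astype("category").cat.categories))
    for col in df_cols
}

print(dcts["categ"])  # {0: 'ant', 1: 'cat', 2: 'dog', 3: 'fox', 4: 'seal'}
print(dcts["size"])   # {0: 'big', 1: 'medium', 2: 'small', 3: 'tiny'}
```

The codes follow the sorted order of the labels, which is why code `2` in `categ` corresponds to `dog`.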

Then, use Python's built-in `str.replace`:

```python
# Fixing column names
piv_cols = [
    col.replace(col[-1], dcts[col.split("_")[0]][int(col[-1])])
    if str.isnumeric(col[-1])
    else col
    for col in piv_fin.columns
]
```
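
The list comprehension above looks only at the last character, so it relies on single-digit codes, and `str.replace` swaps every occurrence of that digit in the name. A slightly more defensive variant might look like the sketch below. The helper name `relabel` and the `example_dcts` mapping are hypothetical, and the sketch still assumes that the part before the first underscore is the original column name.

```python
def relabel(col, dcts):
    """Swap a trailing category code for its label; return other names unchanged."""
    prefix, sep, code = col.rpartition("_")
    feature = col.split("_", 1)[0]
    if not sep or feature not in dcts:
        return col  # not a "<feature>_..._<code>" style name
    try:
        idx = int(float(code))  # accepts "2" as well as "2.0"
    except (ValueError, OverflowError):
        return col  # e.g. the "-" placeholder
    return f"{prefix}_{dcts[feature][idx]}"


# Hypothetical mapping in the same shape as dcts above.
example_dcts = {"categ": {0: "ant", 2: "dog", 12: "zebra"}}

print(relabel("categ_alex_2", example_dcts))    # categ_alex_dog
print(relabel("categ_alex_2.0", example_dcts))  # categ_alex_dog
print(relabel("categ_user2_12", example_dcts))  # categ_user2_zebra
print(relabel("categ_alex_-", example_dcts))    # categ_alex_-
```

It would be applied as `piv_cols = [relabel(col, dcts) for col in piv_fin.columns]`.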

So that:

```python
print(piv_cols)
# Output
['categ_alex_-',
 'categ_alex_dog',
 'categ_barry_-',
 'categ_barry_ant',
 'categ_john_-',
 'categ_john_fox',
 'categ_john_seal',
 'categ_mary_-',
 'categ_mary_cat',
 'size_alex_-',
 'size_alex_big',
 'size_barry_-',
 'size_barry_tiny',
 'size_john_-',
 'size_john_big',
 'size_john_medium',
 'size_mary_-',
 'size_mary_small']
```
