OneHotEncoder -- keep feature names after encoding categorical variables

### Question
After encoding categorical columns as numbers and pivoting from long to wide format into a sparse matrix, I am trying to retrieve the category labels to use in the column names. I need this information to interpret the model in a later step.

### Solution
Below is my solution, which is really convoluted. Please let me know if you have a better way:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Example dataframe
data = {
    'id': [13, 13, 14, 14, 14, 15],
    'name': ['alex', 'mary', 'alex', 'barry', 'john', 'john'],
    'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
    'size': ['big', 'small', 'big', 'tiny', 'medium', 'big'],
}
df = pd.DataFrame(data)

# Create dictionaries from original dataframe to save categories
# Part of the convoluted solution
dcts = []
df_cols = ['categ', 'size']

for col in df_cols:
    cats = df[col].astype('category')
    dct = dict(enumerate(cats.cat.categories))
    dcts.append(dct)

# Change into category codes, otherwise sparse matrix cannot be built
for col in df_cols:
    df[col] = df[col].astype('category').cat.codes

# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse[int]')

# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)

# Flatten the (column, name) MultiIndex into names such as 'categ_alex'
piv.columns = piv.columns.to_flat_index().str.join('_')

# Encoding gives poor column names
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
    piv_enc, columns=encoder.get_feature_names_out())
```

The column names look like this: `'categ_alex_-'`, `'categ_alex_2.0'`, `'categ_barry_-'`, `'categ_barry_0.0'`, while we need the original category labels, i.e. `'categ_alex_-'`, `'categ_alex_dog'`, `'categ_barry_-'`, `'categ_barry_ant'`.

### Convoluted part I need advice on

```python
# Fixing column names
piv_cols = list(piv_fin.columns)
for (dct, df_col) in zip(dcts, df_cols):
    print(df_col, dct)
    for i, piv_col in enumerate(piv_cols):
        if df_col in piv_col:
            # Columns ending in '-' have no category code to translate
            if piv_col[-1:] != '-':
                piv_cols[i] = piv_col[:-2] + '_' + dct[int(piv_col[-1:])]

piv_fin.columns = piv_cols
```

I'm sure there's a better way. Perhaps OneHotEncoder can use category labels directly? Thanks for your help!

Answer 1

Score: 2


You can make things a bit easier by using a dictionary instead of a list to save the categories:

```python
# Create dictionaries from original dataframe to save categories
dcts = {}  # instead of []
df_cols = ["categ", "size"]

for col in df_cols:
    cats = df[col].astype("category")
    dct = dict(enumerate(cats.cat.categories))
    dcts[col] = dct  # instead of dcts.append(dct)
```
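To make the shape of `dcts` concrete, the sketch below rebuilds the same mapping in plain Python. It assumes, as pandas does when it infers categories for a string column, that the categories are the sorted unique values, so `sorted(set(...))` stands in for `cats.cat.categories`. The `values` dictionary is only an illustrative copy of the two columns.

```python
# Illustrative copy of the two categorical columns from the example dataframe
values = {
    "categ": ["dog", "cat", "dog", "ant", "fox", "seal"],
    "size": ["big", "small", "big", "tiny", "medium", "big"],
}

# One {code: label} mapping per column, keyed by column name.
# sorted(set(...)) mimics the order in which pandas infers categories.
dcts = {col: dict(enumerate(sorted(set(vals)))) for col, vals in values.items()}

print(dcts["categ"])  # codes 0..4 map to ant, cat, dog, fox, seal
print(dcts["size"])   # codes 0..3 map to big, medium, small, tiny
```

Keying by column name means a code can be looked up as `dcts["categ"][2]` without having to keep a list of mappings aligned with a list of column names.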

Then, using `str.replace` from the Python standard library:

```python
# Fixing column names
piv_cols = [
    col.replace(col[-1], dcts[col.split("_")[0]][int(col[-1])])
    if str.isnumeric(col[-1])
    else col
    for col in piv_fin.columns
]
```

So that:

```python
print(piv_cols)
# Output

['categ_alex_-',
 'categ_alex_dog',
 'categ_barry_-',
 'categ_barry_ant',
 'categ_john_-',
 'categ_john_fox',
 'categ_john_seal',
 'categ_mary_-',
 'categ_mary_cat',
 'size_alex_-',
 'size_alex_big',
 'size_barry_-',
 'size_barry_tiny',
 'size_john_-',
 'size_john_big',
 'size_john_medium',
 'size_mary_-',
 'size_mary_small']
```
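The rename above reads only the last character of each column name and replaces every occurrence of that character. This works for the sample data, but it would misfire if a name contained the same digit elsewhere, if a code had more than one digit, or if the suffix were a float such as `2.0`. A possible variant, sketched here under the assumption that the original column name (the first token) contains no underscore, splits off the suffix instead. The `relabel` helper and its `missing` parameter are hypothetical names introduced for illustration.

```python
def relabel(column, code_maps, missing="-"):
    """Swap a trailing category code for its label; leave other names alone."""
    prefix, _, suffix = column.rpartition("_")
    if not prefix or suffix == missing:
        return column
    feature = column.split("_", 1)[0]  # assumes no underscore in the feature name
    try:
        code = int(float(suffix))  # accepts "2" as well as "2.0"
    except (ValueError, OverflowError):
        return column  # suffix is not a numeric code
    labels = code_maps.get(feature, {})
    if code not in labels:
        return column
    return f"{prefix}_{labels[code]}"


# Mappings in the same {column: {code: label}} shape as dcts
dcts = {
    "categ": {0: "ant", 1: "cat", 2: "dog", 3: "fox", 4: "seal"},
    "size": {0: "big", 1: "medium", 2: "small", 3: "tiny"},
}

columns = ["categ_alex_-", "categ_alex_2", "categ_barry_0.0", "size_user2_2"]
renamed = [relabel(col, dcts) for col in columns]
print(renamed)
```

In this sketch `size_user2_2` becomes `size_user2_small`, whereas replacing every `2` in the name would also alter the `user2` part.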

huangapple
  • Published on April 13, 2023, 21:23:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76005968.html