OneHotEncoder -- keep feature names after encoding categorical variables

### Question

After encoding categorical columns as numbers and pivoting from long to wide into a sparse matrix, I am trying to retrieve the category labels for the column names. I need this information to interpret the model in a later step.

### Solution

Below is my solution, which is really convoluted. Please let me know if you have a better way:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Example dataframe
data = {
    'id': [13, 13, 14, 14, 14, 15],
    'name': ['alex', 'mary', 'alex', 'barry', 'john', 'john'],
    'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
    'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']
}
df = pd.DataFrame(data)

# Create dictionaries from original dataframe to save categories
# Part of the convoluted solution
dcts = []
df_cols = ['categ', 'size']
for col in df_cols:
    cats = df[col].astype('category')
    dct = dict(enumerate(cats.cat.categories))
    dcts.append(dct)

# Change into category codes, otherwise sparse matrix cannot be built
for col in ['categ', 'size']:
    df[col] = df[col].astype('category').cat.codes

# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse[int]')

# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)
piv.columns = piv.columns.to_flat_index().str.join('_')

# Encoding gives poor column names
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
    piv_enc, columns=encoder.get_feature_names_out())
```

The column names look like this: `'categ_alex_-'`, `'categ_alex_2.0'`, `'categ_barry_-'`, `'categ_barry_0.0'`, while we need the original category labels, i.e. `'categ_alex_-'`, `'categ_alex_dog'`, `'categ_barry_-'`, `'categ_barry_ant'`.

Convoluted part I need advice on:

```python
# Fixing column names
piv_cols = list(piv_fin.columns)
for (dct, df_col) in zip(dcts, df_cols):
    print(df_col, dct)
    for i, piv_col in enumerate(piv_cols):
        if df_col in piv_col:
            if piv_col[-1:] != '-':
                piv_cols[i] = piv_col[:-2] + '_' + dct[int(piv_col[-1:])]
piv_fin.columns = piv_cols
```

I'm sure there's a better way; perhaps OneHotEncoder can use category labels directly? Thanks for your help!

### Answer 1

Score: 2

You can make things a bit easier by using a dictionary instead of a list to save the categories:

```python
# Create dictionaries from original dataframe to save categories
dcts = {}  # instead of []
df_cols = ["categ", "size"]
for col in df_cols:
    cats = df[col].astype("category")
    dct = dict(enumerate(cats.cat.categories))
    dcts[col] = dct  # instead of dcts.append(dct)
```
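To make the shape of that mapping concrete, here is a minimal sketch that builds the same dictionary with a comprehension on the example columns. It assumes the default pandas behaviour for string data, where category codes follow the sorted order of the labels.

```python
import pandas as pd

# Same example columns as in the question
df = pd.DataFrame({
    "categ": ["dog", "cat", "dog", "ant", "fox", "seal"],
    "size": ["big", "small", "big", "tiny", "medium", "big"],
})

# One {code: label} mapping per column, keyed by column name
dcts = {
    col: dict(enumerate(df[col].astype("category").cat.categories))
    for col in ["categ", "size"]
}

print(dcts["categ"])  # {0: 'ant', 1: 'cat', 2: 'dog', 3: 'fox', 4: 'seal'}
```

Keying by column name means a later lookup needs only the column prefix, not the position of the column in a list.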

Then, using Python's built-in `str.replace`:

```python
# Fixing column names
piv_cols = [
    col.replace(col[-1], dcts[col.split("_")[0]][int(col[-1])])
    if str.isnumeric(col[-1])
    else col
    for col in piv_fin.columns
]
```
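The one-liner above looks only at the last character, so it may misbehave when a code has more than one digit, when the encoder emits float-like suffixes such as `2.0`, or when the same digit also appears elsewhere in the name. A slightly more defensive variant could split on the last underscore instead. This is only a sketch: `relabel` is a hypothetical helper, and it assumes, like the answer, that the original column names (`categ`, `size`) contain no underscore.

```python
def relabel(col, dcts):
    """Swap a trailing numeric code in 'feature_name_code' for its label."""
    prefix, _, code = col.rpartition("_")
    try:
        code_int = int(float(code))  # accepts '2' as well as '2.0'
    except (ValueError, OverflowError):
        return col  # non-numeric suffix such as '-' is left untouched
    feature = prefix.split("_")[0]  # assumes the feature name has no underscore
    return f"{prefix}_{dcts[feature][code_int]}"


# Mappings as produced by the dictionary step for the example data
dcts = {
    "categ": {0: "ant", 1: "cat", 2: "dog", 3: "fox", 4: "seal"},
    "size": {0: "big", 1: "medium", 2: "small", 3: "tiny"},
}

cols = ["categ_alex_-", "categ_alex_2", "categ_barry_0.0", "size_john_1"]
print([relabel(c, dcts) for c in cols])
# ['categ_alex_-', 'categ_alex_dog', 'categ_barry_ant', 'size_john_medium']
```

With the real data this would be applied as `piv_fin.columns = [relabel(c, dcts) for c in piv_fin.columns]`.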

So that:

```python
print(piv_cols)
# Output
['categ_alex_-',
 'categ_alex_dog',
 'categ_barry_-',
 'categ_barry_ant',
 'categ_john_-',
 'categ_john_fox',
 'categ_john_seal',
 'categ_mary_-',
 'categ_mary_cat',
 'size_alex_-',
 'size_alex_big',
 'size_barry_-',
 'size_barry_tiny',
 'size_john_-',
 'size_john_big',
 'size_john_medium',
 'size_mary_-',
 'size_mary_small']
```

Posted by huangapple on April 13, 2023, 21:23:47
Link: https://go.coder-hub.com/76005968.html