How to generate a numeric mapping for categorical columns in pandas?
Question

I want to manipulate categorical data in a pandas DataFrame and then convert it to a NumPy array for model training.

Say I have the following DataFrame:

    >>> import pandas as pd
    >>> df2 = pd.DataFrame({"c1": ['a', 'b', None], "c2": ['d', 'e', 'f']})
    >>> df2
         c1 c2
    0     a  d
    1     b  e
    2  None  f

Now I want to "compress the categories" horizontally, as follows:

      compressed_categories
    0  c1-a, c2-d      <--- this could be a string, e.g. "c1-a, c2-d", an array ["c1-a", "c2-d"], or categorical data
    1  c1-b, c2-e
    2  c1-nan, c2-f
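
One way to sketch this "compression" step, assuming the string form is what is wanted. The helper name prefix_with_column is made up for illustration; missing values are turned into the literal string "nan" before the column name is prepended.

```python
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

def prefix_with_column(col: pd.Series) -> pd.Series:
    # Hypothetical helper: replace missing values with the string "nan",
    # then prepend "<column name>-" to every value.
    filled = col.astype(object).where(col.notna(), "nan").astype(str)
    return col.name + "-" + filled

# One prefixed token per cell, same shape as df2.
prefixed = df2.apply(prefix_with_column)

# Join the tokens of each row into a single string.
compressed_categories = prefixed.apply(", ".join, axis=1)
print(compressed_categories.tolist())
```

If the array form is preferred, prefixed.to_numpy().tolist() gives one list of tokens per row instead of a joined string.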

Next, I want to generate a dictionary/vocabulary from the unique values in compressed_categories, plus a "nan" entry for every column, e.g.:

    volcab = {
        "c1-a": 0,
        "c1-b": 1,
        "c1-c": 2,
        "c1-nan": 3,
        "c2-d": 4,
        "c2-e": 5,
        "c2-f": 6,
        "c2-nan": 7,
    }
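
A minimal sketch of building such a vocabulary so that every column gets a "nan" slot even when it has no missing values. The function name build_vocab is an assumption, not part of pandas. Because "c1-c" does not occur in the sample frame, the indices produced here differ from the listing above.

```python
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

def build_vocab(df: pd.DataFrame) -> dict:
    # Hypothetical helper: for each column, list the observed non-missing
    # categories in sorted order, then reserve one extra "<column>-nan" entry.
    vocab = {}
    for col in df.columns:
        values = sorted(df[col].dropna().astype(str).unique())
        for value in values + ["nan"]:
            vocab[f"{col}-{value}"] = len(vocab)
    return vocab

volcab = build_vocab(df2)
print(volcab)
```

This sketch assumes no real category is literally named "nan"; if one is, a different placeholder would be needed.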

That way I can encode them numerically as follows:

      compressed_categories_numeric
    0                        [0, 4]
    1                        [1, 5]
    2                        [3, 6]

My ultimate goal is to make each row easy to convert to a NumPy array, which I can then convert to a tensor.

    input_data = np.asarray(df['compressed_categories_numeric'].tolist())

Then I can train my model using input_data.

Can anyone show me an example of how to do this series of conversions? Thanks in advance!
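
The encoding step can be sketched as below, using the vocabulary exactly as written in the question (including the unused "c1-c" entry). The function name encode is made up for illustration. It builds the 2-D integer array directly instead of going through a column of lists.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

# Vocabulary copied from the question.
volcab = {
    "c1-a": 0, "c1-b": 1, "c1-c": 2, "c1-nan": 3,
    "c2-d": 4, "c2-e": 5, "c2-f": 6, "c2-nan": 7,
}

def encode(df: pd.DataFrame, vocab: dict) -> np.ndarray:
    # Hypothetical helper: look up "<column>-<value>" for every cell,
    # using "<column>-nan" for missing values.
    encoded = {}
    for col in df.columns:
        keys = [f"{col}-nan" if pd.isna(v) else f"{col}-{v}" for v in df[col]]
        encoded[col] = [vocab[k] for k in keys]  # KeyError on unseen categories
    return pd.DataFrame(encoded, index=df.index).to_numpy(dtype=np.int64)

input_data = encode(df2, volcab)
print(input_data)
```

A category that is missing from the vocabulary raises a KeyError here; whether to fail loudly or fall back to the "nan" index is a design choice that depends on the model.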

Answer 1

Score: 2

To build the volcab dictionary and the compressed_categories_numeric column, you can use:

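    import numpy as np
    # Prefix every value with its column name; missing values become the string 'nan'
    df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
    # np.unique returns the distinct tokens in sorted order, which fixes the indices
    volcab = {k: v for v, k in enumerate(np.unique(df3))}
    df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)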

Output:

    >>> volcab
    {'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
    >>> df2
         c1 c2 compressed_categories_numeric
    0     a  d                        [0, 3]
    1     b  e                        [1, 4]
    2  None  f                        [2, 5]
    >>> np.array(df2['compressed_categories_numeric'].tolist())
    array([[0, 3],
           [1, 4],
           [2, 5]])
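
As a variant of the snippet above, the same three steps can be sketched with Series.map instead of DataFrame.replace, producing the integer matrix directly without a list-valued column. This is an illustrative rewrite, not the answerer's code. Like the original, it only contains tokens that actually occur, so there is no "c2-nan" entry.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

# Step 1: prefix each value with its column name; missing values become "nan".
df3 = df2.apply(
    lambda col: col.name + "-" + col.astype(object).where(col.notna(), "nan").astype(str)
)

# Step 2: collect the distinct tokens in sorted order and number them.
tokens = np.unique(df3.to_numpy().ravel())
volcab = {token: index for index, token in enumerate(tokens)}

# Step 3: look up every token column by column and take the 2-D integer array.
input_data = df3.apply(lambda col: col.map(volcab)).to_numpy(dtype=np.int64)
print(volcab)
print(input_data)
```

Note that the ordering is plain string sorting, so with many columns a name such as "c10" would sort before "c2".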

huangapple
  • Posted on February 6, 2023, 15:09:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75358285.html