How to generate numeric mapping for categorical columns in pandas?

Question

I want to manipulate categorical data in a pandas DataFrame and then convert it to a NumPy array for model training.

Say I have the following DataFrame:

import numpy as np
import pandas as pd
df2 = pd.DataFrame({"c1": ['a', 'b', None], "c2": ['d', 'e', 'f']})

>>> df2
     c1 c2
0     a  d
1     b  e
2  None  f

Now I want to "compress the categories" horizontally, as follows:

   compressed_categories
0     c1-a,   c2-d           <--- this could be a string, e.g. "c1-a, c2-d", an array ["c1-a", "c2-d"], or categorical data
1     c1-b,   c2-e
2     c1-nan, c2-f

Next, I want to generate a dictionary/vocabulary from the unique values in compressed_categories, plus a "nan" entry for each column, e.g.:

volcab = {
"c1-a": 0,
"c1-b": 1,
"c1-c": 2,
"c1-nan": 3,
"c2-d": 4,
"c2-e": 5,
"c2-f": 6,
"c2-nan": 7,
}

That way I can numerically encode the rows as follows:

   compressed_categories_numeric
0     [0,   4]
1     [1,   5]
2     [3,   6]

My ultimate goal is to make it easy to convert each row to a NumPy array, which I can then convert to a tensor:

input_data = np.asarray(df2['compressed_categories_numeric'].tolist())

Then I can train my model using input_data.

Can anyone please show me an example of how to do this series of conversions? Thanks in advance!

Answer 1

Score: 2

To build the volcab dictionary and the compressed_categories_numeric column, you can use:

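import numpy as np
# Prefix each value with its column name; missing values become '<col>-nan'
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
# np.unique returns the sorted unique labels, which are then numbered from 0
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)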

Output:

>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}

>>> df2
     c1 c2 compressed_categories_numeric
0     a  d                        [0, 3]
1     b  e                        [1, 4]
2  None  f                        [2, 5]

>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
       [1, 4],
       [2, 5]])
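
Note that the vocabulary above only contains labels that actually occur in the data, so it has no "c2-nan" entry, whereas the question asks for a "nan" entry for every column. A minimal sketch of one way to get that is shown below. It is an illustration under assumptions, not part of the original answer: the names cols, labels, compressed, tokens and vocab are made up for this example, and the mapping is done with Series.map rather than DataFrame.replace.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})
cols = ["c1", "c2"]  # assumed: the categorical columns to encode

# Prefix every value with its column name; missing values become "<col>-nan".
labels = pd.DataFrame(
    {c: [f"{c}-nan" if pd.isna(v) else f"{c}-{v}" for v in df2[c]] for c in cols},
    index=df2.index,
)

# Optional string form of each row, e.g. "c1-a, c2-d".
compressed = pd.Series([", ".join(row) for row in labels.to_numpy()], index=df2.index)

# Vocabulary: every observed label plus one "<col>-nan" entry per column,
# numbered in sorted order.
tokens = set(labels.to_numpy().ravel()) | {f"{c}-nan" for c in cols}
vocab = {token: i for i, token in enumerate(sorted(tokens))}

# Encode column by column; the result is already a 2-D integer array,
# so no intermediate column of Python lists is needed.
input_data = np.column_stack([labels[c].map(vocab).to_numpy() for c in cols])
```

With this vocabulary the ids for the sample data are the same as in the output above, because "c2-nan" sorts last. Keeping the reserved "nan" entries should mean that a missing value in c2 at prediction time still maps to a known id, provided the same vocab is reused.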

huangapple
  • Posted on February 6, 2023, 15:09:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75358285.html