How to generate numeric mapping for categorical columns in pandas?
Question
I want to manipulate categorical data in a pandas DataFrame and then convert it to a NumPy array for model training.
Say I have the following DataFrame:
import numpy as np
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
     c1 c2
0     a  d
1     b  e
2  None  f
Now I want to "compress the categories" horizontally, as follows:
  compressed_categories
0 c1-a, c2-d      <--- this could be a string, e.g. "c1-a, c2-d", an array ["c1-a", "c2-d"], or categorical data
1 c1-b, c2-e
2 c1-nan, c2-f
Next, I want to generate a dictionary/vocabulary from the unique values in compressed_categories, plus a "nan" entry for each column, e.g.:
volcab = {
"c1-a": 0,
"c1-b": 1,
"c1-c": 2,
"c1-nan": 3,
"c2-d": 4,
"c2-e": 5,
"c2-f": 6,
"c2-nan": 7,
}
Then I can numerically encode the rows as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
My ultimate goal is to make it easy to convert each row to a NumPy array, so that I can then convert the result to a tensor.
input_data = np.asarray(df2['compressed_categories_numeric'].tolist())
Then I can train my model using input_data.
Can anyone show me an example of how to do this series of conversions? Thanks in advance!
Answer 1
Score: 2
To build the volcab dictionary and compressed_categories_numeric, you can use:
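import numpy as np
# Turn None into NaN, stringify (NaN becomes 'nan'), and prefix each value with its column name
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
# Map each sorted unique label to an integer index
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)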
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
     c1 c2 compressed_categories_numeric
0     a  d                        [0, 3]
1     b  e                        [1, 4]
2  None  f                        [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
       [1, 4],
       [2, 5]])
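Note that the volcab above only contains labels that actually occur in the data, so it has no "c2-nan" entry, which the question asked for. A possible variant, as a minimal sketch: it reserves a "nan" label for every column, also builds the string form of compressed_categories, and encodes with Series.map instead of replace. The names labeled, compressed, tokens and encoded are made up for illustration, and the sketch assumes every column holds strings or missing values.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

# Prefix every cell with its column name; missing values become "<column>-nan".
labeled = df2.apply(lambda col: col.name + "-" + col.fillna("nan").astype(str))

# Optional human-readable form, e.g. "c1-a, c2-d".
compressed = labeled.apply(", ".join, axis=1)

# Vocabulary: every label seen in the data plus one "nan" label per column,
# sorted so the index assignment is deterministic.
tokens = set(labeled.to_numpy().ravel())
tokens.update(f"{col}-nan" for col in df2.columns)
volcab = {token: i for i, token in enumerate(sorted(tokens))}

# Encode column by column and go straight to a 2-D integer array.
encoded = labeled.apply(lambda col: col.map(volcab))
input_data = encoded.to_numpy(dtype=np.int64)

print(volcab)
print(input_data)
```

With this data, volcab gains "c2-nan": 6 while the encoded rows stay [[0, 3], [1, 4], [2, 5]]. Going through to_numpy on the encoded frame avoids storing Python lists in a DataFrame column; if the list column is still wanted, encoded.agg(list, axis=1) should give the same result as in the answer above.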