February 6, 2023, 15:09:09

How to generate numeric mapping for categorical columns in pandas?

Question

I want to manipulate categorical data using a pandas DataFrame and then convert it to a NumPy array for model training.

Say I have the following data frame in pandas.

import pandas as pd
df2 = pd.DataFrame({"c1": ['a', 'b', None], "c2": ['d', 'e', 'f']})
>>> df2
     c1 c2
0     a  d
1     b  e
2  None  f

Now I want to "compress the categories" horizontally, as follows:

   compressed_categories
0     c1-a,   c2-d           <--- this could be a string, e.g. "c1-a, c2-d", an array ["c1-a", "c2-d"], or categorical data
1     c1-b,   c2-e
2     c1-nan, c2-f
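
One possible way to build this column, shown only as a minimal sketch: it assumes missing values should appear as the literal string "nan", and the names tokens and compressed are illustrative rather than anything fixed by the question.

```python
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

# Replace missing values with the string "nan", then prefix every value
# with its column name, so "a" in column c1 becomes "c1-a".
tokens = df2.fillna("nan").astype(str).apply(lambda col: col.name + "-" + col)

# Join the per-column tokens of each row into one string.
compressed = tokens.agg(", ".join, axis=1)
print(compressed.tolist())
# ['c1-a, c2-d', 'c1-b, c2-e', 'c1-nan, c2-f']
```

Keeping tokens as a DataFrame (one token per cell) is usually more convenient for the later encoding step than the joined string.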

Next, I want to generate a dictionary/vocabulary from the unique values in compressed_categories, plus a "nan" entry for every column, e.g.:

volcab = {
"c1-a": 0,
"c1-b": 1,
"c1-c": 2,
"c1-nan": 3,
"c2-d": 4,
"c2-e": 5,
"c2-f": 6,
"c2-nan": 7,
}
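
A vocabulary of this shape, with a "nan" slot reserved for every column, could be sketched as follows. The helper build_vocab is hypothetical, and it only assigns ids to values that actually occur in the frame, so a value such as "c1-c" that is absent from the sample data would not get an id here.

```python
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

def build_vocab(df):
    """Map every "<column>-<value>" token to an integer id (hypothetical helper)."""
    tokens = []
    for col in df.columns:
        # Observed, non-missing values of this column in sorted order.
        values = sorted(df[col].dropna().astype(str).unique())
        # Always reserve a "nan" token, even if the column has no missing values.
        tokens.extend(f"{col}-{v}" for v in values + ["nan"])
    return {token: idx for idx, token in enumerate(tokens)}

vocab = build_vocab(df2)
print(vocab)
# {'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5, 'c2-nan': 6}
```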

Then I can numerically encode each row as follows:

   compressed_categories_numeric
0     [0,   4]
1     [1,   5]
2     [3,   6]

My ultimate goal is to make it easy to convert each row to a NumPy array, so that I can then convert the result to a tensor.

input_data = np.asarray(df2['compressed_categories_numeric'].tolist())

Then I can train my model using input_data.
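
As a rough sketch of that last step, the list column can be skipped and the 2-D array built directly by looking each token up in the vocabulary. This assumes the vocabulary is already available as a plain dict (written out by hand here) and that every token in the frame has an entry in it.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

# Vocabulary written out by hand for this example.
vocab = {"c1-a": 0, "c1-b": 1, "c1-c": 2, "c1-nan": 3,
         "c2-d": 4, "c2-e": 5, "c2-f": 6, "c2-nan": 7}

# One "<column>-<value>" token per cell, with missing values shown as "nan".
tokens = df2.fillna("nan").astype(str).apply(lambda col: col.name + "-" + col)

# Series.map looks each token up in the dict; a token missing from the
# vocabulary would become NaN and make the integer conversion below fail.
encoded = tokens.apply(lambda col: col.map(vocab))
input_data = encoded.to_numpy(dtype=np.int64)
print(input_data.tolist())
# [[0, 4], [1, 5], [3, 6]]
```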

Can anyone please show me an example of how to do this series of conversions? Thanks in advance!

Answer 1

Score: 2

To build the volcab dictionary and the compressed_categories_numeric column, you can use:

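import numpy as np
# Stringify values (missing -> 'nan') and prefix each with its column name
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
# Number the sorted unique tokens, then swap tokens for ids row by row
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)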

Output:

>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
     c1 c2 compressed_categories_numeric
0     a  d                        [0, 3]
1     b  e                        [1, 4]
2  None  f                        [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
       [1, 4],
       [2, 5]])
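
As an alternative sketch (not the answerer's code), pd.factorize with sort=True can produce the integer ids and the vocabulary in a single pass, without going through replace. The variable names below are illustrative, and the result should match the array shown above for this sample frame.

```python
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})

# One "<column>-<value>" token per cell, with missing values shown as "nan".
tokens = df2.fillna("nan").astype(str).apply(lambda col: col.name + "-" + col)

# Flatten the tokens row by row and factorize them; sort=True numbers the
# unique tokens in sorted order.
codes, uniques = pd.factorize(tokens.to_numpy().ravel(), sort=True)

vocab = {token: idx for idx, token in enumerate(uniques)}
input_data = codes.reshape(tokens.shape)
print(input_data.tolist())
# [[0, 3], [1, 4], [2, 5]]
```

Like the answer above, this only numbers tokens that occur in the data, so a column without missing values gets no "nan" entry.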
