Frequency encoding - should I make dummy columns?

Question
I am busy prepping data to train a machine learning model. Many of the features can work well with one-hot encoding. For one feature, the frequency is related to the target variable. I have been reading up a bit about frequency encoding and everything I find has to do with replacing the category with the frequency, like so:
CategoryName
------------
Blue
Blue
Blue
Yellow
Yellow
Red
Turning into:
CategoryName
------------
3
3
3
2
2
1
If I had done one-hot, I would have had:
Blue Yellow Red
1 0 0
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
All of the above may be related to a single line in the data I want to join it to, so I was wondering if this would be useful:
Blue Yellow Red
3 2 1
If so, is there a simple way to do it, like get_dummies()?
Answer 1 (Score: 2)
With pandas you can use a self-groupby with transform('size'):
out = df.groupby('CategoryName')['CategoryName'].transform('size')
If you have already calculated the dummies, just map their sum:
dummies = pd.get_dummies(df['CategoryName'])
out = df['CategoryName'].map(dummies.sum())
Output:
0 3
1 3
2 3
3 2
4 2
5 1
Name: CategoryName, dtype: int64
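Both approaches from this answer can be run end-to-end on the six-row example; the DataFrame construction below is assumed from the question's sample column, not part of the original post:

```python
import pandas as pd

# Reconstruct the example column from the question
df = pd.DataFrame(
    {"CategoryName": ["Blue", "Blue", "Blue", "Yellow", "Yellow", "Red"]}
)

# Approach 1: self-groupby; every row receives the size of its own group
freq = df.groupby("CategoryName")["CategoryName"].transform("size")

# Approach 2: sum the one-hot dummies per column, then map each value
# back to its column total
dummies = pd.get_dummies(df["CategoryName"])
freq_via_dummies = df["CategoryName"].map(dummies.sum())

print(freq.tolist())              # [3, 3, 3, 2, 2, 1]
print(freq_via_dummies.tolist())  # [3, 3, 3, 2, 2, 1]
```

Both give the same per-row frequency series shown in the output above.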
Answer 2 (Score: 2)
You can use value_counts as a mapping dict:
>>> df['CategoryName'].replace(df.value_counts())
0 3
1 3
2 3
3 2
4 2
5 1
Name: CategoryName, dtype: int64
Note: take care if some values have the same frequency: with ['Blue', 'Blue', 'Yellow', 'Yellow', 'Red'], the result will be [2, 2, 2, 2, 1], so the two categories become indistinguishable.
Update
To get only one line:
>>> df.value_counts().to_frame().T
Blue Yellow Red
0 3 2 1
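The same one-row frame can also be built from the dummies themselves, which is closer to the get_dummies() workflow the question mentions; this is a sketch assuming the six-row example (note get_dummies orders columns alphabetically):

```python
import pandas as pd

df = pd.DataFrame(
    {"CategoryName": ["Blue", "Blue", "Blue", "Yellow", "Yellow", "Red"]}
)

# One-hot encode, then collapse to a single row of per-category counts
row = pd.get_dummies(df["CategoryName"]).sum().to_frame().T
print(row)
#    Blue  Red  Yellow
# 0     3    1       2
```

This single-row frame is then easy to join onto another table by key.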
Update 2
> The target variable can be influenced by how many applications the orchard has had, not just whether it had an application or not.
That makes sense, so in this case all fertilizers should be processed as separate variables:
>>> df
Orchad Fertilizer
0 A F1
1 A F1
2 A F1
3 A F1
4 A F2
5 A F2
6 B F1
7 B F1
8 B F1
9 B F3
>>> pd.crosstab(df['Orchad'], df['Fertilizer'])
Fertilizer F1 F2 F3
Orchad
A 4 2 0
B 3 0 1
# Use normalize={'all'|'index'|'columns'} or StandardScaler
>>> pd.crosstab(df['Orchad'], df['Fertilizer'], normalize='columns')
Fertilizer F1 F2 F3
Orchad
A 0.571429 1.0 0.0
B 0.428571 0.0 1.0
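The crosstab step above can be reproduced as follows; the toy DataFrame is reconstructed from the answer's printout (identifiers kept exactly as shown, including 'Orchad'):

```python
import pandas as pd

# Toy data matching the answer's printout
df = pd.DataFrame({
    "Orchad": ["A"] * 6 + ["B"] * 4,
    "Fertilizer": ["F1", "F1", "F1", "F1", "F2", "F2",
                   "F1", "F1", "F1", "F3"],
})

# Raw application counts per orchard/fertilizer pair
counts = pd.crosstab(df["Orchad"], df["Fertilizer"])
print(counts)

# Normalize within each fertilizer column so each column sums to 1
shares = pd.crosstab(df["Orchad"], df["Fertilizer"], normalize="columns")
print(shares)
```

With normalize='columns', each fertilizer column gives the share of its applications that went to each orchard (e.g. F1: 4/7 for A, 3/7 for B), matching the figures above.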