Frequency encoding - should I make dummy columns?

Question

I am busy prepping data to train a machine learning model. Many of the features can work well with one-hot encoding. For one feature, the frequency is related to the target variable. I have been reading up a bit about frequency encoding and everything I find has to do with replacing the category with the frequency, like so:

CategoryName
------------
Blue
Blue
Blue
Yellow
Yellow
Red

Turning into:

CategoryName
------------
3
3
3
2
2
1

If I had done one-hot, I would have had:

Blue   Yellow   Red
1      0        0
1      0        0
1      0        0
0      1        0
0      1        0
0      0        1

All of the rows above may relate to a single row in the data I want to join them to, so I was wondering whether this would be useful:

Blue   Yellow   Red
3      2        1

If so, is there a simple way to do it, like get_dummies()?

Answer 1

Score: 2

With pandas, you can group the column by itself and use groupby.transform('size'):

out = df.groupby('CategoryName')['CategoryName'].transform('size')

If you already calculated the dummies, just map their sum:

dummies = pd.get_dummies(df['CategoryName'])
out = df['CategoryName'].map(dummies.sum())

Output:

0    3
1    3
2    3
3    2
4    2
5    1
Name: CategoryName, dtype: int64
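
As a self-contained check, the two approaches above can be run side by side on the sample data from the question. This is only a sketch: the names freq_groupby and freq_dummies are illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame(
    {"CategoryName": ["Blue", "Blue", "Blue", "Yellow", "Yellow", "Red"]}
)

# Approach 1: size of each group, broadcast back onto the original rows.
freq_groupby = df.groupby("CategoryName")["CategoryName"].transform("size")

# Approach 2: the column sums of the one-hot frame are the per-category counts,
# so mapping each category label through them gives the same encoding.
dummies = pd.get_dummies(df["CategoryName"])
freq_dummies = df["CategoryName"].map(dummies.sum())

print(freq_groupby.tolist())
print(freq_dummies.tolist())
```

Both lists should be identical, so the choice mostly depends on whether the dummy columns are already needed elsewhere.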

Answer 2

Score: 2

You can use the column's value_counts as a mapping:

>>> df['CategoryName'].map(df['CategoryName'].value_counts())
0    3
1    3
2    3
3    2
4    2
5    1
Name: CategoryName, dtype: int64

Note: take care when several values share the same frequency. For example, ['Blue', 'Blue', 'Yellow', 'Yellow', 'Red'] is encoded as [2, 2, 2, 2, 1], so Blue and Yellow can no longer be told apart.
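
The collision described in the note can be reproduced with a few lines. This is a minimal sketch; the variable names (colors, encoded, n_before, n_after) are made up for illustration.

```python
import pandas as pd

# Blue and Yellow both occur twice, so they share a frequency.
colors = pd.Series(["Blue", "Blue", "Yellow", "Yellow", "Red"], name="CategoryName")

# Map every label to the number of times it occurs.
encoded = colors.map(colors.value_counts())

# Count distinct values before and after encoding.
n_before = colors.nunique()
n_after = encoded.nunique()

print(encoded.tolist())
print(n_before, n_after)
```

Three categories collapse into two distinct codes here, which is worth checking for on real data before relying on frequency encoding alone.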

Update

To get a single row:

>>> df.value_counts().to_frame().T
  Blue Yellow Red
0    3      2   1
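
A variant of the single-row result that starts from the column rather than the whole frame might look like the sketch below. The reset_index(drop=True) call is an addition not present in the answer; it is there because the row label produced by to_frame().T differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"CategoryName": ["Blue", "Blue", "Blue", "Yellow", "Yellow", "Red"]}
)

# Count each category, turn the counts into a one-column frame, then transpose
# so that every category becomes a column of a single row.
one_row = (
    df["CategoryName"]
    .value_counts()
    .to_frame()
    .T
    .reset_index(drop=True)  # normalise the row label to 0
)

print(one_row)
```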

Update 2

> The target variable can be influenced by how many applications the orchard has had, not just whether it had an application or not.

That makes sense. In this case, each fertilizer should be treated as a separate variable:

>>> df
  Orchad Fertilizer
0      A         F1
1      A         F1
2      A         F1
3      A         F1
4      A         F2
5      A         F2
6      B         F1
7      B         F1
8      B         F1
9      B         F3

>>> pd.crosstab(df['Orchad'], df['Fertilizer'])
Fertilizer  F1  F2  F3
Orchad                
A            4   2   0
B            3   0   1

# Use normalize={'all'|'index'|'columns'} or StandardScaler
>>> pd.crosstab(df['Orchad'], df['Fertilizer'], normalize='columns')
Fertilizer        F1   F2   F3
Orchad                        
A           0.571429  1.0  0.0
B           0.428571  0.0  1.0
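
Since the question mentions joining the counts to a table with one row per entity, the crosstab result can be attached to such a table with a join on the orchard key. The sketch below assumes a hypothetical orchard-level frame named orchards with a made-up Yield column; neither appears in the original thread. The column name Orchad is spelled as in the answer's example.

```python
import pandas as pd

# Application-level data, as in the example above.
applications = pd.DataFrame(
    {
        "Orchad": ["A"] * 6 + ["B"] * 4,
        "Fertilizer": ["F1"] * 4 + ["F2"] * 2 + ["F1"] * 3 + ["F3"],
    }
)

# Hypothetical table with one row per orchard (values are invented).
orchards = pd.DataFrame({"Orchad": ["A", "B"], "Yield": [10.5, 8.2]})

# One row per orchard, one count column per fertilizer.
counts = pd.crosstab(applications["Orchad"], applications["Fertilizer"])

# Left join on the orchard key; counts is indexed by Orchad.
features = orchards.join(counts, on="Orchad")

print(features)
```

If some orchards have no applications at all, the joined count columns would contain NaN for those rows, and filling them with 0 is probably the intended meaning.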

huangapple
  • Published on April 19, 2023 at 15:13:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76051674.html