Frequency encoding - should I make dummy columns?

Question
I am busy prepping data to train a machine learning model. Many of the features can work well with one-hot encoding. For one feature, the frequency is related to the target variable. I have been reading up a bit about frequency encoding and everything I find has to do with replacing the category with the frequency, like so:
CategoryName
------------
Blue
Blue
Blue
Yellow
Yellow
Red
Turning into:
CategoryName
------------
3
3
3
2
2
1
If I had done one-hot, I would have had:
Blue Yellow Red
1 0 0
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
All of the above may be related to a single line in the data I want to join it to, so I was wondering if this would be useful:
Blue Yellow Red
3 2 1
If so, is there a simple way to do it, like get_dummies()?
Answer 1 (Score: 2)
With pandas you can use a self-groupby with transform('size'):
out = df.groupby('CategoryName')['CategoryName'].transform('size')
If you have already calculated the dummies, just map their sum:
dummies = pd.get_dummies(df['CategoryName'])
out = df['CategoryName'].map(dummies.sum())
Output:
0 3
1 3
2 3
3 2
4 2
5 1
Name: CategoryName, dtype: int64
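Both approaches from this answer can be run end-to-end on the six-row example; the DataFrame construction below is assumed from the question's sample column, not part of the original post:

```python
import pandas as pd

# Reconstruct the example column from the question
df = pd.DataFrame(
    {"CategoryName": ["Blue", "Blue", "Blue", "Yellow", "Yellow", "Red"]}
)

# Approach 1: self-groupby; every row receives the size of its own group
freq = df.groupby("CategoryName")["CategoryName"].transform("size")

# Approach 2: sum the one-hot dummies per column, then map each value
# back to its column total
dummies = pd.get_dummies(df["CategoryName"])
freq_via_dummies = df["CategoryName"].map(dummies.sum())

print(freq.tolist())              # [3, 3, 3, 2, 2, 1]
print(freq_via_dummies.tolist())  # [3, 3, 3, 2, 2, 1]
```

Both give the same per-row frequency series shown in the output above.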
Answer 2 (Score: 2)
You can use value_counts as a mapping dict:
>>> df['CategoryName'].replace(df.value_counts())
0 3
1 3
2 3
3 2
4 2
5 1
Name: CategoryName, dtype: int64
Note: take care if some values have the same frequency: with ['Blue', 'Blue', 'Yellow', 'Yellow', 'Red'], the result will be [2, 2, 2, 2, 1], so the two categories become indistinguishable.
Update
To get only one line:
>>> df.value_counts().to_frame().T
Blue Yellow Red
0 3 2 1
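The same one-row frame can also be built from the dummies themselves, which is closer to the get_dummies() workflow the question mentions; this is a sketch assuming the six-row example (note get_dummies orders columns alphabetically):

```python
import pandas as pd

df = pd.DataFrame(
    {"CategoryName": ["Blue", "Blue", "Blue", "Yellow", "Yellow", "Red"]}
)

# One-hot encode, then collapse to a single row of per-category counts
row = pd.get_dummies(df["CategoryName"]).sum().to_frame().T
print(row)
#    Blue  Red  Yellow
# 0     3    1       2
```

This single-row frame is then easy to join onto another table by key.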
Update 2
> The target variable can be influenced by how many applications the orchard has had, not just whether it had an application or not.
That makes sense, so in this case all fertilizers should be processed as separate variables:
>>> df
Orchad Fertilizer
0 A F1
1 A F1
2 A F1
3 A F1
4 A F2
5 A F2
6 B F1
7 B F1
8 B F1
9 B F3
>>> pd.crosstab(df['Orchad'], df['Fertilizer'])
Fertilizer F1 F2 F3
Orchad
A 4 2 0
B 3 0 1
# Use normalize={'all'|'index'|'columns'} or StandardScaler
>>> pd.crosstab(df['Orchad'], df['Fertilizer'], normalize='columns')
Fertilizer F1 F2 F3
Orchad
A 0.571429 1.0 0.0
B 0.428571 0.0 1.0
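The crosstab step above can be reproduced as follows; the toy DataFrame is reconstructed from the answer's printout (identifiers kept exactly as shown, including 'Orchad'):

```python
import pandas as pd

# Toy data matching the answer's printout
df = pd.DataFrame({
    "Orchad": ["A"] * 6 + ["B"] * 4,
    "Fertilizer": ["F1", "F1", "F1", "F1", "F2", "F2",
                   "F1", "F1", "F1", "F3"],
})

# Raw application counts per orchard/fertilizer pair
counts = pd.crosstab(df["Orchad"], df["Fertilizer"])
print(counts)

# Normalize within each fertilizer column so each column sums to 1
shares = pd.crosstab(df["Orchad"], df["Fertilizer"], normalize="columns")
print(shares)
```

With normalize='columns', each fertilizer column gives the share of its applications that went to each orchard (e.g. F1: 4/7 for A, 3/7 for B), matching the figures above.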