Handling categorical features with a varying number of categories
Question
I have a dataset for a classification problem. Some of the features are categorical, and I want to encode them for a basic logistic regression. However, part of my data is in long format. For example:
df

    idx  f1   f2   ...
    0    123  123
    1    456  456
    2    789  789
    ...

df2

    idx  f_cat
    0    string1
    0    string2
    0    string3
    1    string1
    2    string1
    2    string2
    2    string4
The secondary dataframe that I want to include has multiple categories per index within a single feature. The number of categories assigned to each index also varies (as many as 16, but most have between 1 and 6). I am trying to avoid one-hot encoding because the cardinality is very high (i.e. hundreds of distinct categories). The categories have no inherent order, and the order in which they appear is completely random, so I cannot just truncate to the first 'n' categories. Any suggestions on how I could encode this categorical feature?
FYI, I primarily use Python, but I am happy to accept answers in other languages.
Answer 1
Score: 1
I came across this issue and used one-hot encoding on the top X most frequent categories.
Edit: I found the code I used.
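    import numpy as np
    import pandas as pd

    # cat_features: list of the categorical column names in df
    for col in list(df.columns):
        if col in cat_features:
            # Cast to str so values can be concatenated into column names
            # (only categorical columns, so numeric features stay numeric)
            df[col] = df[col].astype(str)
            unique_vals = df[col].unique()
            if len(unique_vals) <= 10:
                # Low cardinality: one-hot encode every category
                df_encoded = pd.get_dummies(df[col], prefix=col)
                df = pd.concat([df, df_encoded], axis=1)
            else:
                # High cardinality: indicators for the 10 most frequent values only
                top_10_vals = df[col].value_counts().index[:10]
                for val in top_10_vals:
                    col_name = col + '_' + val
                    df[col_name] = np.where(df[col] == val, 1, 0)
            # Drop the original column once it has been encoded
            df = df.drop(col, axis=1)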
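The code above works on columns of a wide dataframe, whereas the question's df2 is in long format with several categories per idx. As a rough sketch only (not the answerer's code), the same top-N idea could be applied to long-format data by keeping the most frequent categories and pivoting them into 0/1 indicator columns. The toy df2 mirrors the question's example; `TOP_N`, `kept`, and `multi_hot` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy long-format frame mirroring df2 from the question
df2 = pd.DataFrame({
    "idx": [0, 0, 0, 1, 2, 2, 2],
    "f_cat": ["string1", "string2", "string3", "string1",
              "string1", "string2", "string4"],
})

TOP_N = 2  # hypothetical cutoff for this toy data; the code above uses 10

# Most frequent categories across all rows
top_vals = df2["f_cat"].value_counts().index[:TOP_N]

# Keep only rows whose category is in the top N, then pivot to one
# indicator column per category (several can be 1 for the same idx)
kept = df2[df2["f_cat"].isin(top_vals)]
multi_hot = pd.crosstab(kept["idx"], kept["f_cat"]).clip(upper=1)

# Give idx values that have no top-N category an all-zero row
all_idx = pd.Index(df2["idx"].unique(), name="idx")
multi_hot = multi_hot.reindex(all_idx, fill_value=0).add_prefix("f_cat_")

print(multi_hot)
```

Assuming the main dataframe has a matching idx column, the result could then be attached with something like `df.join(multi_hot, on="idx")`. Categories outside the top N are simply dropped here; whether that loses too much signal depends on the data.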