June 29, 2023, 01:46:45

Handling categorical features with varying number of categories

Question

I have a dataset for a classification problem. Some of the features are categorical, and I want to encode them in some way for a basic logistic regression. However, part of my data is in long format. For example:

df1

idx  f1   f2   ...
0    123  123
1    456  456
2    789  789
...

df2

idx  f_cat
0    string1
0    string2
0    string3
1    string1
2    string1
2    string2
2    string4

The secondary dataframe that I want to include has multiple categories within the feature. The number of categories assigned to each index also varies (as many as 16, but the majority have 1-6). I am trying to avoid one-hot encoding because the cardinality is very high (i.e. hundreds of categories). The categories also have no order; their order is completely random, so I cannot just truncate to the first 'n' categories. Any suggestions on how I could encode this categorical feature?

FYI, I am primarily using Python, but I am happy to accept answers based on other languages.

Answer 1

Score: 1

I came across this issue and used one-hot encoding on the top X most frequent categories.

Edit: I found the code I used.

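import numpy as np
import pandas as pd

# Assumes `df` is the DataFrame and `cat_features` lists its categorical columns.
for col in df.columns:
    if col in cat_features:
        # Cast only the categorical columns to str; numeric features stay numeric.
        df[col] = df[col].astype(str)
        unique_vals = df[col].unique()
        if len(unique_vals) <= 10:
            # Low cardinality: one-hot encode every category.
            df_encoded = pd.get_dummies(df[col], prefix=col)
            df = pd.concat([df, df_encoded], axis=1)
        else:
            # High cardinality: indicator columns for the 10 most frequent values only.
            top_10_vals = df[col].value_counts().index[:10]
            for val in top_10_vals:
                col_name = col + '_' + val
                df[col_name] = np.where(df[col] == val, 1, 0)
        # Drop the original column once it has been encoded.
        df = df.drop(col, axis=1)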
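
The snippet above works on one value per row. For the long-format frame in the question, where each idx can carry several categories, the same top-X idea could be applied roughly as follows. This is only a sketch under assumptions: the toy frames copy the question's example, while the helper name top_k_multi_hot and the choice k=2 are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Toy frames mirroring the question's layout (values copied from its example).
df1 = pd.DataFrame({"idx": [0, 1, 2], "f1": [123, 456, 789], "f2": [123, 456, 789]})
df2 = pd.DataFrame({
    "idx": [0, 0, 0, 1, 2, 2, 2],
    "f_cat": ["string1", "string2", "string3", "string1",
              "string1", "string2", "string4"],
})


def top_k_multi_hot(long_df, key, col, k):
    """Hypothetical helper: one row per key, 0/1 flags for the k most frequent categories."""
    top = long_df[col].value_counts().index[:k]
    kept = long_df[long_df[col].isin(top)]
    # crosstab counts (key, category) pairs; clip turns counts into 0/1 flags.
    wide = pd.crosstab(kept[key], kept[col]).clip(upper=1)
    # Guarantee one column per top category, in frequency order.
    wide = wide.reindex(columns=top, fill_value=0)
    return wide.add_prefix(col + "_")


encoded = top_k_multi_hot(df2, key="idx", col="f_cat", k=2)
# Rows whose categories all fall outside the top k get all-zero flags.
encoded = encoded.reindex(df1["idx"], fill_value=0)
features = df1.set_index("idx").join(encoded)
print(features)
```

Each idx ends up with a fixed-width row no matter how many categories it had, so the result can be fed to a logistic regression alongside f1 and f2. Categories outside the top k are simply dropped here; whether that loses too much signal depends on the data.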
