August 5, 2023, 13:27:38


How to run get_dummies() function on multiple columns for the same category type?

Question


I have a `features` DataFrame that (let us say) looks like this:

| Symptom_A | Symptom_B |
|-----------|-----------|
| Itching   | Rash      |
| Rash      | Itching   |

When I run the `get_dummies` function on this DataFrame, it creates four columns named 'Symptom_A_Itching', 'Symptom_A_Rash', 'Symptom_B_Rash' and 'Symptom_B_Itching'. I don't want the values of the two columns to be treated separately, as this function does.

Is there any way to perform one-hot encoding on this DataFrame so that the values of both columns are not treated separately?

Basically, I want to get a DataFrame with the columns 'Symptom_Itching' and 'Symptom_Rash'.

I tried using the `columns` and `prefix` arguments of `get_dummies`, but that did not give the desired result. I also tried renaming all the Symptom columns to just 'Symptom' instead of 'Symptom_A' and 'Symptom_B', but that didn't work either.

This is the code I have:

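    from pandas import DataFrame, get_dummies, read_csv

    data_frame: DataFrame = read_csv('dataset.csv')
    features: DataFrame = data_frame.iloc[:, 1:]
    features = features.fillna('')  # fillna returns a new DataFrame, so keep the result
    x: DataFrame = get_dummies(features)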
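
The behaviour described in the question can be reproduced with a small frame. The sketch below is illustrative only: the frame is a hypothetical stand-in for the contents of `dataset.csv`, which the question does not show. It also suggests why the `prefix` argument alone may not help: a shared prefix renames the dummy columns but still leaves one set of columns per source column.

```python
import pandas as pd

# Hypothetical stand-in for the `features` frame read from dataset.csv.
features = pd.DataFrame({
    'Symptom_A': ['Itching', 'Rash'],
    'Symptom_B': ['Rash', 'Itching'],
})

# By default, get_dummies prefixes every dummy column with the name of
# the column it came from, so each source column is encoded on its own.
x = pd.get_dummies(features)
print(list(x.columns))

# A single shared prefix changes the names only; the dummies from the
# two source columns are still separate (now duplicate-named) columns.
prefixed = pd.get_dummies(features, prefix='Symptom')
print(list(prefixed.columns))
```

Merging the duplicate-named columns is the step that is still missing after this, which is what the answers address.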

Answer 1

Score: 2

[`stack`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html) the columns, then use [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html) and [`groupby.max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html):

```python
out = (df
   .stack().str.get_dummies()
   .groupby(level=0).max()
 )
```

Or use a trick to give all output columns the same name, then take [`groupby.max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html) across the columns. Grouping with `axis=1` is deprecated since pandas 2.1, so the frame is transposed, grouped and transposed back:

```python
out = (pd.get_dummies(df.rename(columns=lambda x: ''), prefix_sep='')
         .T.groupby(level=0).max().T
       )
```

Output:

```
   Itching  Rash
0        1     1
1        1     1
```

With pandas 2.0 or later, `pd.get_dummies` returns booleans by default, so the second variant shows `True`/`False` instead of `1`/`0`.
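
The question asked for columns named 'Symptom_Itching' and 'Symptom_Rash', while the `stack` approach yields bare value names. A minimal sketch of how the prefix might be added follows; the sample frame, including the third row and the 'Fever' value, is a hypothetical example and not data from the question. Note that `Series.str.get_dummies` splits on `'|'` by default, so this sketch assumes no symptom value contains that character.

```python
import pandas as pd

# Hypothetical sample data; the third row is added for illustration.
df = pd.DataFrame({
    'Symptom_A': ['Itching', 'Rash', 'Fever'],
    'Symptom_B': ['Rash', 'Itching', 'Itching'],
})

out = (
    df.stack()                 # long Series indexed by (row, source column)
      .str.get_dummies()       # one 0/1 indicator column per distinct value
      .groupby(level=0).max()  # collapse back to one row per original row
      .add_prefix('Symptom_')  # Itching -> Symptom_Itching, and so on
)
print(out)
```

Taking the maximum per original row marks a symptom as present if it appears in any of that row's columns.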

Answer 2

Score: 0

You can use [pandas.DataFrame.drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to drop columns. From the documentation:

**pandas.DataFrame.drop**

> Drop specified labels from rows or columns.
>
> Remove rows or columns by specifying label names and corresponding
> axis, or by specifying directly index or column names. When using a
> multi-index, labels on different levels can be removed by specifying
> the level.

For the example given, you can try the following (you will need to adapt this approach to your CSV parsing):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'SymptomA': ['Itching', 'Rash'],
        'SymptomB': ['Rash', 'Itching']
    })
df_onehot = pd.get_dummies(df['SymptomA'])  # encodes SymptomA only
df = df.drop('SymptomA', axis=1)
df = df.drop('SymptomB', axis=1)
df = df.join(df_onehot)
print(df)
# Output:
#    Itching   Rash
# 0     True  False
# 1    False   True
```

Note that this is only an example; you will need to adapt it to your specific situation. As written, it one-hot encodes `SymptomA` only and discards `SymptomB` without encoding it.
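
The snippet in Answer 2 encodes a single column. One way the same per-column idea might be extended to cover every symptom column is sketched below: encode each column separately, align the results to a common set of value columns, and combine them with a logical OR. This is an assumption about how the approach could be adapted, not part of the original answer, and the third row with 'Fever' is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data; the third row is added for illustration.
df = pd.DataFrame({
    'SymptomA': ['Itching', 'Rash', 'Itching'],
    'SymptomB': ['Rash', 'Itching', 'Fever'],
})

# One-hot encode each column on its own, forcing boolean output.
encoded = [pd.get_dummies(df[col], dtype=bool) for col in df.columns]

# Union of all values seen in any column, in a stable sorted order.
all_values = sorted(set().union(*(e.columns for e in encoded)))

# Align every encoded frame to the same columns, then OR them together
# so a value counts as present if it occurs in any source column.
combined = pd.DataFrame(False, index=df.index, columns=all_values)
for e in encoded:
    aligned = e.reindex(columns=all_values, fill_value=False).astype(bool)
    combined = combined | aligned

df_onehot = combined.add_prefix('Symptom_')
print(df_onehot)
```

The explicit alignment step matters because a value such as 'Fever' may occur in only one column, in which case the other column's encoding has no matching dummy column.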
