How to run get_dummies() function on multiple columns for the same category type?
Question
I have a `features` DataFrame that (let us say) looks like this:

| Symptom A | Symptom B |
|---|---|
| Itching | Rash |
| Rash | Itching |

When I run the `get_dummies` function on this DataFrame, it creates four columns named `'Symptom_A_Itching'`, `'Symptom_A_Rash'`, `'Symptom_B_Rash'` and `'Symptom_B_Itching'`. I don't want the values of the two columns to be treated separately, as this function does.

Is there any way to perform one-hot encoding on this DataFrame so that the values of both columns are not treated separately? Basically, I want to get a DataFrame with the columns `'Symptom_Itching'` and `'Symptom_Rash'`.

I tried using the `columns` and `prefix` arguments of `get_dummies`, but that did not produce the result I wanted. I also tried setting all the Symptom column names to just `'Symptom'` instead of `'Symptom_A'` and `'Symptom_B'`, but that also didn't work.

This is the code I have:

    data_frame: DataFrame = read_csv('dataset.csv')
    features: DataFrame = data_frame.iloc[:, 1:]
    features.fillna('')
    x: DataFrame = get_dummies(features)
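
The behaviour described in the question can be reproduced with a small in-memory frame. This is only a sketch: the column names `Symptom_A` and `Symptom_B` are assumed from the dummy column names quoted above, and the two rows stand in for the contents of `dataset.csv`, which is not shown.

```python
import pandas as pd

# Stand-in for the CSV data (assumed column names and values).
features = pd.DataFrame({
    "Symptom_A": ["Itching", "Rash"],
    "Symptom_B": ["Rash", "Itching"],
})

# get_dummies encodes each column on its own, so every
# (column, value) pair becomes a separate dummy column.
x = pd.get_dummies(features)
dummy_columns = list(x.columns)
print(dummy_columns)
```

Each source column keeps its own prefix, which is why the same symptom ends up in two different output columns.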
# Answer 1
**Score**: 2
[`stack`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html), then [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html) and [`groupby.max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html):

```python
out = (df
       .stack().str.get_dummies()   # one row of dummies per (row, column) cell
       .groupby(level=0).max()      # merge the cells of each original row
      )
```

Or use a trick to give all output columns the same name, then apply [`groupby.max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html) on `axis=1`:

```python
# Note: the axis=1 argument of groupby is deprecated since pandas 2.1;
# on newer versions, transpose, group on level=0, and transpose back instead.
out = (pd.get_dummies(df.rename(columns=lambda x: ''), prefix_sep='')
       .groupby(level=0, axis=1).max()
      )
```

Output:

```
   Itching  Rash
0        1     1
1        1     1
```
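
The `stack`-based approach can be sketched end to end as follows. The helper name `combined_dummies`, the `Symptom_` prefix (added with `add_prefix` to match the column names requested in the question), and the third data row are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Example data; the third row is hypothetical and added only to show
# a case where one symptom is absent.
df = pd.DataFrame({
    "Symptom_A": ["Itching", "Rash", "Itching"],
    "Symptom_B": ["Rash", "Itching", "Itching"],
})


def combined_dummies(frame: pd.DataFrame, prefix: str = "Symptom_") -> pd.DataFrame:
    """One-hot encode the values of all columns into one shared set of columns."""
    # Reshape to a long Series indexed by (row label, column name).
    stacked = frame.stack()
    # One indicator column per distinct value, regardless of source column.
    dummies = stacked.str.get_dummies()
    # Collapse back to one row per original row: 1 if any column had the value.
    merged = dummies.groupby(level=0).max()
    return merged.add_prefix(prefix)


out = combined_dummies(df)
print(out)
```

Because the source column name is dropped before encoding, a value that appears in either column sets the same indicator, so the result has one column per symptom rather than one per column and symptom pair.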
# Answer 2
**Score**: 0
You can use [pandas.DataFrame.drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to drop columns. From the documentation:

**pandas.DataFrame.drop**

> Drop specified labels from rows or columns.
>
> Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

For the example given, you can try the following (you will need to adapt this approach to your CSV parsing):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'SymptomA': ['Itching', 'Rash'],
        'SymptomB': ['Rash', 'Itching']
    })

# One-hot encode the values of SymptomA only
df_onehot = pd.get_dummies(df['SymptomA'])
# Drop the original symptom columns, then attach the dummy columns
df = df.drop('SymptomA', axis=1)
df = df.drop('SymptomB', axis=1)
df = df.join(df_onehot)
print(df)

# Output:
#    Itching   Rash
# 0     True  False
# 1    False   True
```