How to run the `get_dummies()` function on multiple columns for the same category type?

# Question

I have a `features` DataFrame that, let us say, looks like this:

| Symptom_A | Symptom_B |
| --- | --- |
| Itching | Rash |
| Rash | Itching |

When I run the `get_dummies` function on this DataFrame, it creates four columns named `'Symptom_A_Itching'`, `'Symptom_A_Rash'`, `'Symptom_B_Rash'`, and `'Symptom_B_Itching'`. I don't want the values of the two columns to be treated separately, which is what this function does.

Is there any way to perform one-hot encoding on this DataFrame so that the values of both columns are not treated separately?

Basically, I want to get a DataFrame with the columns `'Symptom_Itching'` and `'Symptom_Rash'`.

I tried using the `columns` and `prefix` arguments of the `get_dummies` function, but that did not give the result I wanted. I also tried renaming all the symptom columns to just `'Symptom'` instead of `'Symptom_A'` and `'Symptom_B'`, but that didn't work either.

This is the code I have:

```python
from pandas import DataFrame, get_dummies, read_csv

data_frame: DataFrame = read_csv('dataset.csv')
features: DataFrame = data_frame.iloc[:, 1:]  # every column except the first
features.fillna('')
x: DataFrame = get_dummies(features)
```
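
For reference, the behaviour described above can be reproduced with a small sketch. The toy frame below stands in for `dataset.csv`, and the column names `Symptom_A` and `Symptom_B` are assumed from the column names quoted in the question:

```python
import pandas as pd

# Toy frame mirroring the example table (column names are an assumption).
features = pd.DataFrame({
    "Symptom_A": ["Itching", "Rash"],
    "Symptom_B": ["Rash", "Itching"],
})

# By default, each source column gets its own set of indicator columns,
# named "<column>_<value>".
x = pd.get_dummies(features)
print(list(x.columns))
```

This yields one indicator column per (source column, value) pair, which is why four columns appear instead of two.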

# Answer 1
**Score**: 2

Use [`stack`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html), then [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html) and [`groupby.max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html):

```python
out = (df
   .stack().str.get_dummies()   # one row per cell, one indicator column per value
   .groupby(level=0).max()      # collapse back to one row per original row
 )
```

Or use a trick that gives all output columns the same name, then apply [`groupby.max()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html) on `axis=1`:

```python
# Renaming every column to '' with prefix_sep='' names each dummy after its value.
out = (pd.get_dummies(df.rename(columns=lambda x: ''), prefix_sep='')
         .groupby(level=0, axis=1).max()
       )
```

Output:

```
   Itching  Rash
0        1     1
1        1     1
```
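
As a self-contained sketch of both approaches on the question's toy data: the column names `Symptom_A` and `Symptom_B` are assumed from the question, and `stacked` and `merged` are illustrative names. The second variant passes `prefix=''` instead of renaming the columns, and transposes before grouping as a substitute for `groupby(..., axis=1)`, which pandas deprecated in version 2.1.

```python
import pandas as pd

df = pd.DataFrame({
    "Symptom_A": ["Itching", "Rash"],
    "Symptom_B": ["Rash", "Itching"],
})

# Approach 1: stack all cells into one Series, encode, then collapse per row.
stacked = (
    df.stack()
      .str.get_dummies()
      .groupby(level=0).max()
)

# Approach 2: encode with empty prefixes so columns are named after the values,
# then merge same-named columns. Transposing avoids groupby(axis=1).
merged = (
    pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)
      .T.groupby(level=0).max().T
)

print(stacked)
print(merged)
```

Both results have one `Itching` and one `Rash` column, with a flag set whenever the symptom appears in either source column. To get the `Symptom_` prefix asked for in the question, `add_prefix('Symptom_')` can be chained onto either result.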



# Answer 2
**Score**: 0

You can use [pandas.DataFrame.drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to drop columns. From the documentation:

**pandas.DataFrame.drop**

> Drop specified labels from rows or columns.
>
> Remove rows or columns by specifying label names and corresponding
> axis, or by specifying directly index or column names. When using a
> multi-index, labels on different levels can be removed by specifying
> the level.

For the example given, you can try the following (you will need to adapt this approach to your CSV parsing):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'SymptomA': ['Itching', 'Rash'],
        'SymptomB': ['Rash', 'Itching']
    })
df_onehot = pd.get_dummies(df['SymptomA'])
df = df.drop('SymptomA', axis=1)
df = df.drop('SymptomB', axis=1)
df = df.join(df_onehot)
print(df)

# Output:

#    Itching   Rash
# 0     True  False
# 1    False   True
```
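
The snippet above encodes only `SymptomA`, so a symptom that appears only in `SymptomB` is lost. A minimal sketch of one way to cover both columns, assuming that a symptom in either column should set the flag (the names `a`, `b`, and `combined` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "SymptomA": ["Itching", "Rash"],
    "SymptomB": ["Rash", "Itching"],
})

# Encode each column separately as integers.
a = pd.get_dummies(df["SymptomA"], dtype=int)
b = pd.get_dummies(df["SymptomB"], dtype=int)

# Add the indicators (a symptom missing from one column counts as 0),
# then cap at 1 so a symptom listed in both columns is still a single flag.
combined = a.add(b, fill_value=0).clip(upper=1).astype(int)
print(combined)
```

Adding with `fill_value=0` keeps the sketch working when a value occurs in only one of the two columns.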

  • Published by huangapple on August 5, 2023 at 13:27:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76840259.html