pandas get_dummies on rows with multiple entries

Question

If I have a dataframe like this:

Fruits
apple, banana, strawberry
apple
strawberry, apple

I am having trouble creating dummy columns for this, since each row may contain multiple fruits. This is my desired outcome:

apple banana strawberry
1 1 1
1 0 0
1 0 1

Using the get_dummies function by itself does not work, since it treats each whole string as a single category and creates columns like:

apple,banana,strawberry apple strawberry,apple
1 0 0
0 1 0
0 0 1

Any help is appreciated, thank you!

Answer 1

Score: 2

Here's one way, using the technique I mentioned in my comment:

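import pandas as pd

data = [
    ['apple, banana, strawberry'],
    ['apple'],
    ['strawberry, apple']
]

df = pd.DataFrame(data, columns=['Fruits'])
print(df)

# Collect every distinct fruit across all rows.
columns = set()
for row in df['Fruits'].to_list():
    columns |= set(row.split(', '))

# Sort so the column order is deterministic (set order is arbitrary).
columns = sorted(columns)

# Build one 0/1 row per input row. Split first so the test is an exact
# match rather than a substring match (e.g. 'apple' in 'pineapple').
rows = []
for row in df['Fruits'].to_list():
    fruits = row.split(', ')
    rows.append([int(c in fruits) for c in columns])

df = pd.DataFrame(rows, columns=columns)
print(df)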

Output:

                      Fruits
0  apple, banana, strawberry
1                      apple
2          strawberry, apple
   apple  banana  strawberry
0      1       1           1
1      1       0           0
2      1       0           1
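
One thing to watch with this approach: a membership test against the raw string is a substring check, so "apple" would also match a row containing "pineapple". The sketch below uses made-up data (the "pineapple" row is not from the question) and tests against a set of exact tokens instead.

```python
import pandas as pd

# Hypothetical data: "pineapple" contains "apple" as a substring.
df = pd.DataFrame({"Fruits": ["pineapple, banana", "apple"]})

# Split each row once into a set of exact tokens.
tokens = [set(row.split(", ")) for row in df["Fruits"]]

# Sort the union of all tokens so the column order is deterministic.
columns = sorted(set().union(*tokens))

# Membership is tested against the token set, not the raw string.
rows = [[int(c in t) for c in columns] for t in tokens]
dummies = pd.DataFrame(rows, columns=columns, index=df.index)
print(dummies)
```

Here the first row gets a 0 in the apple column, which a plain substring test on the raw string would have marked as 1.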

Answer 2

Score: 1

You can do this directly with the string accessor's get_dummies, passing the separator:

dummy_df = df["Fruits"].str.get_dummies(", ")

Output:

  apple banana strawberry
0     1      1          1
1     1      0          0
2     1      0          1
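
If the separators in real data are less uniform than in the question, a fixed ", " separator can leave stray spaces in the column names. A possible sketch, assuming messy input that is not part of the original question, normalizes the separators first and then joins the dummy columns back onto the original frame:

```python
import pandas as pd

# Hypothetical messy input: inconsistent spacing around the commas.
df = pd.DataFrame(
    {"Fruits": ["apple,banana ,  strawberry", " apple", "strawberry, apple"]}
)

# Trim the ends, then collapse any whitespace around each comma.
cleaned = (
    df["Fruits"]
    .str.strip()
    .str.replace(r"\s*,\s*", ",", regex=True)
)

# One 0/1 column per distinct token.
dummy_df = cleaned.str.get_dummies(sep=",")

# Keep the original column next to the indicator columns.
result = df.join(dummy_df)
print(result)
```

Joining on the index works here because str.get_dummies preserves the index of the Series it was called on.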

Answer 3

Score: 0

You can also do this with the help of the standard library, which keeps the code neat:

from itertools import chain

import pandas as pd

def create_dummy(df: pd.DataFrame) -> pd.DataFrame:
    # Get the set of all unique fruits
    fruits_series = df["Fruits"].str.split(", ")
    unique_fruits = set(chain.from_iterable(fruits_series))

    # Create a dummy DataFrame filled with zeros
    # (pass a dtype instance, pd.UInt8Dtype(), rather than the class)
    tdf = pd.DataFrame(0,
                       columns=list(unique_fruits),
                       index=df.index,
                       dtype=pd.UInt8Dtype())

    # Set the columns for the fruits present in each row to 1
    for ind, targets in zip(df.index, fruits_series):
        tdf.loc[ind, targets] = 1

    return tdf

Usage:

df = pd.DataFrame(["apple, orange, grape", 
                   "apple", 
                   "banana, strawberry"], columns=["Fruits"])
print(create_dummy(df))

Output (column order may vary between runs, since the columns come from a set):

   grape  strawberry  orange  apple  banana
0      1           0       1      1       0
1      0           0       0      1       0
2      0           1       0      0       1

Tested on python==3.10.8 and pandas==1.5.2.
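
For larger frames, the row-by-row .loc assignment can be avoided. The following is a loop-free sketch, not part of the answer above and only one of several options: it explodes the split lists into one row per fruit, one-hot encodes them, and collapses back to one row per original index.

```python
import pandas as pd

df = pd.DataFrame(
    ["apple, orange, grape", "apple", "banana, strawberry"],
    columns=["Fruits"],
)

# One row per (original index, fruit) pair.
exploded = df["Fruits"].str.split(", ").explode()

# One-hot encode each fruit, then take the max per original row so a
# fruit present anywhere in that row yields 1.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(dummies)
```

With this variant the columns come out in sorted order, so the result is deterministic across runs.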

huangapple
  • Posted on July 28, 2023, 06:24:27
  • Please keep this link when reposting: https://go.coder-hub.com/76783764.html