pandas get_dummies on rows with multiple entries

Question

If I have a dataframe like this:

    Fruits
    apple, banana, strawberry
    apple
    strawberry, apple

I am having trouble creating dummy columns for this, because each row may contain multiple fruits. This is my desired outcome:

    apple  banana  strawberry
        1       1           1
        1       0           0
        1       0           1

Using the get_dummies function by itself does not work, because it creates one column per distinct cell value:

    apple,banana,strawberry  apple  strawberry,apple
                          1      0                 0
                          0      1                 0
                          0      0                 1
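
For reference, the behavior described above can be reproduced with a short sketch like the following. The frame is rebuilt from the sample data in this question, and the variable name `naive` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"Fruits": ["apple, banana, strawberry", "apple", "strawberry, apple"]}
)

# pd.get_dummies treats each whole string as a single category,
# so the result has one column per distinct cell value.
naive = pd.get_dummies(df["Fruits"])
print(naive.columns.tolist())
```

The three resulting columns are the three full strings, not the individual fruit names, which is why the strings need to be split first.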

Any help is appreciated, thank you!

Answer 1

Score: 2

Here's one way, using the technique I mentioned in my comment:

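    import pandas as pd

    data = [
        ['apple, banana, strawberry'],
        ['apple'],
        ['strawberry, apple'],
    ]
    df = pd.DataFrame(data, columns=['Fruits'])
    print(df)

    # Collect every distinct fruit that appears in any row.
    columns = set()
    for row in df['Fruits'].to_list():
        columns |= set(row.split(', '))
    # Sort so the column order is deterministic (set order is not).
    columns = sorted(columns)

    # Build one 0/1 row per record. Test membership against the split items,
    # not the raw string, so 'apple' does not match inside 'pineapple'.
    rows = []
    for row in df['Fruits'].to_list():
        items = set(row.split(', '))
        rows.append([int(c in items) for c in columns])

    df = pd.DataFrame(rows, columns=columns)
    print(df)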

Output:

                          Fruits
    0  apple, banana, strawberry
    1                      apple
    2          strawberry, apple
       apple  banana  strawberry
    0      1       1           1
    1      1       0           0
    2      1       0           1
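
The same idea (split each row, then mark which fruits it contains) can also be written without explicit loops. This is a different technique from the one in this answer, sketched here as an alternative: split and explode the column so each fruit gets its own row, one-hot encode that, then collapse back to one row per original index. The names `exploded` and `dummies` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Fruits": ["apple, banana, strawberry", "apple", "strawberry, apple"]}
)

# One row per (original index, fruit) pair; the original index is repeated.
exploded = df["Fruits"].str.split(", ").explode()

# One-hot encode the single fruits, then take the max per original row
# so each fruit column is 1 if it appeared at least once in that row.
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(dummies)
```

Taking the maximum rather than the sum keeps the values at 0 or 1 even if a fruit is listed twice in the same row.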

Answer 2

Score: 1

You can do this directly with the string accessor's get_dummies method, passing the separator used in the column:

    dummy_df = df["Fruits"].str.get_dummies(", ")

Output:

       apple  banana  strawberry
    0      1       1           1
    1      1       0           0
    2      1       0           1
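
If the real data is less tidy than the sample, the fixed ", " separator may miss some entries. Below is a minimal sketch, assuming the only inconsistency is the whitespace around the commas (the last row here is deliberately written without a space). It normalises the separators first and then joins the indicator columns back onto the original frame. The names `cleaned` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Fruits": ["apple, banana, strawberry", "apple", "strawberry,apple"]}
)

# Collapse any whitespace around commas so "a, b" and "a,b" look the same,
# and trim stray whitespace at the ends of each cell.
cleaned = df["Fruits"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

dummy_df = cleaned.str.get_dummies(sep=",")

# Keep the original column next to the indicator columns.
result = df.join(dummy_df)
print(result)
```

Joining on the index works here because str.get_dummies returns a frame with the same index as the input Series.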

Answer 3

Score: 0

You can also do this with help from the standard library (itertools.chain), which keeps the code tidy.

    from itertools import chain

    import pandas as pd


    def create_dummy(df: pd.DataFrame) -> pd.DataFrame:
        # Split each row into a list of fruits and collect the unique names.
        fruits_series = df["Fruits"].str.split(", ")
        unique_fruits = set(chain.from_iterable(fruits_series))

        # Start from an all-zero frame with one column per fruit.
        # Column order follows set iteration order, so it can vary between runs.
        tdf = pd.DataFrame(0,
                           columns=list(unique_fruits),
                           index=df.index,
                           dtype=pd.UInt8Dtype())  # pass an instance, not the class

        # Set the columns named in each row to 1.
        for ind, targets in zip(df.index, fruits_series):
            tdf.loc[ind, targets] = 1
        return tdf

Usage:

    df = pd.DataFrame(["apple, orange, grape",
                       "apple",
                       "banana, strawberry"], columns=["Fruits"])
    print(create_dummy(df))

Output (the column order comes from a set, so it may differ between runs):

       grape  strawberry  orange  apple  banana
    0      1           0       1      1       0
    1      0           0       0      1       0
    2      0           1       0      0       1

Tested on python==3.10.8 and pandas==1.5.2.

Posted by huangapple on July 28, 2023, 06:24:27
Source: https://go.coder-hub.com/76783764.html