pandas get_dummies on rows with multiple entries
Question
If I have a dataframe like this:
Fruits |
---|
apple, banana, strawberry |
apple |
strawberry, apple |
I am having trouble creating dummy columns for something like this, as it may have multiple fruits in each row. This would be my desired outcome:
apple | banana | strawberry |
---|---|---|
1 | 1 | 1 |
1 | 0 | 0 |
1 | 0 | 1 |
Trying the get_dummies function by itself does not work, since it will create the columns like:
apple,banana,strawberry | apple | strawberry,apple |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
Any help is appreciated, thank you!
Answer 1
Score: 2
Here's one way, using the technique I mentioned in my comment:
import pandas as pd

data = [
    ['apple, banana, strawberry'],
    ['apple'],
    ['strawberry, apple']
]
df = pd.DataFrame(data, columns=['Fruits'])
print(df)

# Split each row once so membership is tested against whole fruit names,
# not against substrings of the raw string
split_rows = [row.split(', ') for row in df['Fruits'].to_list()]

# Collect every distinct fruit; sort so the column order is deterministic
columns = sorted(set().union(*split_rows))

rows = []
for fruits in split_rows:
    rows.append([int(c in fruits) for c in columns])

df = pd.DataFrame(rows, columns=columns)
print(df)
Output:
                      Fruits
0  apple, banana, strawberry
1                      apple
2          strawberry, apple
   apple  banana  strawberry
0      1       1           1
1      1       0           0
2      1       0           1
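One detail worth checking with this set-based approach is how membership is tested. `c in some_string` is a substring test, so a name like "apple" would also match inside "pineapple". The sketch below uses a made-up `pineapple` row (not part of the question's data) to contrast a substring test on the raw cell with a test on the split list; the names `by_substring` and `by_token` are purely illustrative.

```python
import pandas as pd

# Hypothetical data: "pineapple" contains "apple" as a substring.
df = pd.DataFrame({"Fruits": ["apple, banana", "pineapple", "banana"]})

# Split each cell into a list of whole fruit names.
split_rows = [cell.split(", ") for cell in df["Fruits"]]
columns = sorted(set().union(*split_rows))

# Substring test on the raw string: "apple" is found inside "pineapple".
by_substring = pd.DataFrame(
    [[int(c in cell) for c in columns] for cell in df["Fruits"]],
    columns=columns,
)

# Membership test on the split list: only exact names match.
by_token = pd.DataFrame(
    [[int(c in fruits) for c in columns] for fruits in split_rows],
    columns=columns,
)

print(by_substring)
print(by_token)
```

With the question's three fruits both variants agree, since none of those names is contained in another.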
Answer 2
Score: 1
You can do this directly with the string accessor's get_dummies, passing the separator:
dummy_df = df["Fruits"].str.get_dummies(", ")
Output:
   apple  banana  strawberry
0      1       1           1
1      1       0           0
2      1       0           1
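`str.get_dummies` splits on the exact separator string it is given, so rows that use "," without a space would produce differently named columns. A minimal sketch, assuming the spacing around commas may be inconsistent (the third row below is altered from the question's data to show this), normalises the separator first and then joins the indicator columns back onto the original frame:

```python
import pandas as pd

# Third row deliberately uses "," with no space (an assumption for illustration).
df = pd.DataFrame(
    {"Fruits": ["apple, banana, strawberry", "apple", "strawberry,apple"]}
)

# Collapse any whitespace around commas so a single separator covers all rows.
normalised = df["Fruits"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

dummy_df = normalised.str.get_dummies(sep=",")

# Keep the original column next to the indicator columns.
result = df.join(dummy_df)
print(result)
```

If the data is already consistently separated by ", ", the one-liner above is enough on its own.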
Answer 3
Score: 0
You can also do this with the help of the standard library (itertools) to get neat, readable code.
import pandas as pd
from itertools import chain


def create_dummy(df: pd.DataFrame) -> pd.DataFrame:
    # Get the set of all unique fruits (set iteration order is arbitrary)
    fruits_series = df["Fruits"].str.split(", ")
    unique_fruits = set(chain.from_iterable(fruits_series))

    # Create a dummy DataFrame filled with zeros
    tdf = pd.DataFrame(0,
                       columns=list(unique_fruits),
                       index=df.index,
                       dtype=pd.UInt8Dtype())

    # Set the columns named in each row to 1
    for ind, targets in zip(df.index, fruits_series):
        tdf.loc[ind, targets] = 1

    return tdf
Usage:
df = pd.DataFrame(["apple, orange, grape",
                   "apple",
                   "banana, strawberry"], columns=["Fruits"])
print(create_dummy(df))
Output:
   grape  strawberry  orange  apple  banana
0      1           0       1      1       0
1      0           0       0      1       0
2      0           1       0      0       1
Tested on python==3.10.8 and pandas==1.5.2.
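For comparison, the same split-then-mark idea can be written without an explicit loop. This is a different technique from the function above, sketched here on the same sample data: `explode` gives one row per fruit, `pd.get_dummies` encodes each of those rows, and grouping by the original index collapses them back to one row per input row. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    ["apple, orange, grape", "apple", "banana, strawberry"],
    columns=["Fruits"],
)

# One row per (original index, fruit) pair; the index repeats for multi-fruit rows.
exploded = df["Fruits"].str.split(", ").explode()

# Encode each single fruit, then take the max per original row to merge them.
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(dummies)
```

Here the columns come out in sorted order rather than set order, and the result holds plain integers instead of the nullable `UInt8` dtype.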