2023-07-28 06:24:27

pandas get_dummies on rows with multiple entries

Question

If I have a dataframe like this:

Fruits
apple, banana, strawberry
apple
strawberry, apple

I am having trouble creating dummy columns for this, because each row may contain multiple fruits. This is my desired outcome:

apple	banana	strawberry
1	1	1
1	0	0
1	0	1

Calling get_dummies on the column by itself does not work, because it treats each whole string as a single category and creates columns like:

apple, banana, strawberry	apple	strawberry, apple
1	0	0
0	1	0
0	0	1
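
The behaviour described in the question can be reproduced with a short sketch. It assumes the column holds plain comma-separated strings, as in the example; the variable name `whole` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Fruits": ["apple, banana, strawberry", "apple", "strawberry, apple"]}
)

# pd.get_dummies sees each complete string as one category,
# so it yields one column per distinct string, not per fruit.
whole = pd.get_dummies(df["Fruits"])
print(whole.columns.tolist())
```

Each of the three rows ends up with its own column, which is why the strings need to be split before the indicators are built.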

Any help is appreciated, thank you!

Answer 1

Score: 2

Here's one way: collect the set of all fruit names, then build one row of 0/1 indicators for each original row:

import pandas as pd

data = [
    ['apple, banana, strawberry'],
    ['apple'],
    ['strawberry, apple']
]
df = pd.DataFrame(data, columns=['Fruits'])
print(df)

# Split each row once so membership tests compare whole names, not substrings
# (with a plain string, 'apple' in 'pineapple' would be True).
split_rows = [row.split(', ') for row in df['Fruits'].to_list()]

# Collect every distinct fruit; sorting makes the column order deterministic.
columns = sorted(set().union(*split_rows))

# Build one row of 0/1 indicators per original row.
rows = [[int(c in fruits) for c in columns] for fruits in split_rows]
df = pd.DataFrame(rows, columns=columns)
print(df)

Output:

                      Fruits
0  apple, banana, strawberry
1                      apple
2          strawberry, apple
   apple  banana  strawberry
0      1       1           1
1      1       0           0
2      1       0           1
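
The same split-then-mark idea can also be written without explicit loops. The following is a minimal sketch of a different technique (`explode` followed by `pd.get_dummies` and a group-wise maximum), not the method from this answer; it assumes the separator is always `", "`.

```python
import pandas as pd

df = pd.DataFrame(
    {"Fruits": ["apple, banana, strawberry", "apple", "strawberry, apple"]}
)

# One row per (original index, fruit) pair; the original index is repeated.
exploded = df["Fruits"].str.split(", ").explode()

# One indicator row per pair, then collapse back to one row per original index.
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(dummies)
```

Taking the maximum per group, not the sum, keeps the values at 0/1 even if a fruit is listed twice in the same row.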

Answer 2

Score: 1

You can do this directly with the string accessor's get_dummies, passing the separator:

dummy_df = df["Fruits"].str.get_dummies(", ")

Output:

  apple banana strawberry
0     1      1          1
1     1      0          0
2     1      0          1
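
If the spacing around the commas is not consistent, a literal `", "` separator may leave names that still carry stray spaces. A minimal sketch, assuming the only irregularity is whitespace around the commas: normalize the separator first, then join the indicators back onto the original frame. The messy input strings and the names `normalized`, `dummies`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Fruits": ["apple, banana,strawberry", "apple", "strawberry ,  apple"]}
)

# Collapse "a , b" and "a,b" to a bare comma so every name is clean.
normalized = df["Fruits"].str.strip().str.replace(r"\s*,\s*", ",", regex=True)
dummies = normalized.str.get_dummies(sep=",")

# Keep the original column next to the indicator columns.
result = df.join(dummies)
print(result)
```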

Answer 3

Score: 0

You can also do this with help from the standard library (itertools) to keep the code neat.

import pandas as pd
from itertools import chain

def create_dummy(df: pd.DataFrame) -> pd.DataFrame:
    # Split each row into a list of fruit names
    fruits_series = df["Fruits"].str.split(", ")
    # Collect every distinct fruit across all rows
    unique_fruits = set(chain.from_iterable(fruits_series))
    # Create an all-zero DataFrame with one column per fruit
    # (the dtype is passed as an instance, not the bare class)
    tdf = pd.DataFrame(0,
                       columns=list(unique_fruits),
                       index=df.index,
                       dtype=pd.UInt8Dtype())
    # Set the columns named in each row to 1
    for ind, targets in zip(df.index, fruits_series):
        tdf.loc[ind, targets] = 1
    return tdf

Usage:

df = pd.DataFrame(["apple, orange, grape",
                   "apple",
                   "banana, strawberry"], columns=["Fruits"])
print(create_dummy(df))

Output (the column order can vary between runs because it comes from a set):

   grape  strawberry  orange  apple  banana
0      1           0       1      1       0
1      0           0       0      1       0
2      0           1       0      0       1

Tested on python==3.10.8 and pandas==1.5.2.
