2023-05-26 00:02:39

Pandas: Apply function to each group and store result in new column

Question

You can try the following code to achieve what you need:

import pandas as pd
import numpy as np

item_df = pd.DataFrame({'BarCode': ['12345678AAAA', '12345678BBBB', '12345678CCCC',
                                     '12345678ABCD', '12345678EFGH', '12345678IJKL',
                                     '67890123XXXX', '67890123YYYY', '67890123ZZZZ',
                                     '67890123ABCD', '67890123EFGH', '67890123IJKL'],
                        'Extracted_Code': ['12345678','12345678', '12345678','12345678','12345678','12345678',
                                           '67890123','67890123', '67890123','67890123','67890123','67890123'],
                        'Description': ['Fruits', 'Fruits', 'Fruits', 'Apples', 'Oranges', 'Mangoes',
                                        'Snacks', 'Snacks', 'Snacks', 'Yoghurt', 'Cookies', 'Oats'],
                        'Category': ['H', 'H', 'H', 'M', 'T', 'S', 'H', 'H', 'H', 'M', 'M', 'F'],
                        'Code': ['0', '2', '3', '1', '2', '4', '0', '2', '3', '3', '4', '2'],
                        'Quantity': [99, 77, 10, 52, 11, 90, 99, 77, 10, 52, 11, 90],
                        'Price': [12.0, 10.5, 11.0, 15.6, 12.9, 67.0, 12.0, 10.5, 11.0, 15.6, 12.9, 67.0]})
item_df = item_df.sort_values(by=['Extracted_Code', 'Category', 'Code'])

# Columns that go into each record of the Combined list
record_cols = ['BarCode', 'Description', 'Category', 'Code', 'Quantity', 'Price']

def create_combined(row, group):
    # Rows with Category 'H' get NaN
    if row['Category'] == 'H':
        return np.nan
    # Other rows collect the 'H' rows of the same group whose Code is <= this row's Code.
    # Code holds single-digit strings here, so string comparison agrees with numeric order.
    group_h = group[(group['Category'] == 'H') & (group['Code'] <= row['Code'])]
    return group_h[record_cols].to_dict('records')

# Build the column group by group. Each piece keeps the original row labels,
# so the assignment aligns by index even though item_df has been sorted.
item_df['Combined'] = pd.concat(
    [g.apply(create_combined, axis=1, group=g) for _, g in item_df.groupby('Extracted_Code')]
)
print(item_df)

This code applies the condition within each group and creates the Combined column.

I have an item dataframe such as:

import pandas as pd
import numpy as np

item_df = pd.DataFrame({'BarCode': ['12345678AAAA', '12345678BBBB', '12345678CCCC',
                                     '12345678ABCD', '12345678EFGH', '12345678IJKL',
                                     '67890123XXXX', '67890123YYYY', '67890123ZZZZ',
                                     '67890123ABCD', '67890123EFGH', '67890123IJKL'],
                        'Extracted_Code': ['12345678','12345678', '12345678','12345678','12345678','12345678',
                                           '67890123','67890123', '67890123','67890123', '67890123','67890123'],
                        'Description': ['Fruits', 'Fruits', 'Fruits', 'Apples', 'Oranges', 'Mangoes',
                                        'Snacks', 'Snacks', 'Snacks', 'Yoghurt', 'Cookies', 'Oats'],
                        'Category': ['H', 'H', 'H', 'M', 'T', 'S', 'H', 'H', 'H', 'M', 'M', 'F'],
                        'Code': ['0', '2', '3', '1', '2', '4', '0', '2', '3', '3', '4', '2'],
                        'Quantity': [99, 77, 10, 52, 11, 90, 99, 77, 10, 52, 11, 90],
                        'Price': [12.0, 10.5, 11.0, 15.6, 12.9, 67.0, 12.0, 10.5, 11.0, 15.6, 12.9, 67.0]})
item_df = item_df.sort_values(by=['Extracted_Code', 'Category', 'Code'])
item_df['Combined'] = np.nan

What I am trying to achieve is a bit tricky. I need to group by Extracted_Code and, for each group, fill a new column Combined according to these rules:

For rows with Category='H', Combined is NaN.
For rows with any other Category (say Category='M'), Combined holds a list of row records (dicts), one for each row in the same group that has Category='H' and a Code less than or equal to the Code of that row.
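One detail of the rule above is worth checking: Code is stored as strings in the sample data, so `<=` compares text rather than numbers. That happens to agree with numeric order for single-digit codes, but it would not for multi-digit ones. A minimal sketch, assuming the codes are always numeric strings (the question does not say so, and the value '10' below is made up for illustration):

```python
import pandas as pd

# Hypothetical codes; '10' is not in the question's data
codes = pd.Series(['0', '2', '3', '10'])

# Text comparison: '10' sorts before '2' because '1' < '2'
as_text = (codes <= '2').tolist()

# Numeric comparison after converting the strings to integers
as_number = (pd.to_numeric(codes) <= 2).tolist()

print(as_text)    # [True, True, False, True]
print(as_number)  # [True, True, False, False]
```

If multi-digit codes can occur, converting the column with `pd.to_numeric` before comparing is probably the safer choice.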

My desired result is:

  BarCode        Extracted_Code   Description   Category   Code    Quantity   Price    Combined
0 12345678AAAA   12345678         Fruits        H          0       99         12.0     NaN
1 12345678BBBB   12345678         Fruits        H          2       77         10.5     NaN
2 12345678CCCC   12345678         Fruits        H          3       10         11.0     NaN
3 12345678ABCD   12345678         Apples        M          1       52         15.6     [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': '0', 'Quantity': 99, 'Price': 12.0}]
4 12345678IJKL   12345678         Mangoes       S          4       90         67.0     [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': '0', 'Quantity': 99, 'Price': 12.0},
                                                                                        {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': '2', 'Quantity': 77, 'Price': 10.5},
                                                                                        {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': '3', 'Quantity': 10, 'Price': 11.0}]
5 12345678EFGH   12345678         Oranges       T          2       11         12.9     [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': '0', 'Quantity': 99, 'Price': 12.0},
                                                                                        {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': '2', 'Quantity': 77, 'Price': 10.5}]
6 67890123IJKL   67890123         Oats          F          2       90         67.0     [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': '0', 'Quantity': 99, 'Price': 12.0},
                                                                                        {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': '2', 'Quantity': 77, 'Price': 10.5}]
7 67890123XXXX   67890123         Snacks        H          0       99         12.0     NaN
8 67890123YYYY   67890123         Snacks        H          2       77         10.5     NaN
9 67890123ZZZZ   67890123         Snacks        H          3       10         11.0     NaN
10 67890123ABCD  67890123         Yoghurt       M          3       52         15.6     [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': '0', 'Quantity': 99, 'Price': 12.0},
                                                                                        {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': '2', 'Quantity': 77, 'Price': 10.5},
                                                                                        {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': '3', 'Quantity': 10, 'Price': 11.0}]
11 67890123EFGH  67890123         Cookies       M          4       11         12.9     [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': '0', 'Quantity': 99, 'Price': 12.0},
                                                                                        {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': '2', 'Quantity': 77, 'Price': 10.5},
                                                                                        {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': '3', 'Quantity': 10, 'Price': 11.0}]

This is what I have done to get the list of row records:

item_df.groupby(['Extracted_Code', 'Category', 'Code']).apply(lambda x: x.to_dict('records')).reset_index(name='Combined')

But I am confused about how to apply the condition within each group without losing any columns in the final result.

Answer 1

Score: 2

You could perform a self-merge and keep only the rows that match your criteria (df here is the question's item_df):

m = df.reset_index().merge(df, on="Extracted_Code", suffixes=("_x", ""))
m = m[(m["Category"] == "H") & (m["Code"] <= m["Code_x"]) & (m["Category_x"] != "H")]

    index     BarCode_x Extracted_Code Description_x Category_x Code_x  Quantity_x  Price_x       BarCode Description Category Code  Quantity  Price
18      3  12345678ABCD       12345678        Apples          M      1          52     15.6  12345678AAAA      Fruits        H    0        99   12.0
24      5  12345678IJKL       12345678       Mangoes          S      4          90     67.0  12345678AAAA      Fruits        H    0        99   12.0
25      5  12345678IJKL       12345678       Mangoes          S      4          90     67.0  12345678BBBB      Fruits        H    2        77   10.5
26      5  12345678IJKL       12345678       Mangoes          S      4          90     67.0  12345678CCCC      Fruits        H    3        10   11.0
30      4  12345678EFGH       12345678       Oranges          T      2          11     12.9  12345678AAAA      Fruits        H    0        99   12.0
31      4  12345678EFGH       12345678       Oranges          T      2          11     12.9  12345678BBBB      Fruits        H    2        77   10.5
37     11  67890123IJKL       67890123          Oats          F      2          90     67.0  67890123XXXX      Snacks        H    0        99   12.0
38     11  67890123IJKL       67890123          Oats          F      2          90     67.0  67890123YYYY      Snacks        H    2        77   10.5
61      9  67890123ABCD       67890123       Yoghurt          M      3          52     15.6  67890123XXXX      Snacks        H    0        99   12.0
62      9  67890123ABCD       67890123       Yoghurt          M      3          52     15.6  67890123YYYY      Snacks        H    2        77   10.5
63      9  67890123ABCD       67890123       Yoghurt          M      3          52     15.6  67890123ZZZZ      Snacks        H    3        10   11.0
67     10  67890123EFGH       67890123       Cookies          M      4          11     12.9  67890123XXXX      Snacks        H    0        99   12.0
68     10  67890123EFGH       67890123       Cookies          M      4          11     12.9  67890123YYYY      Snacks        H    2        77   10.5
69     10  67890123EFGH       67890123       Cookies          M      4          11     12.9  67890123ZZZZ      Snacks        H    3        10   11.0
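
To see why the self-merge produces the pairs listed above, it can help to look at a tiny case: merging a frame with itself on a key pairs every row with every row that shares that key. A minimal sketch with made-up data (the `toy` frame and its values are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical toy frame: two rows share key "a", one row has key "b"
toy = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": [1, 2, 3],
})

# reset_index() keeps each left-hand row label in an "index" column;
# the suffixes mark left-hand columns with "_x" and leave right-hand names as-is
pairs = toy.reset_index().merge(toy, on="key", suffixes=("_x", ""))

print(pairs)
```

Key "a" contributes 2 x 2 pairs and key "b" contributes one, so `pairs` has five rows. The filter in the answer then only has to pick the pairs whose left side is a non-'H' row and whose right side is a qualifying 'H' row.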

Calling .reset_index() before the merge keeps each original row label in an index column, so you can .groupby("index") and apply .to_dict("records") to each group:

combined = m.groupby("index").apply(lambda group:
    group[["BarCode", "Description", "Category",
           "Code", "Quantity", "Price"
    ]].to_dict("records")
).rename("Combined")

You can then .join the result back onto df:

>>> df.join(combined)
         BarCode Extracted_Code Description Category Code  Quantity  Price                                           Combined
0   12345678AAAA       12345678      Fruits        H    0        99   12.0                                                NaN
1   12345678BBBB       12345678      Fruits        H    2        77   10.5                                                NaN
2   12345678CCCC       12345678      Fruits        H    3        10   11.0                                                NaN
3   12345678ABCD       12345678      Apples        M    1        52   15.6  [{'BarCode': '12345678AAAA', 'Description': 'F...
5   12345678IJKL       12345678     Mangoes        S    4        90   67.0  [{'BarCode': '12345678AAAA', 'Description': 'F...
4   12345678EFGH       12345678     Oranges        T    2        11   12.9  [{'BarCode': '12345678AAAA', 'Description': 'F...
11  67890123IJKL       67890123        Oats        F    2        90   67.0  [{'BarCode': '67890123XXXX', 'Description': 'S...
6   67890123XXXX       67890123      Snacks        H    0        99   12.0                                                NaN
7   67890123YYYY       67890123      Snacks        H    2        77   10.5                                                NaN
8   67890123ZZZZ       67890123      Snacks        H    3        10   11.0                                                NaN
9   67890123ABCD       67890123     Yoghurt        M    3        52   15.6  [{'BarCode': '67890123XXXX', 'Description': 'S...
10  67890123EFGH       67890123     Cookies        M    4        11   12.9  [{'BarCode': '67890123XXXX', 'Description': 'S...
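
For reference, the whole approach can be put together as one runnable script. This is only a sketch under a few assumptions: it uses a trimmed four-row subset of the question's data, and it swaps the answer's `groupby(...).apply(...)` step for a Series `groupby(...).agg(list)`, which is a variation and not the answer's own code. The names `record_cols`, `records`, and `result` are introduced here for illustration.

```python
import pandas as pd

# Trimmed subset of the question's data (fewer rows and columns)
df = pd.DataFrame({
    "BarCode": ["12345678AAAA", "12345678BBBB", "12345678ABCD", "12345678EFGH"],
    "Extracted_Code": ["12345678"] * 4,
    "Category": ["H", "H", "M", "T"],
    "Code": ["0", "2", "1", "2"],
})

# Columns that go into each record of the Combined list
record_cols = ["BarCode", "Category", "Code"]

# Pair each row with every row that shares its Extracted_Code
m = df.reset_index().merge(df, on="Extracted_Code", suffixes=("_x", ""))

# Keep pairs where the left row is not 'H', the right row is 'H',
# and the right Code does not exceed the left Code
m = m[(m["Category_x"] != "H") & (m["Category"] == "H") & (m["Code"] <= m["Code_x"])]

# One dict per matching 'H' row, labelled by the original row it belongs to
records = pd.Series(m[record_cols].to_dict("records"), index=m["index"].to_numpy())

# Gather the dicts for each original row into a list
combined = records.groupby(level=0).agg(list).rename("Combined")

# 'H' rows have no entry in combined, so the join leaves them as NaN
result = df.join(combined)
print(result)
```

Because the join aligns on row labels, the order of df does not matter, and no column of the original frame is lost.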
