March 9, 2023 12:50:51

Get cumsum of grouped dataframe with merged information from different rows

Question

I have the following challenge. Given a dataframe

import numpy as np
import pandas as pd

# Sample orders; Region and Country are only filled in on some rows
data1 = {
    'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country': ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
}

df1 = pd.DataFrame(data1)

I would like a cumulative sum of orders (the sum of 'Amount' for each 'Product'/'Name' pair). My approach is to group by ['Product', 'Name'] and transform 'Amount' with pd.Series.cumsum.

# Running total of Amount within each (Product, Name) group
df1['Cumsum orders'] = df1.groupby(['Product', 'Name'])['Amount'].transform(pd.Series.cumsum)
print('df1 with cumsum\n', df1)

It results in

      Product         Name  Region    Country  Amount  Cumsum orders
0  Product1        Bayer  Europe     France     910            910
1  Product2       Sanofi  Europe        NaN     200            200
2  Product3       Pfizer    Asia        NaN     898            898
3  Product1        Bayer     NaN        NaN      12            922
4  Product1        Bayer     NaN        NaN      50            972
5  Product1        Bayer     NaN        NaN      13            985
6  Product1  AstraZeneca     NaN      India      52             52
7  Product4     Company1    Asia  Indonesia     250            250
8  Product5     Company2     NaN        NaN     260            260
9  Product6     Company2     NaN        NaN     270            270

But what I want is a single row for each 'Product'/'Name' pair that contains the 'Region' and 'Country' information from the different rows. I would expect:

    #      Product         Name  Region    Country  Amount  Cumsum orders
    #   0  Product1        Bayer  Europe     France      13            985
    #   6  Product1  AstraZeneca     NaN      India      52             52
    #   1  Product2       Sanofi  Europe        NaN     200            200
    #   2  Product3       Pfizer    Asia        NaN     898            898
    #   4  Product4     Company1    Asia  Indonesia     250            250
    #   5  Product5     Company2     NaN        NaN     260            260
    #   6  Product6     Company2     NaN        NaN     270            270

In principle, another already answered question (https://stackoverflow.com/questions/54366441/merge-information-from-rows-with-same-index-in-a-single-row-with-pandas) deals with this issue. Unfortunately, I can't select by 'non-numeric' dtypes in my dataframe. I am grateful for any hint.

Answer 1

Score: 2

import numpy as np
import pandas as pd
data1 = {
    'Product' : ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1','Product1', 'Product4', 'Product5', 'Product6'],
    'Name' : ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region' : ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country' : ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount' : [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
}
df1 = pd.DataFrame(data1)
r = (df1.groupby(['Product', 'Name'])
    .agg({'Region':'first', 'Country':'first', 'Amount':'sum'})
)
print(r)

Result

                      Region    Country  Amount
Product  Name                                  
Product1 AstraZeneca     NaN      India      52
         Bayer        Europe     France     985
Product2 Sanofi       Europe        NaN     200
Product3 Pfizer         Asia        NaN     898
Product4 Company1       Asia  Indonesia     250
Product5 Company2        NaN        NaN     260
Product6 Company2        NaN        NaN     270
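
The aggregation above returns a frame indexed by ('Product', 'Name'). If a flat frame closer to the layout in the question is wanted, a minimal sketch using named aggregation could look like this. The output column name `Cumsum_orders` and the choice of `'last'` for `Amount` are assumptions made here to mirror the expected table in the question; they are not part of the original answer. Note that `'first'` and `'last'` in a groupby aggregation take the first or last non-null value per column, which is what merges 'Region' and 'Country' from different rows.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Product": ["Product1", "Product2", "Product3", "Product1", "Product1",
                "Product1", "Product1", "Product4", "Product5", "Product6"],
    "Name": ["Bayer", "Sanofi", "Pfizer", "Bayer", "Bayer",
             "Bayer", "AstraZeneca", "Company1", "Company2", "Company2"],
    "Region": ["Europe", "Europe", "Asia", np.nan, np.nan,
               np.nan, np.nan, "Asia", np.nan, np.nan],
    "Country": ["France", np.nan, np.nan, np.nan, np.nan,
                np.nan, "India", "Indonesia", np.nan, np.nan],
    "Amount": [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
})

flat = (
    df1.groupby(["Product", "Name"], sort=False)  # keep order of first appearance
       .agg(
           Region=("Region", "first"),         # first non-null Region in the group
           Country=("Country", "first"),       # first non-null Country in the group
           Amount=("Amount", "last"),          # assumption: keep the last order amount
           Cumsum_orders=("Amount", "sum"),    # group total, i.e. the final cumsum value
       )
       .reset_index()                          # turn the group keys back into columns
)
print(flat)
```

Dropping the `Amount=("Amount", "last")` line gives the same columns as the original answer, only flattened.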

Answer 2

Score: 1

Hope this can help you:

Example

df1["Cumsum Orders"] = df1.groupby(["Product", "Name"])["Amount"].transform("sum")
df1 = df1.drop_duplicates(["Product", "Name"])
df1 = df1.sort_values("Product")

Results

    Product         Name  Region    Country  Amount  Cumsum Orders
0  Product1        Bayer  Europe     France     910            985
6  Product1  AstraZeneca     NaN      India      52             52
1  Product2       Sanofi  Europe        NaN     200            200
2  Product3       Pfizer    Asia        NaN     898            898
7  Product4     Company1    Asia  Indonesia     250            250
8  Product5     Company2     NaN        NaN     260            260
9  Product6     Company2     NaN        NaN     270            270

If I have answered your question to your satisfaction, then consider accepting the answer.

Also check whether the Amount column should be dropped.
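
One caveat worth checking: `drop_duplicates` keeps the whole first row of each pair, so 'Region' or 'Country' values that only appear on a later row would be lost. The sketch below uses a small hypothetical frame (not the data from the question) to show the difference, and also picks the non-numeric columns by dtype with `select_dtypes`, which may help with the dtype selection problem mentioned in the question. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the Region for ("P1", "A") only appears on the second row.
df = pd.DataFrame({
    "Product": ["P1", "P1", "P2"],
    "Name": ["A", "A", "B"],
    "Region": [np.nan, "Europe", "Asia"],
    "Amount": [10, 20, 30],
})
keys = ["Product", "Name"]

# Approach from this answer: group total plus the first row of each pair.
totals = df.groupby(keys)["Amount"].transform("sum")
first_rows = df.assign(Total=totals).drop_duplicates(keys)

# Alternative: select the non-numeric, non-key columns by dtype and take
# the first non-null value per group; sum the numeric Amount column.
text_cols = df.select_dtypes(exclude="number").columns.difference(keys)
agg_spec = {col: "first" for col in text_cols}
agg_spec["Amount"] = "sum"
merged = df.groupby(keys).agg(agg_spec).reset_index()

print(first_rows)  # Region for ("P1", "A") is NaN here
print(merged)      # Region for ("P1", "A") is "Europe" here
```

With the data from the question both approaches give the same 'Region' and 'Country' values, because the first row of each pair already carries them.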
