Get cumsum of grouped dataframe with merged information from different rows

Question

I have the following DataFrame:

import numpy as np
import pandas as pd

data1 = {
    'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country': ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
}

df1 = pd.DataFrame(data1)

I would like a cumulative sum of orders (that is, the sum of 'Amount' for each 'Product'/'Name' pair). My approach is to group by ['Product', 'Name'] and apply a cumulative sum to 'Amount' with transform.

df1['Cumsum orders'] = df1.groupby(['Product', 'Name'])['Amount'].transform(pd.Series.cumsum)

print('df1 with cumsum\n', df1)

This results in:

      Product         Name  Region    Country  Amount  Cumsum orders
0  Product1        Bayer  Europe     France     910            910
1  Product2       Sanofi  Europe        NaN     200            200
2  Product3       Pfizer    Asia        NaN     898            898
3  Product1        Bayer     NaN        NaN      12            922
4  Product1        Bayer     NaN        NaN      50            972
5  Product1        Bayer     NaN        NaN      13            985
6  Product1  AstraZeneca     NaN      India      52             52
7  Product4     Company1    Asia  Indonesia     250            250
8  Product5     Company2     NaN        NaN     260            260
9  Product6     Company2     NaN        NaN     270            270
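
For comparison, here is a minimal sketch of how the running total differs from the group total. It uses a shortened version of the data above, and the column names running_total and group_total are made up for illustration. The last running value within each pair equals that pair's total.

```python
import pandas as pd

# Shortened illustrative data: two Bayer rows, one AstraZeneca row, one Sanofi row.
df = pd.DataFrame({
    "Product": ["Product1", "Product1", "Product1", "Product2"],
    "Name": ["Bayer", "Bayer", "AstraZeneca", "Sanofi"],
    "Amount": [910, 12, 52, 200],
})

grouped = df.groupby(["Product", "Name"])["Amount"]

# cumsum grows row by row within each Product/Name pair.
df["running_total"] = grouped.cumsum()

# transform("sum") repeats the pair's total on every row of that pair.
df["group_total"] = grouped.transform("sum")

print(df)
```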

What I want instead is one row for each 'Product'/'Name' pair, containing the 'Region' and 'Country' information collected from the different rows.
I would expect:

    #      Product         Name  Region    Country  Amount  Cumsum orders
    #   0  Product1        Bayer  Europe     France      13            985
    #   6  Product1  AstraZeneca     NaN      India      52             52
    #   1  Product2       Sanofi  Europe        NaN     200            200
    #   2  Product3       Pfizer    Asia        NaN     898            898
    #   4  Product4     Company1    Asia  Indonesia     250            250
    #   5  Product5     Company2     NaN        NaN     260            260
    #   6  Product6     Company2     NaN        NaN     270            270

In principle, another question that has already been answered (https://stackoverflow.com/questions/54366441/merge-information-from-rows-with-same-index-in-a-single-row-with-pandas) deals with this issue. Unfortunately, I can't select by 'non-numeric' dtypes in my DataFrame. I am grateful for any hint.

Answer 1

Score: 2

import numpy as np
import pandas as pd

data1 = {
    'Product' : ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1','Product1', 'Product4', 'Product5', 'Product6'],
    'Name' : ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region' : ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country' : ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount' : [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
}

df1 = pd.DataFrame(data1)

r = (df1.groupby(['Product', 'Name'])
    .agg({'Region':'first', 'Country':'first', 'Amount':'sum'})
)

print(r)

Result

                      Region    Country  Amount
Product  Name                                  
Product1 AstraZeneca     NaN      India      52
         Bayer        Europe     France     985
Product2 Sanofi       Europe        NaN     200
Product3 Pfizer         Asia        NaN     898
Product4 Company1       Asia  Indonesia     250
Product5 Company2        NaN        NaN     260
Product6 Company2        NaN        NaN     270
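
The reason 'first' can merge information from different rows is that the groupby 'first' aggregation returns the first non-null value in each group, so a NaN in an earlier row does not hide a value in a later one. Here is a minimal sketch on made-up data where 'Region' and 'Country' are filled in on different rows. Adding sort=False and reset_index() is an optional choice here, used to keep the original group order and get flat columns back.

```python
import numpy as np
import pandas as pd

# Illustrative data: Region and Country are each known on a different row.
df = pd.DataFrame({
    "Product": ["Product1", "Product1", "Product1"],
    "Name": ["Bayer", "Bayer", "Bayer"],
    "Region": [np.nan, "Europe", np.nan],
    "Country": [np.nan, np.nan, "France"],
    "Amount": [910, 12, 50],
})

merged = (
    df.groupby(["Product", "Name"], sort=False)
      # 'first' picks the first non-null entry per group; 'sum' adds up the amounts.
      .agg({"Region": "first", "Country": "first", "Amount": "sum"})
      # Turn the Product/Name index levels back into ordinary columns.
      .reset_index()
)

print(merged)
```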

Answer 2

Score: 1

Hope this can help you:

Example

df1["Cumsum Orders"] = df1.groupby(['Product', 'Name'])["Amount"].transform("sum")

df1 = df1.drop_duplicates(["Product", "Name"])

df1 = df1.sort_values("Product")

Results

    Product         Name  Region    Country  Amount  Cumsum Orders
0  Product1        Bayer  Europe     France     910            985
6  Product1  AstraZeneca     NaN      India      52             52
1  Product2       Sanofi  Europe        NaN     200            200
2  Product3       Pfizer    Asia        NaN     898            898
7  Product4     Company1    Asia  Indonesia     250            250
8  Product5     Company2     NaN        NaN     260            260
9  Product6     Company2     NaN        NaN     270            270

If I have answered your question to your satisfaction, then consider accepting the answer.

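Also check whether the 'Amount' column should be dropped: after drop_duplicates it holds only the amount from the first row of each pair (910 for Bayer), not the total.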
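
If the 'Amount' of the last row is wanted next to the total, as in the expected output of the question, one possible sketch uses named aggregation instead of drop_duplicates. The output column names last_amount and total_orders are made up for illustration, and the data is a shortened version of the question's.

```python
import numpy as np
import pandas as pd

# Shortened version of the question's data: four Bayer rows and one AstraZeneca row.
df = pd.DataFrame({
    "Product": ["Product1"] * 5,
    "Name": ["Bayer", "Bayer", "Bayer", "Bayer", "AstraZeneca"],
    "Region": ["Europe", np.nan, np.nan, np.nan, np.nan],
    "Country": ["France", np.nan, np.nan, np.nan, "India"],
    "Amount": [910, 12, 50, 13, 52],
})

summary = (
    df.groupby(["Product", "Name"], sort=False)
      .agg(
          Region=("Region", "first"),        # first non-null Region in the pair
          Country=("Country", "first"),      # first non-null Country in the pair
          last_amount=("Amount", "last"),    # amount of the pair's last row
          total_orders=("Amount", "sum"),    # total over all rows of the pair
      )
      .reset_index()
)

print(summary)
```

With sort=False the pairs appear in the order they first occur in the data, so Bayer comes before AstraZeneca.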

Posted by huangapple on March 9, 2023 at 12:50:51