Get cumsum of grouped dataframe with merged information from different rows

Question

I have the following challenge. Given this DataFrame:

    import numpy as np
    import pandas as pd

    data1 = {
        'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
        'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
        'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
        'Country': ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
        'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
    }
    df1 = pd.DataFrame(data1)

I would like a cumulative sum of orders (the sum of 'Amount' for each 'Product'/'Name' pair). My approach is to group by ['Product', 'Name'] and transform 'Amount' with pd.Series.cumsum.

    # Running total of 'Amount' within each ('Product', 'Name') group
    df1['Cumsum orders'] = df1.groupby(['Product', 'Name'])['Amount'].transform(pd.Series.cumsum)
    print('df1 with cumsum\n', df1)

It results in

        Product         Name  Region    Country  Amount  Cumsum orders
    0  Product1        Bayer  Europe     France     910            910
    1  Product2       Sanofi  Europe        NaN     200            200
    2  Product3       Pfizer    Asia        NaN     898            898
    3  Product1        Bayer     NaN        NaN      12            922
    4  Product1        Bayer     NaN        NaN      50            972
    5  Product1        Bayer     NaN        NaN      13            985
    6  Product1  AstraZeneca     NaN      India      52             52
    7  Product4     Company1    Asia  Indonesia     250            250
    8  Product5     Company2     NaN        NaN     260            260
    9  Product6     Company2     NaN        NaN     270            270

But what I want is one unique row for each 'Product'/'Name' pair, containing the 'Region' and 'Country' information collected from the different rows of that pair. I would expect:

    #     Product         Name  Region    Country  Amount  Cumsum orders
    # 0  Product1        Bayer  Europe     France      13            985
    # 6  Product1  AstraZeneca     NaN      India      52             52
    # 1  Product2       Sanofi  Europe        NaN     200            200
    # 2  Product3       Pfizer    Asia        NaN     898            898
    # 4  Product4     Company1    Asia  Indonesia     250            250
    # 5  Product5     Company2     NaN        NaN     260            260
    # 6  Product6     Company2     NaN        NaN     270            270

In principle, another already answered question (https://stackoverflow.com/questions/54366441/merge-information-from-rows-with-same-index-in-a-single-row-with-pandas) deals with this issue. Unfortunately, I can't select the columns of my dataframe by 'non-numeric' dtype. I am grateful for any hint.
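
On the dtype point, here is a minimal sketch of one way to split columns into numeric and non-numeric ones with `DataFrame.select_dtypes` and build the aggregation from that split. The small frame and the names `keys`, `values`, `how`, and `merged` are hypothetical, chosen only for illustration. Summing every numeric column and taking the first non-null value of every other column is an assumption about what is wanted, not something the linked question prescribes.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same column layout as df1.
df = pd.DataFrame({
    'Product': ['Product1', 'Product1', 'Product2'],
    'Name': ['Bayer', 'Bayer', 'Sanofi'],
    'Region': ['Europe', np.nan, 'Europe'],
    'Country': [np.nan, 'France', np.nan],
    'Amount': [910, 12, 200],
})

keys = ['Product', 'Name']
values = df.drop(columns=keys)

# Split the non-key columns by dtype.
numeric_cols = values.select_dtypes(include='number').columns
other_cols = values.select_dtypes(exclude='number').columns

# Assumption: sum numeric columns, take the first non-null value elsewhere.
how = {col: 'sum' for col in numeric_cols}
how.update({col: 'first' for col in other_cols})

merged = df.groupby(keys, as_index=False).agg(how)
print(merged)
```

Because the groupby `'first'` aggregation skips missing values, the 'Region' from one row and the 'Country' from another end up in the same output row.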

Answer 1

Score: 2

    import numpy as np
    import pandas as pd

    data1 = {
        'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
        'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
        'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
        'Country': ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
        'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
    }
    df1 = pd.DataFrame(data1)

    # 'first' takes the first non-null value per group; 'sum' totals the amounts
    r = (df1.groupby(['Product', 'Name'])
            .agg({'Region': 'first', 'Country': 'first', 'Amount': 'sum'})
        )
    print(r)

Result

                          Region    Country  Amount
    Product  Name
    Product1 AstraZeneca     NaN      India      52
             Bayer        Europe     France     985
    Product2 Sanofi       Europe        NaN     200
    Product3 Pfizer         Asia        NaN     898
    Product4 Company1       Asia  Indonesia     250
    Product5 Company2        NaN        NaN     260
    Product6 Company2        NaN        NaN     270
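
The result above moves 'Product' and 'Name' into the index and replaces 'Amount' with the total. Here is a sketch of a variant that stays closer to the layout in the question, using pandas named aggregation. It assumes that the 13 shown for Bayer in the expected output is meant to be the last 'Amount' of the group; `result` is just an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1',
                'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer',
             'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan,
               np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country': ['France', np.nan, np.nan, np.nan, np.nan,
                np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
})

result = (
    # as_index=False keeps the keys as columns; sort=False keeps first-seen order
    df1.groupby(['Product', 'Name'], as_index=False, sort=False)
       .agg(**{
           'Region': ('Region', 'first'),       # first non-null value in the group
           'Country': ('Country', 'first'),
           'Amount': ('Amount', 'last'),        # assumption: last order amount
           'Cumsum orders': ('Amount', 'sum'),  # group total = final cumsum value
       })
)
print(result)
```

The dictionary is unpacked with `**` only because 'Cumsum orders' contains a space and so cannot be written as a keyword argument.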

Answer 2

Score: 1

Hope this can help you:

Example

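    # Broadcast each group's total to all of its rows
    df1["Cumsum Orders"] = df1.groupby(['Product', 'Name'])["Amount"].transform("sum")
    # Keep only the first row of each ('Product', 'Name') pair
    df1 = df1.drop_duplicates(["Product", "Name"])
    df1 = df1.sort_values("Product")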

Results

        Product         Name  Region    Country  Amount  Cumsum Orders
    0  Product1        Bayer  Europe     France     910            985
    6  Product1  AstraZeneca     NaN      India      52             52
    1  Product2       Sanofi  Europe        NaN     200            200
    2  Product3       Pfizer    Asia        NaN     898            898
    7  Product4     Company1    Asia  Indonesia     250            250
    8  Product5     Company2     NaN        NaN     260            260
    9  Product6     Company2     NaN        NaN     270            270

If I have answered your question to your satisfaction, then consider accepting the answer.

Also, check whether the Amount column should be dropped or not.
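
One caveat with the `drop_duplicates` approach: it keeps the first row of each pair unchanged, so a 'Region' or 'Country' that only appears in a later row would be lost. In the question's data the first row of each pair happens to carry all the available information, so the result is unaffected there. Here is a minimal sketch, on a hypothetical two-row frame, of back-filling within each group before dropping duplicates. The names `keys`, `info_cols`, and `out` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Region' is only known on the second row of the pair.
df = pd.DataFrame({
    'Product': ['Product1', 'Product1'],
    'Name': ['Bayer', 'Bayer'],
    'Region': [np.nan, 'Europe'],
    'Country': ['France', np.nan],
    'Amount': [910, 12],
})

keys = ['Product', 'Name']
info_cols = ['Region', 'Country']

# Broadcast each group's total to all of its rows.
df['Cumsum Orders'] = df.groupby(keys)['Amount'].transform('sum')

# Back-fill within each group so the first row picks up values from later rows.
df[info_cols] = df.groupby(keys)[info_cols].bfill()

# Now the first row of each pair holds the merged information.
out = df.drop_duplicates(keys)
print(out)
```

Without the back-fill step, the single remaining row would keep NaN in 'Region'.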

huangapple
  • Published on March 9, 2023, 12:50:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75680542.html