Get cumsum of grouped dataframe with merged information from different rows
Question
I have the following DataFrame:
import numpy as np
import pandas as pd

data1 = {
    'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country': ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
}
df1 = pd.DataFrame(data1)
I would like a cumulative sum of orders, i.e. the running sum of 'Amount' for each ('Product', 'Name') pair. My approach is to group by ['Product', 'Name'] and apply pd.Series.cumsum through transform:
df1['Cumsum orders'] = df1.groupby(['Product', 'Name'])['Amount'].transform(pd.Series.cumsum)
print('df1 with cumsum\n', df1)
It results in
    Product         Name  Region    Country  Amount  Cumsum orders
0  Product1        Bayer  Europe     France     910            910
1  Product2       Sanofi  Europe        NaN     200            200
2  Product3       Pfizer    Asia        NaN     898            898
3  Product1        Bayer     NaN        NaN      12            922
4  Product1        Bayer     NaN        NaN      50            972
5  Product1        Bayer     NaN        NaN      13            985
6  Product1  AstraZeneca     NaN      India      52             52
7  Product4     Company1    Asia  Indonesia     250            250
8  Product5     Company2     NaN        NaN     260            260
9  Product6     Company2     NaN        NaN     270            270
What I want, however, is a single row for each ('Product', 'Name') pair that combines the 'Region' and 'Country' information from the different rows of that pair. I would expect:
#     Product         Name  Region    Country  Amount  Cumsum orders
# 0  Product1        Bayer  Europe     France      13            985
# 6  Product1  AstraZeneca     NaN      India      52             52
# 1  Product2       Sanofi  Europe        NaN     200            200
# 2  Product3       Pfizer    Asia        NaN     898            898
# 4  Product4     Company1    Asia  Indonesia     250            250
# 5  Product5     Company2     NaN        NaN     260            260
# 6  Product6     Company2     NaN        NaN     270            270
In principle, an already answered question (https://stackoverflow.com/questions/54366441/merge-information-from-rows-with-same-index-in-a-single-row-with-pandas) deals with this issue. Unfortunately, I can't get the selection by 'non-numeric' dtypes to work on my DataFrame. I am grateful for any hint.
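For reference, selecting columns by dtype (the step mentioned above) might look like the following minimal sketch. It assumes a small hypothetical frame `df` and uses `DataFrame.select_dtypes` to split numeric from non-numeric columns; the names `keys`, `agg_spec`, and `merged` are illustrative and are not taken from the linked answer.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened data for illustration only.
df = pd.DataFrame({
    'Product': ['Product1', 'Product1', 'Product2'],
    'Name': ['Bayer', 'Bayer', 'Sanofi'],
    'Region': ['Europe', np.nan, 'Europe'],
    'Amount': [910, 12, 200],
})

keys = ['Product', 'Name']

# Split the non-key columns by dtype.
numeric_cols = df.select_dtypes(include='number').columns.difference(keys)
other_cols = df.select_dtypes(exclude='number').columns.difference(keys)

# Sum numeric columns; take the first non-null value of everything else.
agg_spec = {**{c: 'sum' for c in numeric_cols},
            **{c: 'first' for c in other_cols}}

merged = df.groupby(keys, as_index=False).agg(agg_spec)
print(merged)
```

Building the aggregation mapping from the dtypes avoids listing every column by hand, which may help when the real frame has many more columns than this example.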
Answer 1
Score: 2
import numpy as np
import pandas as pd

data1 = {
    'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1', 'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer', 'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan, np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country': ['France', np.nan, np.nan, np.nan, np.nan, np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
}
df1 = pd.DataFrame(data1)

# 'first' takes the first non-null value in each group; 'sum' totals the amounts.
r = (df1.groupby(['Product', 'Name'])
        .agg({'Region': 'first', 'Country': 'first', 'Amount': 'sum'})
     )
print(r)
Result
                      Region    Country  Amount
Product  Name
Product1 AstraZeneca     NaN      India      52
         Bayer        Europe     France     985
Product2 Sanofi       Europe        NaN     200
Product3 Pfizer         Asia        NaN     898
Product4 Company1       Asia  Indonesia     250
Product5 Company2        NaN        NaN     260
Product6 Company2        NaN        NaN     270
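If a flat frame closer to the expected output in the question is needed, named aggregation is one option. The sketch below is an illustration rather than part of the original answer: it assumes that 'Amount' should show the last order of each pair and that 'Cumsum orders' is the group total, and the variable name `out` is made up for the example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Product': ['Product1', 'Product2', 'Product3', 'Product1', 'Product1',
                'Product1', 'Product1', 'Product4', 'Product5', 'Product6'],
    'Name': ['Bayer', 'Sanofi', 'Pfizer', 'Bayer', 'Bayer',
             'Bayer', 'AstraZeneca', 'Company1', 'Company2', 'Company2'],
    'Region': ['Europe', 'Europe', 'Asia', np.nan, np.nan,
               np.nan, np.nan, 'Asia', np.nan, np.nan],
    'Country': ['France', np.nan, np.nan, np.nan, np.nan,
                np.nan, 'India', 'Indonesia', np.nan, np.nan],
    'Amount': [910, 200, 898, 12, 50, 13, 52, 250, 260, 270],
})

out = (
    df1.groupby(['Product', 'Name'], as_index=False, sort=False)
       .agg(
           Region=('Region', 'first'),    # first non-null value in the group
           Country=('Country', 'first'),
           Amount=('Amount', 'last'),     # assumption: keep the last order
           # The column name contains a space, so it is passed via a dict.
           **{'Cumsum orders': ('Amount', 'sum')},
       )
       .sort_values('Product', kind='stable')  # keep first-seen order per product
       .reset_index(drop=True)
)
print(out)
```

With `as_index=False` the group keys stay as ordinary columns, so no `reset_index` of a MultiIndex is needed afterwards.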
Answer 2
Score: 1
Hope this can help you:
Example
df1["Cumsum Orders"] = df1.groupby(['Product', 'Name'])["Amount"].transform("sum")
df1 = df1.drop_duplicates(["Product", "Name"])
df1 = df1.sort_values("Product")
Results
    Product         Name  Region    Country  Amount  Cumsum Orders
0  Product1        Bayer  Europe     France     910            985
6  Product1  AstraZeneca     NaN      India      52             52
1  Product2       Sanofi  Europe        NaN     200            200
2  Product3       Pfizer    Asia        NaN     898            898
7  Product4     Company1    Asia  Indonesia     250            250
8  Product5     Company2     NaN        NaN     260            260
9  Product6     Company2     NaN        NaN     270            270
If this answers your question, please consider accepting the answer. Also check whether the 'Amount' column should be dropped.
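One caveat worth noting: `drop_duplicates` keeps the first row of each pair by default, so 'Region' or 'Country' values that only appear on later rows would be lost. In the question's data they happen to sit on the first row. A minimal sketch of one way to guard against that, using a hypothetical frame `df` in which the region appears only on the second row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Region' for Bayer is missing on the first row.
df = pd.DataFrame({
    'Product': ['Product1', 'Product1', 'Product2'],
    'Name': ['Bayer', 'Bayer', 'Sanofi'],
    'Region': [np.nan, 'Europe', 'Asia'],
    'Amount': [910, 12, 200],
})

keys = ['Product', 'Name']

# Broadcast the first non-null 'Region' of each group to all of its rows.
df['Region'] = df.groupby(keys)['Region'].transform('first')
df['Cumsum Orders'] = df.groupby(keys)['Amount'].transform('sum')

# Now any row of a group carries the merged information.
result = df.drop_duplicates(keys).sort_values('Product')
print(result)
```

The same `transform('first')` step can be repeated for 'Country' or any other descriptive column before dropping duplicates.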