有没有办法计算仅在几列中(仅唯一值)计算运行总数?

huangapple go评论67阅读模式
英文:

Is there a way to calculate the running total across only a few columns (unique values only)?

问题

我正在尝试计算数据框中特定列的累积总和,只对唯一值进行计算感兴趣。

我下面有一个示例数据框:

姓名 产品 日期 地点 类型 销售 运费百分比 总费用
Tom 香蕉 01-01-2021 纽约 水果 120 0.01 1.2
Tom 苹果 01-01-2021 纽约 水果 120 0.01 1.2
Tom 香蕉 02-01-2021 德克萨斯 水果 420 0.01 4.2
Tom 香蕉 02-01-2021 德克萨斯 水果 120 0.01 1.2
Mat 香蕉 02-01-2021 纽约 水果 30 0.01 0.3

我想要一个“累积总计”列,但只考虑姓名和日期(作为分组列),并显示总费用列的唯一值之和。这将导致类似以下的结果:

姓名 产品 日期 地点 类型 销售 运费百分比 总费用 累积总计
Tom 香蕉 01-01-2021 纽约 水果 120 0.01 1.2 1.2
Tom 苹果 01-01-2021 纽约 水果 120 0.01 1.2 1.2
Tom 香蕉 02-01-2021 德克萨斯 水果 420 0.01 4.2 4.2
Tom 香蕉 02-01-2021 德克萨斯 水果 120 0.01 1.2 5.4
Mat 香蕉 02-01-2021 纽约 水果 30 0.01 0.3 0.3

我迷茫了,我还没有找到任何能给我这个结果的方法。

英文:

I am trying to calculate the running total across a few specific columns of my dataFrame and I am only interested in calculating using unique values.

I have below an example dataframe:

Name Product Date Location Type Sales Ship Fee % Total Fee
Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2
Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2
Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2
Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2
Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3

I want to have a Running Total column, but only considering the Name and Date (as groupBy columns) and showing the sum of the unique values of Total Fee column. That would result in something like this:

Name Product Date Location Type Sales Ship Fee % Total Fee Running Total
Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2 1.2
Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2 1.2
Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2 4.2
Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2 5.4
Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3 0.3

I am lost -> I haven't been able to find anything that can give me this result.

答案1

得分: 2

以下是翻译好的内容:

Option 1: 使用groupby按“Name”和“Date”分组,然后仅对“Total Fee”的唯一值进行cumsum操作

df['Running Total'] = df.drop_duplicates(['Name', 'Date', 'Total Fee']).groupby(['Name', 'Date'])['Total Fee'].cumsum()
df['Running Total'] = df['Running Total'].fillna(df['Total Fee'])

Option 2: 使用groupby按“Name”、“Product”和“Date”分组,然后进行cumsum操作,给出每个人每天每种产品的累积总额。

df['Running Total'] = df.groupby(['Name', 'Product', 'Date'], as_index=False)['Total Fee'].cumsum()

测试和示例

给定这个数据框:

Name Product Date Location Type Sales Ship Fee % Total Fee
0 Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2
1 Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2
2 Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2
3 Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2
4 Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3
5 Mat Bananas 02-01-2021 NY Fruit 50 0.01 0.3
6 Mat Apples 03-01-2021 NY Vegetable 80 0.02 1.6

Option 1 结果:

Name Product Date Location Type Sales Ship Fee % Total Fee Running Total
0 Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2 1.2
1 Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2 1.2
2 Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2 4.2
3 Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2 5.4
4 Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3 0.3
5 Mat Bananas 02-01-2021 NY Fruit 50 0.01 0.3 0.3
6 Mat Apples 03-01-2021 NY Vegetable 80 0.02 1.6 1.6

Option 2 结果:

Name Product Date Location Type Sales Ship Fee % Total Fee Running Total
0 Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2 1.2
1 Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2 1.2
2 Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2 4.2
3 Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2 5.4
4 Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3 0.3
5 Mat Bananas 02-01-2021 NY Fruit 50 0.01 0.3 0.6
6 Mat Apples 03-01-2021 NY Vegetable 80 0.02 1.6 1.6
英文:

I think this is what you are looking for:

Option 1: groupby "Name" and "Date" then cumsum only unique values for Total Fee

df['Running Total'] = df.drop_duplicates(['Name', 'Date', 'Total Fee']).groupby(['Name', 'Date'])['Total Fee'].cumsum()
df['Running Total'] = df['Running Total'].fillna(df['Total Fee'])

Option 2: groupby "Name", "Product", "Date". Then cumsum --> gives the accumulated sum for each product on each day for each person.

df['Running Total'] = df.groupby(['Name', 'Product','Date'], as_index=False)['Total Fee'].cumsum()

Testing and examples

Given this dataframe:

Name Product Date Location Type Sales Ship Fee % Total Fee
0 Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2
1 Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2
2 Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2
3 Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2
4 Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3
5 Mat Bananas 02-01-2021 NY Fruit 50 0.01 0.3
6 Mat Apples 03-01-2021 NY Vegetable 80 0.02 1.6

Option 1 result:

Name Product Date Location Type Sales Ship Fee % Total Fee Running Total
0 Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2 1.2
1 Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2 1.2
2 Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2 4.2
3 Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2 5.4
4 Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3 0.3
5 Mat Bananas 02-01-2021 NY Fruit 50 0.01 0.3 0.3
6 Mat Apples 03-01-2021 NY Vegetable 80 0.02 1.6 1.6

Option 2 result:

Name Product Date Location Type Sales Ship Fee % Total Fee Running Total
0 Tom Bananas 01-01-2021 NY Fruit 120 0.01 1.2 1.2
1 Tom Apples 01-01-2021 NY Fruit 120 0.01 1.2 1.2
2 Tom Bananas 02-01-2021 TX Fruit 420 0.01 4.2 4.2
3 Tom Bananas 02-01-2021 TX Fruit 120 0.01 1.2 5.4
4 Mat Bananas 02-01-2021 NY Fruit 30 0.01 0.3 0.3
5 Mat Bananas 02-01-2021 NY Fruit 50 0.01 0.3 0.6
6 Mat Apples 03-01-2021 NY Vegetable 80 0.02 1.6 1.6

huangapple
  • 本文由 发表于 2023年2月23日 22:02:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/75545837.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定