英文:
Is there a way to calculate the running total across only a few columns (unique values only)?
问题
我正在尝试计算数据框中特定列的累积总和,只对唯一值进行计算感兴趣。
我下面有一个示例数据框:
姓名 | 产品 | 日期 | 地点 | 类型 | 销售 | 运费百分比 | 总费用 |
---|---|---|---|---|---|---|---|
Tom | 香蕉 | 01-01-2021 | 纽约 | 水果 | 120 | 0.01 | 1.2 |
Tom | 苹果 | 01-01-2021 | 纽约 | 水果 | 120 | 0.01 | 1.2 |
Tom | 香蕉 | 02-01-2021 | 德克萨斯 | 水果 | 420 | 0.01 | 4.2 |
Tom | 香蕉 | 02-01-2021 | 德克萨斯 | 水果 | 120 | 0.01 | 1.2 |
Mat | 香蕉 | 02-01-2021 | 纽约 | 水果 | 30 | 0.01 | 0.3 |
我想要一个“累积总计”列,但只考虑姓名和日期(作为分组列),并显示总费用列的唯一值之和。这将导致类似以下的结果:
姓名 | 产品 | 日期 | 地点 | 类型 | 销售 | 运费百分比 | 总费用 | 累积总计 |
---|---|---|---|---|---|---|---|---|
Tom | 香蕉 | 01-01-2021 | 纽约 | 水果 | 120 | 0.01 | 1.2 | 1.2 |
Tom | 苹果 | 01-01-2021 | 纽约 | 水果 | 120 | 0.01 | 1.2 | 1.2 |
Tom | 香蕉 | 02-01-2021 | 德克萨斯 | 水果 | 420 | 0.01 | 4.2 | 4.2 |
Tom | 香蕉 | 02-01-2021 | 德克萨斯 | 水果 | 120 | 0.01 | 1.2 | 5.4 |
Mat | 香蕉 | 02-01-2021 | 纽约 | 水果 | 30 | 0.01 | 0.3 | 0.3 |
我迷茫了,我还没有找到任何能给我这个结果的方法。
英文:
I am trying to calculate the running total across a few specific columns of my dataFrame and I am only interested in calculating using unique values.
I have below an example dataframe:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee |
---|---|---|---|---|---|---|---|
Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 |
Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 |
Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 |
Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 |
Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 |
I want to have a Running Total
column, but only considering the Name and Date (as groupBy columns) and showing the sum of the unique values of Total Fee column. That would result in something like this:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee | Running Total |
---|---|---|---|---|---|---|---|---|
Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 | 4.2 |
Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 | 5.4 |
Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 | 0.3 |
I am lost -> I haven't been able to find anything that can give me this result.
答案1
得分: 2
以下是翻译好的内容:
Option 1: 使用groupby
按“Name”和“Date”分组,然后仅对“Total Fee”的唯一值进行cumsum
操作
df['Running Total'] = df.drop_duplicates(['Name', 'Date', 'Total Fee']).groupby(['Name', 'Date'])['Total Fee'].cumsum()
df['Running Total'] = df['Running Total'].fillna(df['Total Fee'])
Option 2: 使用groupby
按“Name”、“Product”和“Date”分组,然后进行cumsum
操作,给出每个人每天每种产品的累积总额。
df['Running Total'] = df.groupby(['Name', 'Product', 'Date'], as_index=False)['Total Fee'].cumsum()
测试和示例
给定这个数据框:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee | |
---|---|---|---|---|---|---|---|---|
0 | Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 |
1 | Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 |
2 | Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 |
3 | Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 |
4 | Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 |
5 | Mat | Bananas | 02-01-2021 | NY | Fruit | 50 | 0.01 | 0.3 |
6 | Mat | Apples | 03-01-2021 | NY | Vegetable | 80 | 0.02 | 1.6 |
Option 1 结果:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee | Running Total | |
---|---|---|---|---|---|---|---|---|---|
0 | Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
1 | Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
2 | Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 | 4.2 |
3 | Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 | 5.4 |
4 | Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 | 0.3 |
5 | Mat | Bananas | 02-01-2021 | NY | Fruit | 50 | 0.01 | 0.3 | 0.3 |
6 | Mat | Apples | 03-01-2021 | NY | Vegetable | 80 | 0.02 | 1.6 | 1.6 |
Option 2 结果:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee | Running Total | |
---|---|---|---|---|---|---|---|---|---|
0 | Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
1 | Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
2 | Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 | 4.2 |
3 | Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 | 5.4 |
4 | Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 | 0.3 |
5 | Mat | Bananas | 02-01-2021 | NY | Fruit | 50 | 0.01 | 0.3 | 0.6 |
6 | Mat | Apples | 03-01-2021 | NY | Vegetable | 80 | 0.02 | 1.6 | 1.6 |
英文:
I think this is what you are looking for:
Option 1: groupby
"Name" and "Date" then cumsum
only unique values for Total Fee
df['Running Total'] = df.drop_duplicates(['Name', 'Date', 'Total Fee']).groupby(['Name', 'Date'])['Total Fee'].cumsum()
df['Running Total'] = df['Running Total'].fillna(df['Total Fee'])
Option 2: groupby
"Name", "Product", "Date". Then cumsum
--> gives the accumulated sum for each product on each day for each person.
df['Running Total'] = df.groupby(['Name', 'Product','Date'], as_index=False)['Total Fee'].cumsum()
Testing and examples
Given this dataframe:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee | |
---|---|---|---|---|---|---|---|---|
0 | Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 |
1 | Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 |
2 | Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 |
3 | Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 |
4 | Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 |
5 | Mat | Bananas | 02-01-2021 | NY | Fruit | 50 | 0.01 | 0.3 |
6 | Mat | Apples | 03-01-2021 | NY | Vegetable | 80 | 0.02 | 1.6 |
Option 1 result:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee | Running Total | |
---|---|---|---|---|---|---|---|---|---|
0 | Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
1 | Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
2 | Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 | 4.2 |
3 | Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 | 5.4 |
4 | Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 | 0.3 |
5 | Mat | Bananas | 02-01-2021 | NY | Fruit | 50 | 0.01 | 0.3 | 0.3 |
6 | Mat | Apples | 03-01-2021 | NY | Vegetable | 80 | 0.02 | 1.6 | 1.6 |
Option 2 result:
Name | Product | Date | Location | Type | Sales | Ship Fee % | Total Fee | Running Total | |
---|---|---|---|---|---|---|---|---|---|
0 | Tom | Bananas | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
1 | Tom | Apples | 01-01-2021 | NY | Fruit | 120 | 0.01 | 1.2 | 1.2 |
2 | Tom | Bananas | 02-01-2021 | TX | Fruit | 420 | 0.01 | 4.2 | 4.2 |
3 | Tom | Bananas | 02-01-2021 | TX | Fruit | 120 | 0.01 | 1.2 | 5.4 |
4 | Mat | Bananas | 02-01-2021 | NY | Fruit | 30 | 0.01 | 0.3 | 0.3 |
5 | Mat | Bananas | 02-01-2021 | NY | Fruit | 50 | 0.01 | 0.3 | 0.6 |
6 | Mat | Apples | 03-01-2021 | NY | Vegetable | 80 | 0.02 | 1.6 | 1.6 |
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论