How to calculate the Month to Date (MTD) and Year to Date (YTD) in a dataframe?
Question
I want to calculate the MTD and YTD values for the given dataframe. I have two columns named bp and actuals.
The MTD and YTD values for bp are the cumulative average. The MTD and YTD values for actuals, however, are the cumulative average with zero values excluded.
For example, if the actuals value for 2023-04-01 is 10, the corresponding mtd_actuals value is 10. If the actuals value for 2023-04-02 is 20, mtd_actuals is 15. If the actuals value for 2023-04-03 is 0, mtd_actuals stays at 15.
Input dataframe:
data = {
'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05'],
'category': ['QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS'],
'parameter': ['TBR', 'TBR', 'TBR', 'TBR', 'TBR'],
'region': ['DELHI', 'DELHI', 'DELHI', 'DELHI', 'DELHI'],
'factory': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC'],
'actuals': [10, 20, 0, 0, 10]
}
Output dataframe:
date category parameter ... actuals mtd_bp mtd_actuals ytd_bp ytd_actuals
2023-04-01 QUALITY PARAMETERS TBR ... 10 51 10.000000 51 10.000000
2023-04-02 QUALITY PARAMETERS TBR ... 20 51 15.000000 51 15.000000
2023-04-03 QUALITY PARAMETERS TBR ... 0 51 15.000000 51 15.000000
2023-04-04 QUALITY PARAMETERS TBR ... 0 51 15.000000 51 15.000000
2023-04-05 QUALITY PARAMETERS TBR ... 10 51 13.333330 51 13.333330
I used the code below to calculate the MTD and YTD columns:
df['mtd_bp'] = df.groupby([df.date.dt.to_period('m'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])
df['mtd_actuals'] = df.groupby([df.date.dt.to_period('m'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])
df['ytd_bp'] = df.groupby([df.date.dt.to_period('A-MAR'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])
df['ytd_actuals'] = df.groupby([df.date.dt.to_period('A-MAR'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])
Here the mtd_bp and ytd_bp values come out correctly, but mtd_actuals and ytd_actuals do not.
Can anyone suggest a solution, or modify the mtd_actuals and ytd_actuals code to produce the correct output?
Answer 1
Score: 1
Your example does not show how the expanding average restarts each month while excluding zeros, so I created a new one.
Example
import pandas as pd
data1 = {'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05', '2023-05-01', '2023-05-02', '2023-05-03'],
'bp': [51, 51, 51, 51, 51, 51, 51, 51], 'actuals': [10, 20, 0, 0, 10, 20, 10, 0]}
df = pd.DataFrame(data1)
df
date bp actuals
0 2023-04-01 51 10
1 2023-04-02 51 20
2 2023-04-03 51 0
3 2023-04-04 51 0
4 2023-04-05 51 10
5 2023-05-01 51 20
6 2023-05-02 51 10
7 2023-05-03 51 0
Step 1
Create a monthly period grouper to pass to groupby.
grouper = pd.to_datetime(df['date']).dt.to_period(freq='M')
grouper
0 2023-04
1 2023-04
2 2023-04
3 2023-04
4 2023-04
5 2023-05
6 2023-05
7 2023-05
Name: date, dtype: period[M]
Step 2
Mask the zeros, group by grouper, and take the expanding mean.
df['actuals'].mask(df['actuals'].eq(0)).groupby(grouper).expanding().mean().droplevel(0)
Output:
0 10.000000
1 15.000000
2 15.000000
3 15.000000
4 13.333333
5 20.000000
6 15.000000
7 15.000000
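The mask is enough here because expanding().mean() averages only the non-NaN values seen so far, so a masked zero contributes to neither the sum nor the count. A minimal check on a single month of values (the names s, plain, and masked are illustrative, not from the question):

```python
import pandas as pd

# One month of 'actuals' values, including zeros.
s = pd.Series([10, 20, 0, 0, 10])

# Plain expanding mean: zeros pull the average down.
plain = s.expanding().mean()

# Zeros masked to NaN first: they are skipped by the mean.
masked = s.mask(s.eq(0)).expanding().mean()

print(plain.tolist())
print(masked.tolist())
```

The masked version carries the last average forward on zero rows, which matches the behaviour the question asks for.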
Let's compare df with the output:
         date  bp  actuals     output
0  2023-04-01  51       10  10.000000
1  2023-04-02  51       20  15.000000
2  2023-04-03  51        0  15.000000
3  2023-04-04  51        0  15.000000
4  2023-04-05  51       10  13.333333
5  2023-05-01  51       20  20.000000
6  2023-05-02  51       10  15.000000
7  2023-05-03  51        0  15.000000
From here, you can assign the result to whichever output columns you need.
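As a rough sketch of how the four columns from the question could be assembled with this approach: the helper expanding_mean and the key names month and fiscal_year are hypothetical, only factory is used as an extra grouping key for brevity, and the April-to-March fiscal year is an assumption inferred from the 'A-MAR' frequency in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2023-04-01", "2023-04-02", "2023-04-03", "2023-04-04",
             "2023-04-05", "2023-05-01", "2023-05-02", "2023-05-03"],
    "factory": ["ABC"] * 8,
    "bp": [51] * 8,
    "actuals": [10, 20, 0, 0, 10, 20, 10, 0],
})
df["date"] = pd.to_datetime(df["date"])

# Grouping keys: calendar month, and a fiscal year labelled by the year it
# ends in (assumed to run April to March).
month = df["date"].dt.to_period("M").rename("month")
fiscal_year = (df["date"].dt.year + (df["date"].dt.month >= 4)).rename("fiscal_year")

# Zeros become NaN so the expanding mean skips them.
nonzero_actuals = df["actuals"].mask(df["actuals"].eq(0))


def expanding_mean(values, keys):
    """Expanding mean within each group, aligned to the original index."""
    return values.groupby(keys).transform(lambda s: s.expanding().mean())


df["mtd_bp"] = expanding_mean(df["bp"], [month, df["factory"]])
df["mtd_actuals"] = expanding_mean(nonzero_actuals, [month, df["factory"]])
df["ytd_bp"] = expanding_mean(df["bp"], [fiscal_year, df["factory"]])
df["ytd_actuals"] = expanding_mean(nonzero_actuals, [fiscal_year, df["factory"]])

print(df)
```

Using transform keeps the original index, so no droplevel call is needed. If a group starts with a zero, the masked columns are NaN until the first non-zero value appears.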
Answer 2
Score: 0
Another option:
import pandas as pd
import numpy as np
data = {
'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05'],
'category': ['QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS'],
'parameter': ['TBR', 'TBR', 'TBR', 'TBR', 'TBR'],
'region': ['DELHI', 'DELHI', 'DELHI', 'DELHI', 'DELHI'],
'factory': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC'],
'actuals': [10, 20, 0, 0, 10]}
# Convert data into DataFrame
df = pd.DataFrame(data)
# Replace zeros with NaN; assign back rather than using inplace=True on a column selection
df['actuals'] = df['actuals'].replace(0, np.nan)
# Compute the cumulative average and fill NaN values with the previous value
df['ytd'] = df['actuals'].expanding().mean().ffill()
print(df)
date category parameter region factory actuals ytd
0 2023-04-01 QUALITY PARAMETERS TBR DELHI ABC 10.0 10.000000
1 2023-04-02 QUALITY PARAMETERS TBR DELHI ABC 20.0 15.000000
2 2023-04-03 QUALITY PARAMETERS TBR DELHI ABC NaN 15.000000
3 2023-04-04 QUALITY PARAMETERS TBR DELHI ABC NaN 15.000000
4 2023-04-05 QUALITY PARAMETERS TBR DELHI ABC 10.0 13.333333
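Note that the snippet above computes one running average over the whole frame, so it does not restart at a month boundary. A minimal sketch of a per-month variant, using a running sum divided by a running count of non-zero rows instead of an expanding mean (the sample data and the names month, running_sum, and running_count are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-01", "2023-04-02", "2023-04-03",
                            "2023-05-01", "2023-05-02"]),
    "actuals": [10, 20, 0, 0, 30],
})

# Calendar month used as the grouping key.
month = df["date"].dt.to_period("M")

# Running total of actuals within each month (zeros add nothing).
running_sum = df["actuals"].groupby(month).cumsum()

# Running count of non-zero rows within each month.
running_count = df["actuals"].ne(0).astype(int).groupby(month).cumsum()

# Divide; a count of 0 is turned into NaN so the result is NaN, not inf.
df["mtd_actuals"] = running_sum / running_count.where(running_count > 0)

print(df)
```

With this data the May group starts with a zero, so its first mtd_actuals value is NaN until a non-zero value arrives.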