How to calculate Month to Date (MTD) and Year to Date (YTD) in a dataframe?

Question

I want to calculate the MTD and YTD values for the given dataframe. It has two value columns, bp and actuals.

The MTD and YTD values for bp are cumulative averages. The MTD and YTD values for actuals are also cumulative averages, but zero values must be excluded from the average.

For example, if the actuals value for 2023-04-01 is 10, the corresponding mtd_actuals value is 10. If the actuals value for 2023-04-02 is 20, mtd_actuals becomes 15. If the actuals value for 2023-04-03 is 0, mtd_actuals stays at 15.
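The rule described above can be written out as a plain loop. This is only a minimal sketch of the definition, not a pandas solution; the function name running_mean_ignoring_zeros is hypothetical, and returning NaN before the first non-zero value is an assumption, since the question does not say what should happen in that case.

```python
def running_mean_ignoring_zeros(values):
    """Cumulative mean of the non-zero values seen so far."""
    total = 0.0
    count = 0
    result = []
    for v in values:
        if v != 0:
            total += v
            count += 1
        # Assumption: no non-zero value seen yet -> NaN
        result.append(total / count if count else float("nan"))
    return result


mtd = running_mean_ignoring_zeros([10, 20, 0, 0, 10])
print(mtd)
```

The first four results are 10, 15, 15 and 15, and the last one is 40 / 3, which matches the mtd_actuals column in the expected output.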

Input dataframe :

data = {
    'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05'],
    'category': ['QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS'],
    'parameter': ['TBR', 'TBR', 'TBR', 'TBR', 'TBR'],
    'region': ['DELHI', 'DELHI', 'DELHI', 'DELHI', 'DELHI'],
    'factory': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC'],
    'actuals': [10, 20, 0, 0, 10]
}

Expected output dataframe:

date            category        parameter ...   actuals mtd_bp mtd_actuals ytd_bp ytd_actuals
2023-04-01  QUALITY PARAMETERS       TBR  ...      10     51   10.000000   51   10.000000 
2023-04-02  QUALITY PARAMETERS       TBR  ...      20     51   15.000000   51   15.000000
2023-04-03  QUALITY PARAMETERS       TBR  ...       0     51   15.000000   51   15.000000
2023-04-04  QUALITY PARAMETERS       TBR  ...       0     51   15.000000   51   15.000000
2023-04-05  QUALITY PARAMETERS       TBR  ...      10     51   13.333330   51   13.333330

I used the code below to calculate the MTD and YTD columns:

df['mtd_bp'] = df.groupby([df.date.dt.to_period('m'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])

df['mtd_actuals'] = df.groupby([df.date.dt.to_period('m'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])

df['ytd_bp'] = df.groupby([df.date.dt.to_period('A-MAR'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])

df['ytd_actuals'] = df.groupby([df.date.dt.to_period('A-MAR'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])

Here the mtd_bp and ytd_bp values come out correctly, but the mtd_actuals and ytd_actuals values are not correct.

Can anyone suggest a solution, or modify the mtd_actuals and ytd_actuals code to get the correct output?

Answer 1

Score: 1

Your example covers a single month, so it is not ideal for showing an expanding mean that restarts each month and excludes 0. I created a new one.

Example

import pandas as pd
data1 = {'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05', '2023-05-01', '2023-05-02', '2023-05-03'], 
         'bp': [51, 51, 51, 51, 51, 51, 51, 51], 'actuals': [10, 20, 0, 0, 10, 20, 10, 0]}
df = pd.DataFrame(data1)

df

         date  bp  actuals
0  2023-04-01  51       10
1  2023-04-02  51       20
2  2023-04-03  51        0
3  2023-04-04  51        0
4  2023-04-05  51       10
5  2023-05-01  51       20
6  2023-05-02  51       10
7  2023-05-03  51        0

Step 1

Create a monthly period grouper for groupby:

grouper = pd.to_datetime(df['date']).dt.to_period(freq='M')

grouper

0    2023-04
1    2023-04
2    2023-04
3    2023-04
4    2023-04
5    2023-05
6    2023-05
7    2023-05
Name: date, dtype: period[M]

Step 2

Mask the zeros as NaN, group by grouper, and take the expanding mean:

df['actuals'].mask(df['actuals'].eq(0)).groupby(grouper).expanding().mean().droplevel(0)

output:

0    10.000000
1    15.000000
2    15.000000
3    15.000000
4    13.333333
5    20.000000
6    15.000000
7    15.000000
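The same mask-and-expand pattern should carry over to YTD by swapping the monthly grouper for a fiscal-year key. The sketch below assumes a fiscal year that ends in March and is labelled by the year in which it ends, mirroring the 'A-MAR' period in the question. The integer key fiscal_year is a hypothetical stand-in for that period, used here because it does not depend on frequency alias names.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-04-01", "2023-04-02", "2023-04-03", "2023-04-04",
        "2023-04-05", "2023-05-01", "2023-05-02", "2023-05-03",
    ]),
    "actuals": [10, 20, 0, 0, 10, 20, 10, 0],
})

# Assumption: April to March fiscal year, labelled by its ending year
fiscal_year = df["date"].dt.year + (df["date"].dt.month >= 4).astype(int)

ytd_actuals = (
    df["actuals"]
    .mask(df["actuals"].eq(0))   # zeros become NaN and are skipped
    .groupby(fiscal_year)
    .expanding()
    .mean()
    .droplevel(0)                # drop the fiscal-year level, keep the row index
)
print(ytd_actuals)
```

Because April and May fall in the same fiscal year, the average does not restart on 2023-05-01: the last three values are 15, 14 and 14 instead of the 20, 15 and 15 produced by the monthly grouper.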

Let's compare df and output.

                  df              output
         date  bp  actuals
0  2023-04-01  51       10     10.000000
1  2023-04-02  51       20     15.000000
2  2023-04-03  51        0     15.000000
3  2023-04-04  51        0     15.000000
4  2023-04-05  51       10     13.333333

5  2023-05-01  51       20     20.000000
6  2023-05-02  51       10     15.000000
7  2023-05-03  51        0     15.000000

I leave it to you to assign this result to the columns you want in your output.
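One possible way to create those columns while also grouping by the descriptive columns from the question is sketched below. The frame is hypothetical: it combines the two-month sample with the question's region and factory columns. It also uses groupby(...).transform(...) instead of expanding().droplevel(...), a swapped-in variant that returns a result already aligned to the original index.

```python
import pandas as pd

# Hypothetical frame: two-month sample plus two key columns from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-04-01", "2023-04-02", "2023-04-03", "2023-04-04",
        "2023-04-05", "2023-05-01", "2023-05-02", "2023-05-03",
    ]),
    "region": ["DELHI"] * 8,
    "factory": ["ABC"] * 8,
    "bp": [51] * 8,
    "actuals": [10, 20, 0, 0, 10, 20, 10, 0],
})

# Grouping keys: calendar month plus the descriptive columns
keys = [df["date"].dt.to_period("M"), df["region"], df["factory"]]


def expanding_mean(s):
    """Cumulative mean within one group; NaN entries are skipped."""
    return s.expanding().mean()


# bp: plain cumulative average
df["mtd_bp"] = df["bp"].groupby(keys).transform(expanding_mean)

# actuals: mask zeros to NaN first so they do not count
nonzero = df["actuals"].mask(df["actuals"].eq(0))
df["mtd_actuals"] = nonzero.groupby(keys).transform(expanding_mean)

print(df[["date", "bp", "actuals", "mtd_bp", "mtd_actuals"]])
```

If a group starts with a zero, its first rows come out as NaN because no non-zero value has been seen yet; whether to leave them as NaN or fill them is a choice the question does not settle.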

Answer 2

Score: 0

Another option:

import pandas as pd
import numpy as np

data = {
    'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05'],
    'category': ['QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS'],
    'parameter': ['TBR', 'TBR', 'TBR', 'TBR', 'TBR'],
    'region': ['DELHI', 'DELHI', 'DELHI', 'DELHI', 'DELHI'],
    'factory': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC'],
    'actuals': [10, 20, 0, 0, 10],
}

# Convert data into a DataFrame
df = pd.DataFrame(data)

# Treat zeros as missing so they are excluded from the average
# (assigning back avoids calling replace(..., inplace=True) on a column selection)
df['actuals'] = df['actuals'].replace(0, np.nan)

# Compute the cumulative average (NaN rows are skipped) and fill any NaN results with the previous value
# Note: this runs over the whole frame; it does not restart per month or per fiscal year
df['ytd'] = df['actuals'].expanding().mean().ffill()

print(df)

         date            category parameter region factory  actuals        ytd
0  2023-04-01  QUALITY PARAMETERS       TBR  DELHI     ABC     10.0  10.000000
1  2023-04-02  QUALITY PARAMETERS       TBR  DELHI     ABC     20.0  15.000000
2  2023-04-03  QUALITY PARAMETERS       TBR  DELHI     ABC      NaN  15.000000
3  2023-04-04  QUALITY PARAMETERS       TBR  DELHI     ABC      NaN  15.000000
4  2023-04-05  QUALITY PARAMETERS       TBR  DELHI     ABC     10.0  13.333333
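A small check of how the NaN handling above behaves, as a sketch on a hypothetical series that starts with a zero. With its default settings, expanding().mean() already skips NaN entries once at least one real value has been seen, so the only NaN left in the result is a leading one, which ffill() cannot fill either.

```python
import numpy as np
import pandas as pd

# Hypothetical series that starts with a zero
actuals = pd.Series([0, 10, 0, 20]).replace(0, np.nan)

cumulative = actuals.expanding().mean()
print(cumulative.tolist())  # [nan, 10.0, 10.0, 15.0]

# ffill() has nothing to carry forward into the leading NaN
filled = cumulative.ffill()
print(filled.equals(cumulative))  # True
```

So on data like this the ffill() step is harmless but has no effect; rows before the first non-zero value need a separate decision, for example leaving them as NaN.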

huangapple
  • Posted on August 9, 2023 at 18:17:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76866778.html