How to calculate the Month to Date (MTD) and Year to Date (YTD) in a dataframe?

Question

I want to calculate the MTD and YTD for the given dataframe. I have two columns named bp and actuals.

The MTD and YTD values for bp are cumulative averages. The MTD and YTD values for actuals are also cumulative averages, but zero values must be excluded from the average.

For example, if the actuals value for 2023-04-01 is 10, the corresponding mtd_actuals value is 10. If the actuals value for 2023-04-02 is 20, the mtd_actuals value is 15. If the actuals value for 2023-04-03 is 0, the mtd_actuals value stays at 15.
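The arithmetic in this example can be checked with a few lines of plain Python. This is only a sketch of the rule as described; the helper name `running_mean_excluding_zeros` is made up for illustration and is not part of pandas.

```python
def running_mean_excluding_zeros(values):
    """Return, for each position, the mean of the non-zero values seen so far."""
    total = 0.0
    count = 0
    result = []
    for value in values:
        if value != 0:
            total += value
            count += 1
        # Until a non-zero value has been seen there is nothing to average.
        result.append(total / count if count else float("nan"))
    return result


actuals = [10, 20, 0, 0, 10]
mtd_actuals = running_mean_excluding_zeros(actuals)
print([round(v, 6) for v in mtd_actuals])  # [10.0, 15.0, 15.0, 15.0, 13.333333]
```

A zero leaves both the running total and the running count unchanged, which is why the value stays at 15 on 2023-04-03 and 2023-04-04.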

Input dataframe:

    data = {
        'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05'],
        'category': ['QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS'],
        'parameter': ['TBR', 'TBR', 'TBR', 'TBR', 'TBR'],
        'region': ['DELHI', 'DELHI', 'DELHI', 'DELHI', 'DELHI'],
        'factory': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC'],
        'actuals': [10, 20, 0, 0, 10]
    }

Output dataframe:

    date        category            parameter  ...  actuals  mtd_bp  mtd_actuals  ytd_bp  ytd_actuals
    2023-04-01  QUALITY PARAMETERS  TBR        ...  10       51      10.000000    51      10.000000
    2023-04-02  QUALITY PARAMETERS  TBR        ...  20       51      15.000000    51      15.000000
    2023-04-03  QUALITY PARAMETERS  TBR        ...  0        51      15.000000    51      15.000000
    2023-04-04  QUALITY PARAMETERS  TBR        ...  0        51      15.000000    51      15.000000
    2023-04-05  QUALITY PARAMETERS  TBR        ...  10       51      13.333330    51      13.333330

I used the code below to calculate the MTD and YTD columns:

    df['mtd_bp'] = df.groupby([df.date.dt.to_period('m'), 'category', 'parameter', 'region', 'factory']).bp.expanding().mean().droplevel([0, 1, 2, 3, 4])
    df['mtd_actuals'] = df.groupby([df.date.dt.to_period('m'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])
    df['ytd_bp'] = df.groupby([df.date.dt.to_period('A-MAR'), 'category', 'parameter', 'region', 'factory']).bp.expanding().mean().droplevel([0, 1, 2, 3, 4])
    df['ytd_actuals'] = df.groupby([df.date.dt.to_period('A-MAR'), 'category', 'parameter', 'region', 'factory']).actuals.expanding().mean().droplevel([0, 1, 2, 3, 4])

Here the mtd_bp and ytd_bp values come out correctly, but the mtd_actuals and ytd_actuals values do not, because the zeros are included in the average.

Can anyone suggest a solution, or modify the mtd_actuals and ytd_actuals code to get the correct output?

Answer 1

Score: 1

Your example was not ideal for showing the expansion by month, excluding 0, so I created a new one.

Example

    import pandas as pd

    data1 = {'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05', '2023-05-01', '2023-05-02', '2023-05-03'],
             'bp': [51, 51, 51, 51, 51, 51, 51, 51],
             'actuals': [10, 20, 0, 0, 10, 20, 10, 0]}
    df = pd.DataFrame(data1)

df

             date  bp  actuals
    0  2023-04-01  51       10
    1  2023-04-02  51       20
    2  2023-04-03  51        0
    3  2023-04-04  51        0
    4  2023-04-05  51       10
    5  2023-05-01  51       20
    6  2023-05-02  51       10
    7  2023-05-03  51        0

Step 1

Create a monthly period grouper to pass to groupby:

    grouper = pd.to_datetime(df['date']).dt.to_period(freq='M')

grouper

    0    2023-04
    1    2023-04
    2    2023-04
    3    2023-04
    4    2023-04
    5    2023-05
    6    2023-05
    7    2023-05
    Name: date, dtype: period[M]
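For YTD, the same idea needs a yearly key instead of a monthly one. The question's use of 'A-MAR' suggests a fiscal year that ends in March; under that assumption, one way to build the key without relying on a frequency alias is plain date arithmetic. The name `fiscal_year` and the sample dates below are illustrative only.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(
    ["2023-03-31", "2023-04-01", "2023-12-15", "2024-03-31", "2024-04-01"]
))

# Assumption: the fiscal year runs April to March and is labelled by the
# calendar year in which it ends, so April-December roll into the next year.
fiscal_year = dates.dt.year + (dates.dt.month >= 4).astype(int)

print(fiscal_year.tolist())  # [2023, 2024, 2024, 2024, 2025]
```

Passing `fiscal_year` to groupby in place of the monthly grouper restarts the expanding mean every April instead of every month.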

Step 2

Mask the zeros (they become NaN, which the mean skips), group by grouper, and take the expanding mean:

    df['actuals'].mask(df['actuals'].eq(0)).groupby(grouper).expanding().mean().droplevel(0)

output:

    0    10.000000
    1    15.000000
    2    15.000000
    3    15.000000
    4    13.333333
    5    20.000000
    6    15.000000
    7    15.000000

Let's compare df and the output:

             date  bp  actuals     output
    0  2023-04-01  51       10  10.000000
    1  2023-04-02  51       20  15.000000
    2  2023-04-03  51        0  15.000000
    3  2023-04-04  51        0  15.000000
    4  2023-04-05  51       10  13.333333
    5  2023-05-01  51       20  20.000000
    6  2023-05-02  51       10  15.000000
    7  2023-05-03  51        0  15.000000

I'll leave it to you to assign the result to the output columns you need.
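As a rough sketch of that last step, the masked expanding mean can be written back as columns with `groupby(...).transform`, which keeps the original row index and so avoids the `droplevel` call. The sample rows, the second factory `XYZ`, and the temporary column `actuals_nonzero` are invented for illustration; only `factory` is used as an extra key here, and the other key columns from the question would be added to `keys` in the same way.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-01", "2023-04-02", "2023-04-03",
                            "2023-04-01", "2023-04-02", "2023-05-01"]),
    "factory": ["ABC", "ABC", "ABC", "XYZ", "XYZ", "ABC"],
    "bp": [51, 51, 51, 40, 42, 51],
    "actuals": [10, 20, 0, 0, 8, 30],
})

month = df["date"].dt.to_period("M").rename("month")
keys = [month, "factory"]  # extend with 'category', 'parameter', 'region' as needed


def running_mean(s):
    """Expanding mean within one group; NaN entries are skipped."""
    return s.expanding().mean()


# bp: plain running mean per (month, factory).
df["mtd_bp"] = df.groupby(keys)["bp"].transform(running_mean)

# actuals: turn zeros into NaN first so they do not enter the average.
df["actuals_nonzero"] = df["actuals"].mask(df["actuals"].eq(0))
df["mtd_actuals"] = df.groupby(keys)["actuals_nonzero"].transform(running_mean)
df = df.drop(columns="actuals_nonzero")

print(df)
```

A group whose first rows are all zero gets NaN until its first non-zero value appears (the first `XYZ` row here). Replacing `month` with a yearly key in `keys` should give the YTD columns in the same way.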

Answer 2

Score: 0

Another option:

    import pandas as pd
    import numpy as np

    data = {
        'date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05'],
        'category': ['QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS', 'QUALITY PARAMETERS'],
        'parameter': ['TBR', 'TBR', 'TBR', 'TBR', 'TBR'],
        'region': ['DELHI', 'DELHI', 'DELHI', 'DELHI', 'DELHI'],
        'factory': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC'],
        'actuals': [10, 20, 0, 0, 10]}

    # Convert data into DataFrame
    df = pd.DataFrame(data)

    # Replace zeros with NaN so that the mean ignores them
    # (assigning back avoids the chained inplace call on a column)
    df['actuals'] = df['actuals'].replace(0, np.nan)

    # Compute the cumulative average and forward-fill any NaN in the result
    df['ytd'] = df['actuals'].expanding().mean().ffill()
    print(df)

Output:

             date            category parameter region factory  actuals        ytd
    0  2023-04-01  QUALITY PARAMETERS       TBR  DELHI     ABC     10.0  10.000000
    1  2023-04-02  QUALITY PARAMETERS       TBR  DELHI     ABC     20.0  15.000000
    2  2023-04-03  QUALITY PARAMETERS       TBR  DELHI     ABC      NaN  15.000000
    3  2023-04-04  QUALITY PARAMETERS       TBR  DELHI     ABC      NaN  15.000000
    4  2023-04-05  QUALITY PARAMETERS       TBR  DELHI     ABC     10.0  13.333333
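Note that the snippet above averages over the whole column and also overwrites the zeros in actuals with NaN. If the average should restart each month and the original values should stay untouched, one alternative is to divide a grouped running sum by a grouped running count of non-zero rows. This is a sketch of a different technique than the one used in the answers; the sample data and the names `running_sum` and `running_count` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-04-01", "2023-04-02", "2023-04-03",
                            "2023-05-01", "2023-05-02"]),
    "actuals": [10, 20, 0, 0, 30],
})
month = df["date"].dt.to_period("M")

# Zeros add nothing to the running sum, so no masking is needed for it.
running_sum = df["actuals"].groupby(month).cumsum()

# Count only the non-zero rows seen so far in each month.
running_count = df["actuals"].ne(0).astype(int).groupby(month).cumsum()

# Where no non-zero value has been seen yet, the count becomes NaN,
# so the result is NaN for that row.
df["mtd_actuals"] = running_sum / running_count.where(running_count > 0)

print(df)
```

In this sample, May starts with a zero, so its first row is NaN and the second row is 30.0, while April gives 10, 15, 15.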

huangapple
  • Published on August 9, 2023 at 18:17:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76866778.html