Calculate the rolling average every two weeks for the same day and hour in a DataFrame

Question

I have a DataFrame like the following:

    import pandas as pd

    df = pd.DataFrame()
    df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
    df['week'] = df['datetime'].apply(lambda x: int(x.isocalendar()[1]))
    df['day_of_week'] = df['datetime'].dt.weekday
    df['hour'] = df['datetime'].dt.hour
    df['minutes'] = pd.DatetimeIndex(df['datetime']).minute
    df['value'] = range(len(df))
    df.set_index('datetime', inplace=True)

    df =
                         week  day_of_week  hour  minutes  value
    datetime
    2023-01-02 00:00:00     1            0     0        0      0
    2023-01-02 00:15:00     1            0     0       15      1
    2023-01-02 00:30:00     1            0     0       30      2
    2023-01-02 00:45:00     1            0     0       45      3
    2023-01-02 01:00:00     1            0     1        0      4
    ...                   ...          ...   ...      ...    ...
    2023-01-08 23:00:00     1            6    23        0    668
    2023-01-08 23:15:00     1            6    23       15    669
    2023-01-08 23:30:00     1            6    23       30    670
    2023-01-08 23:45:00     1            6    23       45    671
    2023-01-09 00:00:00     2            0     0        0    672

I want to calculate the average of the "value" column for the same day of week, hour and minute over every two consecutive weeks.

What I would like to get is the following:

    df =
                                                  value
    day_of_week hour minutes datetime
    0           0    0       2023-01-02 00:00:00    NaN
                             2023-01-09 00:00:00    NaN
                             2023-01-16 00:00:00    336
                             2023-01-23 00:00:00   1008
                     15      2023-01-02 00:15:00    NaN
                             2023-01-09 00:15:00    NaN
                             2023-01-16 00:15:00    337
                             2023-01-23 00:15:00   1009

So the first two weeks should have NaN values, week-3 should be the average of week-1 and week-2, week-4 the average of week-2 and week-3, and so on.
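
To make the target concrete, here is a minimal sketch of the arithmetic for a single slot (Mondays at 00:00). It assumes the example data above, where one week spans 672 rows (7 days × 96 quarter-hours), so that slot holds the values 0, 672, 1344 and 2016. The names `values` and `expected` are only illustrative.

```python
# Values of the Monday 00:00 slot in weeks 1 to 4 of the example data.
values = [0, 672, 1344, 2016]

# Weeks 1 and 2 have no two preceding weeks, so they stay empty (None here, NaN in pandas).
# From week 3 on, each entry is the mean of the two preceding weeks.
expected = [None, None] + [
    (values[i - 2] + values[i - 1]) / 2 for i in range(2, len(values))
]

print(expected)  # [None, None, 336.0, 1008.0]
```

The same rule applied to the 00:15 slot (1, 673, 1345, 2017) gives 337 and 1009, matching the table above.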

I tried the following code but it does not seem to do what I expect:

    df = pd.DataFrame(df.groupby(['day_of_week', 'hour', 'minutes'])['value'].rolling(window='14D', min_periods=1).mean())

What I get instead is:

                                                  value
    day_of_week hour minutes datetime
    0           0    0       2023-01-02 00:00:00      0
                             2023-01-09 00:00:00    336
                             2023-01-16 00:00:00   1008
                             2023-01-23 00:00:00   1680
                     15      2023-01-02 00:15:00      1
                             2023-01-09 00:15:00    337
                             2023-01-16 00:15:00   1009
                             2023-01-23 00:15:00   1681

Answer 1

Score: 1

I think you want to shift within each group. Then you need another groupby:

    (df.groupby(['day_of_week', 'hour', 'minutes'])['value']
       .rolling(window='14D', min_periods=2).mean()            # `min_periods` is different
       .groupby(['day_of_week', 'hour', 'minutes']).shift()    # shift within each group
       .to_frame()
    )

Output:

                                                   value
    day_of_week hour minutes datetime
    0           0    0       2023-01-02 00:00:00     NaN
                             2023-01-09 00:00:00     NaN
                             2023-01-16 00:00:00   336.0
                             2023-01-23 00:00:00  1008.0
                     15      2023-01-02 00:15:00     NaN
    ...                                              ...
    6           23   30      2023-01-15 23:30:00     NaN
                             2023-01-22 23:30:00  1006.0
                     45      2023-01-08 23:45:00     NaN
                             2023-01-15 23:45:00     NaN
                             2023-01-22 23:45:00  1007.0
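
For anyone who wants to check this end to end, here is a self-contained sketch that rebuilds the example data from the question and applies the same rolling-then-shift chain. The helper names `keys`, `result` and `monday_midnight` are mine, and the unused `week` column is left out; everything else follows the code above.

```python
import pandas as pd

# Rebuild the example data from the question (15-minute samples over four weeks).
df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
df['day_of_week'] = df['datetime'].dt.weekday
df['hour'] = df['datetime'].dt.hour
df['minutes'] = df['datetime'].dt.minute
df['value'] = range(len(df))
df = df.set_index('datetime')

keys = ['day_of_week', 'hour', 'minutes']

result = (
    df.groupby(keys)['value']
      .rolling(window='14D', min_periods=2).mean()  # mean of current and previous week
      .groupby(keys).shift()                        # move each mean to the following week
      .to_frame()
)

# Inspect one slot: Mondays at 00:00.
monday_midnight = result.xs((0, 0, 0), level=keys)['value']
print(monday_midnight)
```

With `min_periods=2` the first week of each slot has no mean, and the shift pushes every remaining mean one week later, so the first two weeks end up NaN and week 3 holds the week-1/week-2 average.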

huangapple
  • Posted on February 10, 2023 at 02:58:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75403238.html