February 10, 2023, 02:58:58

Calculate the rolling average every two weeks for the same day and hour in a DataFrame

Question

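I have a DataFrame like the following:

import pandas as pd

# 15-minute timestamps covering four ISO weeks
df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
df['week'] = df['datetime'].apply(lambda x: int(x.isocalendar()[1]))
df['day_of_week'] = df['datetime'].dt.weekday
df['hour'] = df['datetime'].dt.hour
df['minutes'] = pd.DatetimeIndex(df['datetime']).minute
df['value'] = range(len(df))
df.set_index('datetime', inplace=True)

df =
                     week  day_of_week  hour  minutes  value
datetime
2023-01-02 00:00:00     1            0     0        0      0
2023-01-02 00:15:00     1            0     0       15      1
2023-01-02 00:30:00     1            0     0       30      2
2023-01-02 00:45:00     1            0     0       45      3
2023-01-02 01:00:00     1            0     1        0      4
...                   ...          ...   ...      ...    ...
2023-01-08 23:00:00     1            6    23        0    668
2023-01-08 23:15:00     1            6    23       15    669
2023-01-08 23:30:00     1            6    23       30    670
2023-01-08 23:45:00     1            6    23       45    671
2023-01-09 00:00:00     2            0     0        0    672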

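I want to calculate the average of the "value" column for the same day of week, hour and minute, over every two consecutive weeks. What I would like to get is the following: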

df =
                                              value
day_of_week hour minutes datetime
0           0    0       2023-01-02 00:00:00    NaN
                         2023-01-09 00:00:00    NaN
                         2023-01-16 00:00:00    336
                         2023-01-23 00:00:00   1008
                 15      2023-01-02 00:15:00    NaN
                         2023-01-09 00:15:00    NaN
                         2023-01-16 00:15:00    337
                         2023-01-23 00:15:00   1009

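So the first two weeks should be NaN, week 3 should be the average of weeks 1 and 2, week 4 the average of weeks 2 and 3, and so on. I tried the following code, but it does not do what I expect: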

df = pd.DataFrame(df.groupby(['day_of_week', 'hour', 'minutes'])['value'].rolling(window='14D', min_periods=1).mean())

What I get instead is:

                                              value
day_of_week hour minutes datetime
0           0    0       2023-01-02 00:00:00      0
                         2023-01-09 00:00:00    336
                         2023-01-16 00:00:00   1008
                         2023-01-23 00:00:00   1680
                 15      2023-01-02 00:15:00      1
                         2023-01-09 00:15:00    337
                         2023-01-16 00:15:00   1009
                         2023-01-23 00:15:00   1681
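To see where those numbers come from, it may help to isolate a single slot. The sketch below is an illustration rather than part of the original post: it hard-codes the four Monday 00:00 readings from the question's data in a hypothetical Series named `slot`, then compares the attempted rolling mean with the stated goal written out by hand.

```python
import pandas as pd

# One slot (Mondays at 00:00) from the question's data: one reading per week.
# The frame has 96 * 7 = 672 rows per week, hence the step of 672.
slot = pd.Series(
    [0, 672, 1344, 2016],
    index=pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-16", "2023-01-23"]),
    name="value",
)

# The attempt: each window spans (t - 14 days, t], so it holds the current
# week plus the week before, and min_periods=1 accepts a lone first value.
attempt = slot.rolling(window="14D", min_periods=1).mean()

# The stated goal, written out by hand: mean of the two preceding weeks.
wanted = (slot.shift(1) + slot.shift(2)) / 2

print(pd.DataFrame({"attempt": attempt, "wanted": wanted}))
```

The `attempt` column reproduces 0, 336, 1008, 1680, while `wanted` gives NaN, NaN, 336, 1008. The two-week means are the same numbers, but each one sits a week too early, and the first row is filled in because `min_periods=1` allows it.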

I think you can try the following to get the averages you want:

# Start from the original df built above (before it is overwritten by the attempt).
# Mean of the two previous weeks within each (day_of_week, hour, minutes) slot.
# window=2 counts rows, so this assumes exactly one row per slot per week.
df['average_value'] = (
    df.groupby(['day_of_week', 'hour', 'minutes'])['value']
      .transform(lambda s: s.rolling(window=2, min_periods=2).mean().shift())
)

# Reshape to one row per slot and one column per week
result = df.set_index(['day_of_week', 'hour', 'minutes', 'week'])['average_value'].unstack('week')

# Rename the columns
result.columns = [f'week-{i}' for i in result.columns]

# Reset the index
result = result.reset_index()

# Print the result
print(result)

This should give the same averages as the desired output, laid out with one column per week; the week-1 and week-2 columns are entirely NaN.


Answer 1

Score: 1

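I think you want to shift within each group, which takes a second groupby. With `min_periods=2`, the first week in each group comes out as NaN, and the shift then moves each two-week mean down to the following week:

(df.groupby(['day_of_week', 'hour', 'minutes'])['value']
   .rolling(window='14D', min_periods=2).mean()         # `min_periods` is different
   .groupby(['day_of_week', 'hour', 'minutes']).shift()   # shift within each group
   .to_frame()
)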

Output:

                                               value
day_of_week hour minutes datetime
0           0    0       2023-01-02 00:00:00     NaN
                         2023-01-09 00:00:00     NaN
                         2023-01-16 00:00:00   336.0
                         2023-01-23 00:00:00  1008.0
                 15      2023-01-02 00:15:00     NaN
...                                              ...
6           23   30      2023-01-15 23:30:00     NaN
                         2023-01-22 23:30:00  1006.0
                 45      2023-01-08 23:45:00     NaN
                         2023-01-15 23:45:00     NaN
                         2023-01-22 23:45:00  1007.0
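For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the question's frame, applies the answer's rolling-then-shift approach, and also tries one possible alternative that is not from the original answer: passing `closed="left"` so that each 14-day window excludes the current row. The names `keys`, `answer` and `alt` are illustrative only.

```python
import pandas as pd

# Rebuild the question's frame (the unused 'week' column is left out).
df = pd.DataFrame()
df["datetime"] = pd.date_range(start="2023-1-2", end="2023-1-29", freq="15min")
df["day_of_week"] = df["datetime"].dt.weekday
df["hour"] = df["datetime"].dt.hour
df["minutes"] = df["datetime"].dt.minute
df["value"] = range(len(df))
df = df.set_index("datetime")

keys = ["day_of_week", "hour", "minutes"]

# The answer's approach: time-based window, then shift within each slot.
answer = (
    df.groupby(keys)["value"]
    .rolling(window="14D", min_periods=2).mean()
    .groupby(keys).shift()
)

# Alternative sketch (an assumption, not from the answer): closed="left"
# makes each window [t - 14 days, t), so the current row is left out
# and no shift is needed.
alt = (
    df.groupby(keys)["value"]
    .rolling(window="14D", min_periods=2, closed="left").mean()
)

# Inspect the Monday 00:00 slot.
print(answer.loc[(0, 0, 0)])
print(alt.loc[(0, 0, 0)])
```

On this gap-free data the two results agree. They can differ when a week is missing from a slot: the shift always takes the previous row's mean regardless of how long ago it was, whereas `closed="left"` only looks at rows from the 14 days before each timestamp.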

