X最近年度的均值,基于时间序列索引。

huangapple go评论92阅读模式
英文:

Mean of X most recent years based on time series index

问题

import numpy as np
import pandas as pd
np.random.seed(1)

rows, cols = 4950, 5
data = np.random.rand(rows, cols)
tidx = pd.date_range('2010-01-01', periods=rows, freq='D')
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'], index=tidx)
print(df)

对于时间序列中的每一天,我想计算前五年中相同日期的平均值。例如,对于日期 2023 年 1 月 1 日:取 2022 年 1 月 1 日、2021 年 1 月 1 日、2020 年 1 月 1 日、2019 年 1 月 1 日和 2018 年 1 月 1 日的平均值。我尝试了许多变体:

df.groupby([df.index.month, df.index.day]).mean()

但我无法让它在计算每一行的均值时只使用索引年份之前的五年。我想要添加五列,其中包含“滚动”平均值,并为每一列命名,例如“A_Avg”。


<details>
<summary>英文:</summary>

I have a time series of 10 years of days, with five columns of data.  Here is code to randomly generate:

import numpy as np
import pandas as pd
np.random.seed(1)

rows,cols = 4950,5
data = np.random.rand(rows,cols)
tidx = pd.date_range('2010-01-01', periods=rows, freq='D')
df = pd.DataFrame(data, columns=['a','b','c','d','e'], index=tidx)
print (df)


For each day in the time series I would like to calculate the mean for the same day in the PREVIOUS five years.  For example for the date 1/1/2023: take the average of 1/1/2022, 1/1/2021, 1/1/2020, 1/1/2019 and 1/1/2018.  I&#39;ve tried many variations on:

df.groupby([df.index.month, df.index.day]).mean()


But I can&#39;t get it use just the five years prior to the index.year in the mean calculation for each row.  I would like to add five columns with the &quot;rolling&quot; average, and name each column for example &quot;A_Avg&quot;



</details>


# 答案1
**得分**: 1

从您的DataFrame开始,我们可以首先将`Date`列转换为[`Datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html):

```python
df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y")

然后我们将yearmonthyear分开成列:

df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day

接下来,我们根据daymonthyear重新排序行,并相应地重新索引DataFrame,如下所示:

df = df.sort_values(by=['day', 'month', 'year']).reset_index(drop=True)

最后,我们可以使用设置为5的rolling方法来获得每列的期望结果:

df['A_Avg'] = df.groupby(['day', 'month'], as_index=False)['A'].rolling(5).mean()['A']
df['B_Avg'] = df.groupby(['day', 'month'], as_index=False)['B'].rolling(5).mean()['B']
df['C_Avg'] = df.groupby(['day', 'month'], as_index=False)['C'].rolling(5).mean()['C']
df['D_Avg'] = df.groupby(['day', 'month'], as_index=False)['D'].rolling(5).mean()['D']
df['E_Avg'] = df.groupby(['day', 'month'], as_index=False)['E'].rolling(5).mean()['E']
英文:

From your DataFrame, we can start by convert the Date column as Datetime :

df[&#39;Date&#39;] = pd.to_datetime(df[&#39;Date&#39;], format=&quot;%d/%m/%Y&quot;)

Then we separate the year, month and year in columns :

df[&#39;year&#39;] = df[&#39;Date&#39;].dt.year
df[&#39;month&#39;] = df[&#39;Date&#39;].dt.month
df[&#39;day&#39;] = df[&#39;Date&#39;].dt.day

Next step, we reorder the row depending on day, month and year and reindex the DataFrame accordingly like so :

df = df.sort_values(by=[&#39;day&#39;, &#39;month&#39;, &#39;year&#39;]).reset_index(drop=True)

To finish, we can use the rolling method setup on 5 to get the expected result on each columns :

df[&#39;A_Avg&#39;] = df.groupby([&#39;day&#39;, &#39;month&#39;], as_index=False)[&#39;A&#39;].rolling(5).mean()[&#39;A&#39;]
df[&#39;B_Avg&#39;] = df.groupby([&#39;day&#39;, &#39;month&#39;], as_index=False)[&#39;B&#39;].rolling(5).mean()[&#39;B&#39;]
df[&#39;C_Avg&#39;] = df.groupby([&#39;day&#39;, &#39;month&#39;], as_index=False)[&#39;C&#39;].rolling(5).mean()[&#39;C&#39;]
df[&#39;D_Avg&#39;] = df.groupby([&#39;day&#39;, &#39;month&#39;], as_index=False)[&#39;D&#39;].rolling(5).mean()[&#39;D&#39;]
df[&#39;E_Avg&#39;] = df.groupby([&#39;day&#39;, &#39;month&#39;], as_index=False)[&#39;E&#39;].rolling(5).mean()[&#39;E&#39;]

答案2

得分: 1

这听起来你想要 rolling

mean_5y_prev = df.groupby([df.index.month, df.index.day]).rolling(5).mean()

然后你可以与 df 进行连接:

df.join(mean_5y_prev.droplevel([0, 1]), rsuffix='_Avg')
英文:

It sounds like you want rolling.

mean_5y_prev = df.groupby([df.index.month, df.index.day]).rolling(5).mean()

Then you can join back to df:

df.join(mean_5y_prev.droplevel([0, 1]), rsuffix=&#39;_Avg&#39;)

答案3

得分: 0

创建数据框。

df = pd.DataFrame({
    'Date': ['1/1/2023', '1/2/2023', '1/3/2023', '1/1/2022', '1/2/2022', '1/3/2022', '1/1/2015', '1/2/2015', '1/3/2015'],
    'A': [100, 105, 110, 100, 105, 110, 100, 105, 110],
    'B': [200, 205, 210, 200, 205, 210, 200, 205, 210],
    'C': [300, 305, 310, 300, 305, 310, 300, 305, 310],
    'D': [400, 405, 410, 400, 405, 410, 400, 405, 410],
    'E': [500, 505, 510, 500, 505, 510, 500, 505, 510],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day

获取可以用来分割数据框的年份列表。

bins = [y for y in range(df['year'].max(), df['year'].min() - 5, -5)]
bins.reverse()
# [2013, 2018, 2023]

计算分组的均值。

df['year'] = pd.cut(df['year'], bins)
df = df.groupby(['year', 'month', 'day'], as_index=False).mean(numeric_only=True)

输出:

           year  month  day      A      B      C      D      E
0  (2013, 2018]      1    1  100.0  200.0  300.0  400.0  500.0
1  (2013, 2018]      1    2  105.0  205.0  305.0  405.0  505.0
2  (2013, 2018]      1    3  110.0  210.0  310.0  410.0  510.0
3  (2018, 2023]      1    1  100.0  200.0  300.0  400.0  500.0
4  (2018, 2023]      1    2  105.0  205.0  305.0  405.0  505.0
5  (2018, 2023]      1    3  110.0  210.0  310.0  410.0  510.0
英文:

Create dataframe.

df = pd.DataFrame({
    &#39;Date&#39;: [&#39;1/1/2023&#39;, &#39;1/2/2023&#39;, &#39;1/3/2023&#39;, &#39;1/1/2022&#39;, &#39;1/2/2022&#39;, &#39;1/3/2022&#39;, &#39;1/1/2015&#39;, &#39;1/2/2015&#39;, &#39;1/3/2015&#39;],
    &#39;A&#39;: [100, 105, 110, 100, 105, 110, 100, 105, 110],
    &#39;B&#39;: [200, 205, 210, 200, 205, 210, 200, 205, 210],
    &#39;C&#39;: [300, 305, 310, 300, 305, 310, 300, 305, 310],
    &#39;D&#39;: [400, 405, 410, 400, 405, 410, 400, 405, 410],
    &#39;E&#39;: [500, 505, 510, 500, 505, 510, 500, 505, 510],
})
df[&#39;Date&#39;] = pd.to_datetime(df[&#39;Date&#39;], format=&#39;%m/%d/%Y&#39;)
df[&#39;year&#39;] = df[&#39;Date&#39;].dt.year
df[&#39;month&#39;] = df[&#39;Date&#39;].dt.month
df[&#39;day&#39;] = df[&#39;Date&#39;].dt.day

Get a list of years which can be used to cut the dataframe.

bins = [y for y in range(df[&#39;year&#39;].max(), df[&#39;year&#39;].min() - 5, -5)]
bins.reverse()
# [2013, 2018, 2023]

Calculate the mean of groups.

df[&#39;year&#39;] = pd.cut(df[&#39;year&#39;], bins)
df = df.groupby([&#39;year&#39;, &#39;month&#39;, &#39;day&#39;], as_index=False).mean(numeric_only=True)

output:

           year  month  day      A      B      C      D      E
0  (2013, 2018]      1    1  100.0  200.0  300.0  400.0  500.0
1  (2013, 2018]      1    2  105.0  205.0  305.0  405.0  505.0
2  (2013, 2018]      1    3  110.0  210.0  310.0  410.0  510.0
3  (2018, 2023]      1    1  100.0  200.0  300.0  400.0  500.0
4  (2018, 2023]      1    2  105.0  205.0  305.0  405.0  505.0
5  (2018, 2023]      1    3  110.0  210.0  310.0  410.0  510.0

huangapple
  • 本文由 发表于 2023年8月4日 08:15:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76832262.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定