Can pandas linear interpolation capture seasonality patterns?

Question

I have a time series dataset with 29,184 hourly observations. More than 1,000 rows have missing values.

Here is the link to the dataset: dropbox

Here is a small preview of the dataset:

                     NUMBER_OF_VEHICLES
DATE_TIME                              
2020-01-01 00:00:00                39.0
2020-01-01 01:00:00                 3.0
2020-01-01 02:00:00                 1.0
2020-01-01 03:00:00                 1.0
2020-01-01 04:00:00                 2.0

Here is a section of a plot that shows missing values:
[Plot: a section of the series showing the missing values]

The column NUMBER_OF_VEHICLES has both hourly and weekly seasonality. I am trying to fill in the missing values in a way that preserves both patterns. Here are the two approaches I tried:

Approach 1

# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index

# Create HOUR and DAYOFWEEK features from datetime index
df = (
    df.assign(HOUR=lambda x: x.index.hour,
              DAYOFWEEK=lambda x: x.index.dayofweek)
)

# Linear interpolation
df['lin_impute'] = df['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")

Here is the plot after linear interpolation. The timestamps that had missing values are shown in red.
[Plot: series after approach 1, with the imputed points in red]

The above approach does not capture the seasonality patterns. The missing data is filled in linearly.
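
The straight-line behaviour can be reproduced on a toy series. The sketch below is an illustration only: the 24-hour sine cycle, the series length and the position of the gap are assumptions, not taken from the dataset above. It removes one full day and fills it with the same interpolate call used in approach 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: 4 days of hourly data with a clean 24-hour cycle.
idx = pd.date_range("2020-01-01", periods=96, freq=pd.Timedelta(hours=1))
hours = np.arange(96)
s = pd.Series(50 + 40 * np.sin(2 * np.pi * hours / 24), index=idx,
              name="NUMBER_OF_VEHICLES")

# Knock out one full day (positions 36..59) to mimic a long gap.
gappy = s.copy()
gappy.iloc[36:60] = np.nan

filled = gappy.interpolate("linear", limit_direction="both")

# From the last known value before the gap to the first known value after it,
# every step between consecutive filled values has the same size:
# the fill is a straight line and the daily cycle inside the gap is lost.
steps = filled.iloc[35:61].diff().dropna()
print(steps.round(6).nunique())

# Largest deviation from the true (synthetic) values inside the gap.
print((filled - s).abs().max())
```

With method "linear", pandas ignores the index and treats the values as equally spaced, so a gap of any length is bridged by a single line between its two end points.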

Approach 2

# Saving index of missing values
missing_ix = df[df['NUMBER_OF_VEHICLES'].isnull()].index

# Create HOUR and DAYOFWEEK features from datetime index
df = (
    df.assign(HOUR=lambda x: x.index.hour,
              DAYOFWEEK=lambda x: x.index.dayofweek)
)

# Create a separate data frame with the mean number of vehicles per hour in a day.
hr_pattern = (
    df
    .groupby('HOUR', as_index=False)['NUMBER_OF_VEHICLES']
    .mean()
    .rename(columns={"NUMBER_OF_VEHICLES": "hr_pattern"})
)

# Create a separate data frame with the mean number of vehicles per day of week.
week_pattern = (
    df
    .groupby('DAYOFWEEK', as_index=False)['NUMBER_OF_VEHICLES']
    .mean()
    .rename(columns={"NUMBER_OF_VEHICLES": "week_pattern"})
)

# Merge with the main dataset
df_merged = (
    df.reset_index()
    .merge(hr_pattern, on='HOUR', how='inner')
    .merge(week_pattern, on='DAYOFWEEK', how='inner')
    .set_index('DATE_TIME')
)

# Linear interpolation
df_merged['lin_impute'] = df_merged['NUMBER_OF_VEHICLES'].interpolate("linear", limit_direction="both")

Now, here is the plot after approach 2:
[Plot: series after approach 2, with the imputed points in red]

The above plot shows that the missing values have been filled in a way that shows the hourly and weekly patterns in the data.
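
A small check can show what the two merges actually change. The sketch below uses a synthetic frame (random counts over three weeks with a 30-hour gap, all assumptions rather than the dataset above) and repeats the merge steps of approach 2. It then compares the number of missing values and the row order before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 3 weeks of hourly counts with a gap.
idx = pd.date_range("2020-01-01", periods=24 * 21,
                    freq=pd.Timedelta(hours=1), name="DATE_TIME")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"NUMBER_OF_VEHICLES": rng.integers(0, 100, len(idx)).astype(float)},
    index=idx,
)
df.iloc[100:130, 0] = np.nan

df = df.assign(HOUR=lambda x: x.index.hour,
               DAYOFWEEK=lambda x: x.index.dayofweek)

hr_pattern = (
    df.groupby("HOUR", as_index=False)["NUMBER_OF_VEHICLES"].mean()
    .rename(columns={"NUMBER_OF_VEHICLES": "hr_pattern"})
)
week_pattern = (
    df.groupby("DAYOFWEEK", as_index=False)["NUMBER_OF_VEHICLES"].mean()
    .rename(columns={"NUMBER_OF_VEHICLES": "week_pattern"})
)

df_merged = (
    df.reset_index()
    .merge(hr_pattern, on="HOUR", how="inner")
    .merge(week_pattern, on="DAYOFWEEK", how="inner")
    .set_index("DATE_TIME")
)

# The merges add two columns; the target column keeps all of its NaNs.
print(df["NUMBER_OF_VEHICLES"].isna().sum(),
      df_merged["NUMBER_OF_VEHICLES"].isna().sum())

# Whether the rows are still in time order can depend on the pandas version.
print(df_merged.index.is_monotonic_increasing)

# Sorting by the index restores time order before any interpolation.
df_sorted = df_merged.sort_index()
print(df_sorted.index.is_monotonic_increasing)
```

If the second printed value is False on your installation, the interpolation in approach 2 ran over rows that were not in chronological order.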

My question is: how did the second approach lead to this result, where the linearly interpolated series follows both the hourly and weekly patterns in the data instead of filling the missing values along a straight line?

Please help me to understand. Thank you so much for your help.

Answer 1

Score: 1

The short answer is no.

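In the second case, two extra data frames are created with the mean values per hour and per day of the week, and they are joined to the main dataset. The joins only add the hr_pattern and week_pattern columns. They do not write anything into NUMBER_OF_VEHICLES, and lin_impute is still computed from NUMBER_OF_VEHICLES alone. What the joins can change is the row order: depending on the pandas version, merge may return the rows grouped by the join key (here HOUR, then DAYOFWEEK) rather than in the original time order.

In other words, it only seems to capture the seasonality. interpolate("linear") ignores the index and fills each missing value from the rows directly before and after it in the frame. If the merged frame is no longer in time order, those neighbours can be observations from the same hour and day of the week in other weeks, so the filled values follow the hourly and weekly pattern once they are plotted against DATE_TIME. Linear interpolation works well when the values adjacent to a missing value are not missing and are similar to it.
In the first case the adjacent rows are the last value before the gap and the first value after it. That's why it's a straight line from the left adjacent value to the right one.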
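
The effect of row order on a positional interpolation can be sketched on synthetic data. Everything in the block is an assumption made for illustration: the toy series (a daily sine cycle with a weekend level shift), the one-day gap, and the explicit sort_values call. The sort stands in for whatever reordering a merge may or may not produce on a given pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 weeks of hourly values, starting on a Monday,
# with a daily cycle and lower values at weekends.
idx = pd.date_range("2020-01-06", periods=24 * 28,
                    freq=pd.Timedelta(hours=1), name="DATE_TIME")
hour = idx.hour.to_numpy()
dow = idx.dayofweek.to_numpy()
truth = pd.Series(50 + 40 * np.sin(2 * np.pi * hour / 24) - 20 * (dow >= 5),
                  index=idx)

df = pd.DataFrame({"NUMBER_OF_VEHICLES": truth, "HOUR": hour, "DAYOFWEEK": dow})
# Remove one full day (the Tuesday of the second week).
df.loc[idx[192:216], "NUMBER_OF_VEHICLES"] = np.nan

# Rows in time order: the gap is bridged by one straight line.
in_time_order = df["NUMBER_OF_VEHICLES"].interpolate("linear", limit_direction="both")

# Rows grouped by weekday, then hour, then date: the rows next to a missing
# value are now the same weekday/hour slot in the previous and the next week.
regrouped = (
    df.reset_index()
    .sort_values(["DAYOFWEEK", "HOUR", "DATE_TIME"])
    .set_index("DATE_TIME")
)
by_slot = (
    regrouped["NUMBER_OF_VEHICLES"]
    .interpolate("linear", limit_direction="both")
    .sort_index()  # back to time order for plotting
)

# Largest absolute error against the known synthetic values.
err_time = (in_time_order - truth).abs().max()
err_slot = (by_slot - truth).abs().max()
print(err_time, err_slot)
```

On this toy series the regrouped fill reproduces the missing day almost exactly, because each slot repeats from week to week, while the time-ordered fill misses the daily peak. On real data the regrouped fill averages the same slot from the neighbouring weeks, which looks seasonal but is a side effect of row order and not a property of linear interpolation.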

huangapple
  • Posted on June 12, 2023 at 19:56:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76456424.html