英文:
Using python, how to fill in missing dates and data in two columns
问题
我有一个按月/日-时间排序的时间序列,后面跟着数值。由于设备故障,有些时间点的数据缺失。我想要替换这些时间点(例如下面的"21:00","01:00"),并且插值得到与它们相关的缺失数值。有什么好的方法可以做到这一点?
数据看起来像这样:
03/31 19:00 68.0
03/31 20:00 68.0
03/31 22:00 70.0
03/31 23:00 68.0
04/01 00:00 69.0
04/01 02:00 70.0
"04/01 00:00"
的数值是字符串,而观测值是浮点数。
我使用以下代码将字符串日期转换为数字:
date_number=datetime.strptime(col_1[i],'%m/%d %H:%M')
这会得到"1900-03-31 19:00:00"
作为结果。我可以对这些数值进行运算,找到缺失的时间点,将它们填充并在另一列的相同位置放入NaN值,然后对这些缺失的数值进行插值。我确信有一种更高效、标准的方法来解决这个问题,我想知道如何最好地做到这一点。
英文:
I have a time series of month/day-time followed by values. With equipment failure, some times are missing. I want to replace those times (e.g. 21:00, 01:00 below) and interpolate the missing values associated with them. What is a good way to do this?
The data looks like:
03/31 19:00 68.0
03/31 20:00 68.0
03/31 22:00 70.0
03/31 23:00 68.0
04/01 00:00 69.0
04/01 02:00 70.0
The "04/01 00:00"
values are strings and the observations are floats.
I converted the string dates to numbers using:
date_number=datetime.strptime(col_1[i],'%m/%d %H:%M')
which yields "1900-03-31 19:00:00"
as the result. I can do arithmetic on those, find the gaps, fill them in and put nans in the same place in the other column then interpolate those missing values. I'm sure there is a more efficient, standard approach to the problem, and I'd like to know how to best do it.
答案1
得分: 1
假设有以下的数据框:
>>> df
日期 值
0 03/31 19:00 68.0
1 03/31 20:00 68.0
2 03/31 22:00 70.0
3 03/31 23:00 68.0
4 04/01 00:00 69.0
5 04/01 02:00 70.0
你可以创建一个以日期
为索引的Series
,真正用于处理时间序列的操作:
df['日期'] = pd.to_datetime('2023/' + df['日期'], format='%Y/%m/%d %H:%M')
ts = df.set_index('日期')['值'].resample('H').interpolate()
输出:
>>> ts
日期
2023-03-31 19:00:00 68.0
2023-03-31 20:00:00 68.0
2023-03-31 21:00:00 69.0 # <- 在这里
2023-03-31 22:00:00 70.0
2023-03-31 23:00:00 68.0
2023-04-01 00:00:00 69.0
2023-04-01 01:00:00 69.5 # <- 在这里
2023-04-01 02:00:00 70.0
Freq: H, Name: 值, dtype: float64
英文:
Suppose the following dataframe:
>>> df
Date Value
0 03/31 19:00 68.0
1 03/31 20:00 68.0
2 03/31 22:00 70.0
3 03/31 23:00 68.0
4 04/01 00:00 69.0
5 04/01 02:00 70.0
You can create a Series
indexed by Date
to really work on TimeSeries:
df['Date'] = pd.to_datetime('2023/' + df['Date'], format='%Y/%m/%d %H:%M')
ts = df.set_index('Date')['Value'].resample('H').interpolate()
Output:
>>> ts
Date
2023-03-31 19:00:00 68.0
2023-03-31 20:00:00 68.0
2023-03-31 21:00:00 69.0 # <- HERE
2023-03-31 22:00:00 70.0
2023-03-31 23:00:00 68.0
2023-04-01 00:00:00 69.0
2023-04-01 01:00:00 69.5 # <- HERE
2023-04-01 02:00:00 70.0
Freq: H, Name: Value, dtype: float64
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论