How to interpolate missing values in monthly-frequency sample data with `interp1d(x, y)` from scipy

Question

I have created a monthly sample DataFrame named `data` in which some months have missing values, and I would like to fill them in with `interp1d()`. I tried the code below, but the interpolated column is still entirely NaN, and I can't tell where the problem lies. How should I modify the code? Many thanks.

    import pandas as pd
    import numpy as np
    from scipy.interpolate import interp1d
    # Create an example DataFrame
    data = pd.DataFrame({
        'value': [1.0, 1.2, np.nan, 1.4, 1.6, np.nan, 1.8, 2.0, np.nan, 2.2, 2.4, np.nan]
    }, index=pd.date_range('2000-01-01', periods=12, freq='M'))
    # Convert the index to a DateTimeIndex
    data.index = pd.to_datetime(data.index)
    # Convert the DateTimeIndex to a PeriodIndex with monthly frequency
    x = data.index.to_period('M')
    # Convert the period index to integers
    x = x.astype(int)
    # Convert the 'value' column to a numpy array
    y = data['value'].values
    # Create the interpolation function
    f = interp1d(x, y, kind='linear', fill_value="extrapolate")
    # Create a boolean mask that selects the missing values in the 'value' column
    mask = np.isnan(data['value'])
    # Create an array with the 'x' values where 'y' is missing
    x_new = pd.date_range(start=data.index.min(), end=data.index.max(), freq='M')[mask]
    # Convert the 'x_new' values to dates with monthly frequency
    x_new_dates = pd.date_range(start=x_new.min(), end=x_new.max(), freq='M')
    # Interpolate the missing 'y' values
    y_new = f(x_new_dates.astype(int))
    # Create a new column 'value_interpolated' and
    # insert the interpolated 'y' values into it
    data.loc[x_new_dates, 'value_interpolated'] = y_new
    # Print the DataFrame
    print(data)

Out:

                value  value_interpolated
    2000-01-31    1.0                 NaN
    2000-02-29    1.2                 NaN
    2000-03-31    NaN                 NaN
    2000-04-30    1.4                 NaN
    2000-05-31    1.6                 NaN
    2000-06-30    NaN                 NaN
    2000-07-31    1.8                 NaN
    2000-08-31    2.0                 NaN
    2000-09-30    NaN                 NaN
    2000-10-31    2.2                 NaN
    2000-11-30    2.4                 NaN
    2000-12-31    NaN                 NaN

Answer 1

Score: 1

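You can interpolate the values by converting the dates to the number of seconds elapsed since some reference time (below I use the first date), as shown in this answer. The interpolator is built only from the months that have a value and is then evaluated at every date. I can't guarantee the accuracy of these results, since a lot of the data is missing and has to be interpolated.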

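    import pandas as pd
    import numpy as np
    from scipy.interpolate import interp1d

    # Note: recent pandas versions may warn that freq="M" is deprecated in favor of "ME"
    data = pd.DataFrame({
        "value": [1.0, 1.2, np.nan, 1.4, 1.6, np.nan, 1.8, 2.0, np.nan, 2.2, 2.4, np.nan]
    }, index=pd.date_range("2000-01-01", periods=12, freq="M"))
    data.index = pd.to_datetime(data.index)

    mask = ~np.isnan(data["value"])  # True where a value is present (missing values are masked out)
    dref = data.index[0]  # reference time: the first date

    # x: seconds elapsed since dref, restricted to the non-missing samples
    x = (data.index - dref).total_seconds()[mask]
    y = data["value"][mask].to_numpy()

    # Build the interpolator from the known points only; extrapolate past the last one
    f = interp1d(x, y, fill_value="extrapolate")

    # Evaluate at every date, including the missing ones
    y_new = f((data.index - dref).total_seconds())
    data["value_interpolated"] = y_new
    print(data)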

Out:

                value  value_interpolated
    2000-01-31    1.0            1.000000
    2000-02-29    1.2            1.200000
    2000-03-31    NaN            1.301639
    2000-04-30    1.4            1.400000
    2000-05-31    1.6            1.600000
    2000-06-30    NaN            1.698361
    2000-07-31    1.8            1.800000
    2000-08-31    2.0            2.000000
    2000-09-30    NaN            2.098361
    2000-10-31    2.2            2.200000
    2000-11-30    2.4            2.400000
    2000-12-31    NaN            2.606667
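
As a side note on why removing the missing values before calling `interp1d` seems to matter: `interp1d` does not skip NaNs in `y`, so an interpolator built on the raw column returns NaN at least at the missing positions. The minimal sketch below is only an illustration under stated assumptions. It uses a made-up month counter 0..5 as `x` rather than the dates from the question, and the names `f_raw` and `f_masked` are hypothetical. It compares an interpolator built on the raw data with one built only on the observed samples.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical toy data (assumption): x is a plain month counter, y has gaps.
x = np.arange(6, dtype=float)
y = np.array([1.0, 1.2, np.nan, 1.4, 1.6, np.nan])

# Interpolator built on the raw data: the NaNs in y propagate into the result.
f_raw = interp1d(x, y, kind="linear", fill_value="extrapolate")
raw = f_raw(x)

# Interpolator built on the observed samples only, evaluated at every x.
mask = ~np.isnan(y)
f_masked = interp1d(x[mask], y[mask], kind="linear", fill_value="extrapolate")
filled = f_masked(x)

print("raw:   ", raw)     # still NaN where y was missing
print("filled:", filled)  # gaps filled from the neighbouring known points
```

In this toy case the gap at `x = 2` is filled with roughly 1.3, halfway between its neighbours 1.2 and 1.4, and the trailing gap is extrapolated from the last two known points. The query points also have to be in the same units as the `x` used to build the interpolator, which is why the answer converts every date to seconds from the same reference date.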

huangapple
  • Posted on June 18, 2023 at 19:52:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76500405.html