How to reverse year over year change to fill the nan values?
Question
I have a dataframe whose `Value Col` column ends on 2022-12-31:
Value Col Factor
2022-01-31 0.021 5%
2022-02-28 0.020 4%
2022-03-31 0.019 3%
2022-04-30 0.018 2%
2022-05-31 0.017 9%
2022-06-30 0.016 7%
2022-07-31 0.015 7%
2022-08-31 0.014 5%
2022-09-30 0.013 -6%
2022-10-31 0.018 4%
2022-11-30 0.020 -8%
2022-12-31 0.015 7%
2023-01-31 NaN 5%
2023-02-28 NaN 4%
2023-03-31 NaN 3%
2023-04-30 NaN 4%
2023-05-31 NaN 9%
2023-06-30 NaN -6%
2023-07-31 NaN 7%
2023-08-31 NaN 5%
2023-09-30 NaN 6%
2023-10-31 NaN -4%
2023-11-30 NaN 2%
2023-12-31 NaN 1%
2024-01-31 NaN 5%
2024-02-28 NaN 4%
2024-03-31 NaN 6%
2024-04-30 NaN 2%
2024-05-31 NaN -9%
2024-06-30 NaN 8%
2024-07-31 NaN 6%
2024-08-31 NaN -7%
2024-09-30 NaN 6%
2024-10-31 NaN 4%
2024-11-30 NaN 2%
2024-12-31 NaN -1%
There is also a `Factor` column giving the percentage by which each NaN value should change relative to the value for the same month of the previous year.
For example, `df.loc['2023-04-30', 'Value Col']` should be 0.01872: the value on 2022-04-30 is 0.018 and the factor on 2023-04-30 is 4%, so 0.018 + 0.018*4% = 0.01872.
This looks to me like the reverse of the pandas `pct_change()` function, but I could not figure out how to do it. Any hint or suggestion will be appreciated.
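To make the target rule concrete: each missing value is the value twelve months earlier multiplied by (1 + factor). The snippet below is only a minimal sketch of that arithmetic for the April example; the two-row frame is a hypothetical cut-down of the table above, not the full data.

```python
import pandas as pd

# Hypothetical two-row excerpt of the question's frame (April 2022 and April 2023 only)
df = pd.DataFrame(
    {"Value Col": [0.018, float("nan")], "Factor": ["2%", "4%"]},
    index=pd.to_datetime(["2022-04-30", "2023-04-30"]),
)

# '4%' -> 0.04
rate = float(df.loc[pd.Timestamp("2023-04-30"), "Factor"].rstrip("%")) / 100
# value one year earlier, grown by the factor
expected = df.loc[pd.Timestamp("2022-04-30"), "Value Col"] * (1 + rate)
print(round(expected, 5))  # 0.01872
```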
Answer 1
Score: 1
<details>
<summary>English:</summary>
You can use [`shift`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html) with [`pandas.DateOffset`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.DateOffset.html) and repeat the operation for each year:
import pandas as pd

# ensure we have a datetime index
df.index = pd.to_datetime(df.index)
# convert the factor to a float multiplier: '4%' -> 1.04
factor = df['Factor'].str.rstrip('%').astype(float).div(100).add(1)
# one pass per year after the first: fill each NaN from the value one year earlier
for _ in range(len(df.index.year.unique()) - 1):
    df.loc[df['Value Col'].isna(), 'Value Col'] = (
        df['Value Col']
        .shift(freq=pd.DateOffset(years=1))
        .mul(factor)
    )
Output:
Value Col Factor
2022-01-31 0.021000 5%
2022-02-28 0.020000 4%
2022-03-31 0.019000 3%
2022-04-30 0.018000 2%
2022-05-31 0.017000 9%
2022-06-30 0.016000 7%
2022-07-31 0.015000 7%
2022-08-31 0.014000 5%
2022-09-30 0.013000 -6%
2022-10-31 0.018000 4%
2022-11-30 0.020000 -8%
2022-12-31 0.015000 7%
2023-01-31 0.022050 5%
2023-02-28 0.020800 4%
2023-03-31 0.019570 3%
2023-04-30 0.018720 4%
2023-05-31 0.018530 9%
2023-06-30 0.015040 -6%
2023-07-31 0.016050 7%
2023-08-31 0.014700 5%
2023-09-30 0.013780 6%
2023-10-31 0.017280 -4%
2023-11-30 0.020400 2%
2023-12-31 0.015150 1%
2024-01-31 0.023153 5%
2024-02-28 0.021632 4%
2024-03-31 0.020744 6%
2024-04-30 0.019094 2%
2024-05-31 0.016862 -9%
2024-06-30 0.016243 8%
2024-07-31 0.017013 6%
2024-08-31 0.013671 -7%
2024-09-30 0.014607 6%
2024-10-31 0.017971 4%
2024-11-30 0.020808 2%
2024-12-31 0.014998 -1%
#### Alternative
You should also be able to compute the cumulative product of the factors ([`groupby.cumprod`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cumprod.html)) per calendar month and multiply it by the forward-filled values ([`groupby.ffill`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.ffill.html)):
group = df.index.month
factor = (
    df['Factor'].str.rstrip('%').astype(float).div(100).add(1)
    .where(df['Value Col'].isna())
    .groupby(group).cumprod()
)
df.loc[df['Value Col'].isna(), 'Value Col'] = (
    df['Value Col'].groupby(group).ffill()
    .mul(factor)
)
*NB. this requires the dates to be sorted in ascending order.*
Intermediates:
Value Col Factor group initial factor
2022-01-31 0.021000 5% 1 0.021 NaN
2022-02-28 0.020000 4% 2 0.020 NaN
2022-03-31 0.019000 3% 3 0.019 NaN
2022-04-30 0.018000 2% 4 0.018 NaN
2022-05-31 0.017000 9% 5 0.017 NaN
2022-06-30 0.016000 7% 6 0.016 NaN
2022-07-31 0.015000 7% 7 0.015 NaN
2022-08-31 0.014000 5% 8 0.014 NaN
2022-09-30 0.013000 -6% 9 0.013 NaN
2022-10-31 0.018000 4% 10 0.018 NaN
2022-11-30 0.020000 -8% 11 0.020 NaN
2022-12-31 0.015000 7% 12 0.015 NaN
2023-01-31 0.022050 5% 1 0.021 1.0500
2023-02-28 0.020800 4% 2 0.020 1.0400
2023-03-31 0.019570 3% 3 0.019 1.0300
2023-04-30 0.018720 4% 4 0.018 1.0400
2023-05-31 0.018530 9% 5 0.017 1.0900
2023-06-30 0.015040 -6% 6 0.016 0.9400
2023-07-31 0.016050 7% 7 0.015 1.0700
2023-08-31 0.014700 5% 8 0.014 1.0500
2023-09-30 0.013780 6% 9 0.013 1.0600
2023-10-31 0.017280 -4% 10 0.018 0.9600
2023-11-30 0.020400 2% 11 0.020 1.0200
2023-12-31 0.015150 1% 12 0.015 1.0100
2024-01-31 0.023153 5% 1 0.021 1.1025
2024-02-28 0.021632 4% 2 0.020 1.0816
2024-03-31 0.020744 6% 3 0.019 1.0918
2024-04-30 0.019094 2% 4 0.018 1.0608
2024-05-31 0.016862 -9% 5 0.017 0.9919
2024-06-30 0.016243 8% 6 0.016 1.0152
2024-07-31 0.017013 6% 7 0.015 1.1342
2024-08-31 0.013671 -7% 8 0.014 0.9765
2024-09-30 0.014607 6% 9 0.013 1.1236
2024-10-31 0.017971 4% 10 0.018 0.9984
2024-11-30 0.020808 2% 11 0.020 1.0404
2024-12-31 0.014998 -1% 12 0.015 0.9999
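A self-contained sketch of the grouped approach may help when trying it out. Everything about the data here is an assumption made for illustration (a constant 10% factor, a first year of 1.0, and an index built from true month ends), not the question's frame. It also shows one thing worth checking before using the `shift` variant: `pd.DateOffset(years=1)` keeps the day of the month, so 2023-02-28 maps to 2024-02-28, which is absent from a true month-end index in a leap year. The question's table labels that row 2024-02-28, so the issue does not arise there.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three years of true month ends, one known year, constant 10% factor
idx = pd.date_range("2022-01-31", periods=36, freq=pd.offsets.MonthEnd())
df = pd.DataFrame({"Value Col": np.nan, "Factor": "10%"}, index=idx)
df.loc[df.index.year == 2022, "Value Col"] = 1.0

df = df.sort_index()  # cumprod and ffill both rely on ascending dates
group = df.index.month
missing = df["Value Col"].isna()

# cumulative multiplier per calendar month, defined only on the missing rows
factor = (
    df["Factor"].str.rstrip("%").astype(float).div(100).add(1)
    .where(missing)
    .groupby(group).cumprod()
)
# last known value per calendar month, scaled by the cumulative multiplier
df.loc[missing, "Value Col"] = df["Value Col"].groupby(group).ffill().mul(factor)

feb_2024 = df.loc[pd.Timestamp("2024-02-29"), "Value Col"]
print(feb_2024)

# A one-year DateOffset keeps the day of month, so it misses the leap-day month end
shifted = pd.Timestamp("2023-02-28") + pd.DateOffset(years=1)
print(shifted.date(), shifted in df.index)
```

Because the grouping key is the calendar month rather than an exact date, the grouped version does not depend on the day labels lining up from year to year.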
</details>
# Answer 2
**Score**: 1
You can use a loop to fill the missing values of each following year:
```python
import pandas as pd

# assumes df.index is already a DatetimeIndex
# convert the factor to a fraction: '4%' -> 0.04
df['f'] = df['Factor'].str.extract(r'(-*\d+)').astype(int).div(100)
years = df.index.year
for y in years.unique():
    # original year
    m = years == y
    # next year
    m1 = years == y + 1
    # move the original year's index forward by one year
    df1 = df[m].rename(lambda x: x + pd.offsets.DateOffset(years=1))
    # compute the new values and fill the NaNs
    s = df1['Value Col'].add(df.loc[m1, 'f'].mul(df1['Value Col']))
    df.loc[m1, 'Value Col'] = df.loc[m1, 'Value Col'].fillna(s)
print(df)
```
Value Col Factor f
2022-01-31 0.021000 5% 0.05
2022-02-28 0.020000 4% 0.04
2022-03-31 0.019000 3% 0.03
2022-04-30 0.018000 2% 0.02
2022-05-31 0.017000 9% 0.09
2022-06-30 0.016000 7% 0.07
2022-07-31 0.015000 7% 0.07
2022-08-31 0.014000 5% 0.05
2022-09-30 0.013000 -6% -0.06
2022-10-31 0.018000 4% 0.04
2022-11-30 0.020000 -8% -0.08
2022-12-31 0.015000 7% 0.07
2023-01-31 0.022050 5% 0.05
2023-02-28 0.020800 4% 0.04
2023-03-31 0.019570 3% 0.03
2023-04-30 0.018720 4% 0.04
2023-05-31 0.018530 9% 0.09
2023-06-30 0.015040 -6% -0.06
2023-07-31 0.016050 7% 0.07
2023-08-31 0.014700 5% 0.05
2023-09-30 0.013780 6% 0.06
2023-10-31 0.017280 -4% -0.04
2023-11-30 0.020400 2% 0.02
2023-12-31 0.015150 1% 0.01
2024-01-31 0.023152 5% 0.05
2024-02-28 0.021632 4% 0.04
2024-03-31 0.020744 6% 0.06
2024-04-30 0.019094 2% 0.02
2024-05-31 0.016862 -9% -0.09
2024-06-30 0.016243 8% 0.08
2024-07-31 0.017013 6% 0.06
2024-08-31 0.013671 -7% -0.07
2024-09-30 0.014607 6% 0.06
2024-10-31 0.017971 4% 0.04
2024-11-30 0.020808 2% 0.02
2024-12-31 0.014998 -1% -0.01
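Since the question frames this as the reverse of `pct_change()`, one way to sanity-check a filled column is to run `pct_change(periods=12)` on it and compare with the fractional factors. The sketch below uses made-up values and rates (all names and numbers are assumptions for illustration) rather than the frame above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-year monthly series; the second year is built as value * (1 + rate)
idx = pd.date_range("2022-01-31", periods=24, freq=pd.offsets.MonthEnd())
rates = pd.Series(np.tile([0.05, -0.06, 0.04], 8), index=idx)  # fractional factors
values = pd.Series(np.linspace(0.01, 0.02, 24), index=idx)
values.iloc[12:] = values.iloc[:12].to_numpy() * (1 + rates.iloc[12:].to_numpy())

# pct_change over 12 rows compares each month with the same month one year earlier
recovered = values.pct_change(periods=12)
print(np.allclose(recovered.iloc[12:], rates.iloc[12:]))  # True
```

This relies on the rows being contiguous months in ascending order, so that 12 rows back is always the same month of the previous year.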