How to reverse year-over-year change to fill the NaN values?

Question


I have a DataFrame whose `Value Col` column has values only up to 2022-12-31; every later row is NaN.

              Value Col    Factor
    2022-01-31	0.021       5%
    2022-02-28	0.020       4%
    2022-03-31	0.019       3%
    2022-04-30	0.018       2%
    2022-05-31	0.017       9%
    2022-06-30	0.016       7%
    2022-07-31	0.015       7%
    2022-08-31	0.014       5%
    2022-09-30	0.013       -6%
    2022-10-31	0.018       4%
    2022-11-30	0.020       -8%
    2022-12-31	0.015       7%
    2023-01-31	NaN         5%
    2023-02-28	NaN         4%
    2023-03-31	NaN         3%
    2023-04-30	NaN         4%
    2023-05-31	NaN         9%
    2023-06-30	NaN         -6%
    2023-07-31	NaN         7%
    2023-08-31	NaN         5%
    2023-09-30	NaN         6%
    2023-10-31	NaN         -4%
    2023-11-30	NaN         2%
    2023-12-31	NaN         1%
    2024-01-31	NaN         5%
    2024-02-28	NaN         4%
    2024-03-31	NaN         6%
    2024-04-30	NaN         2%
    2024-05-31	NaN         -9%
    2024-06-30	NaN         8%
    2024-07-31	NaN         6%
    2024-08-31	NaN         -7%
    2024-09-30	NaN         6%
    2024-10-31	NaN         4%
    2024-11-30	NaN         2%
    2024-12-31	NaN         -1%

There is also a `Factor` column that gives, as a percentage, how much each missing value should change relative to the value for the same month of the previous year.

For example, `df.loc['2023-04-30', 'Value Col']` should be 0.01872: the value on 2022-04-30 is 0.018 and the factor on 2023-04-30 is 4%, so 0.018 + 0.018 * 4% = 0.01872.

This seems to me to be the reverse of pandas' `pct_change()` function, but I could not figure out how to implement it. Any hint or suggestion would be appreciated.
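
The link to `pct_change()` can be made concrete. On monthly data, `pct_change(periods=12)` computes `(v[t] - v[t-12]) / v[t-12]`, so the reverse step is `v[t] = v[t-12] * (1 + factor[t])`. Below is a minimal sketch of that round trip, assuming a gap-free monthly series. The data and the names `known`, `yoy`, and `rebuilt` are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data only: 24 monthly values (two full years), invented numbers.
known = pd.Series(np.linspace(1.0, 3.3, 24))

# Forward direction: year-over-year change; the first 12 entries are NaN.
yoy = known.pct_change(periods=12)

# Reverse direction: rebuild year 2 from year 1 and the year-over-year change.
rebuilt = known.copy()
rebuilt.iloc[12:] = known.iloc[:12].to_numpy() * (1 + yoy.iloc[12:].to_numpy())

# The rebuilt series matches the original up to floating-point error.
print(np.allclose(rebuilt, known))
```

When more than one future year is missing, the same step has to be applied year by year, because each filled year becomes the base for the next one.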

Answer 1

Score: 1


You can use [`shift`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html) with [`pandas.DateOffset`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.DateOffset.html) and repeat the operation for each year:

```python
import pandas as pd

# ensure we have a datetime index
df.index = pd.to_datetime(df.index)

# extract the factor as a float multiplier: 4% -> 1.04
factor = df['Factor'].str.rstrip('%').astype(float).div(100).add(1)

# for each year, compute the value from the preceding year
for _ in range(len(df.index.year.unique())-1):
    df.loc[df['Value Col'].isna(), 'Value Col'] = (
        df['Value Col']
        .shift(freq=pd.DateOffset(years=1))
        .mul(factor)
    )
```

Output:

```
            Value Col Factor
2022-01-31   0.021000     5%
2022-02-28   0.020000     4%
2022-03-31   0.019000     3%
2022-04-30   0.018000     2%
2022-05-31   0.017000     9%
2022-06-30   0.016000     7%
2022-07-31   0.015000     7%
2022-08-31   0.014000     5%
2022-09-30   0.013000    -6%
2022-10-31   0.018000     4%
2022-11-30   0.020000    -8%
2022-12-31   0.015000     7%
2023-01-31   0.022050     5%
2023-02-28   0.020800     4%
2023-03-31   0.019570     3%
2023-04-30   0.018720     4%
2023-05-31   0.018530     9%
2023-06-30   0.015040    -6%
2023-07-31   0.016050     7%
2023-08-31   0.014700     5%
2023-09-30   0.013780     6%
2023-10-31   0.017280    -4%
2023-11-30   0.020400     2%
2023-12-31   0.015150     1%
2024-01-31   0.023153     5%
2024-02-28   0.021632     4%
2024-03-31   0.020744     6%
2024-04-30   0.019094     2%
2024-05-31   0.016862    -9%
2024-06-30   0.016243     8%
2024-07-31   0.017013     6%
2024-08-31   0.013671    -7%
2024-09-30   0.014607     6%
2024-10-31   0.017971     4%
2024-11-30   0.020808     2%
2024-12-31   0.014998    -1%
```

#### Alternative

You should also be able to compute the cumulative product of the factors ([`groupby.cumprod`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cumprod.html)) and multiply it by the forward-filled values ([`groupby.ffill`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.ffill.html)):

```python
# group the rows by calendar month
group = df.index.month

# cumulative growth multiplier per month, only on the rows to fill
factor = (df['Factor'].str.rstrip('%').astype(float).div(100).add(1)
          .where(df['Value Col'].isna())
          .groupby(group).cumprod()
         )

# last known value for the same month, scaled by the cumulative factor
df.loc[df['Value Col'].isna(), 'Value Col'] = (
   df['Value Col'].groupby(group).ffill()
   .mul(factor)
)
```

*NB. This requires the dates to be sorted in ascending order.*

Intermediates:

```
            Value Col Factor  group  initial  factor
2022-01-31   0.021000     5%      1    0.021     NaN
2022-02-28   0.020000     4%      2    0.020     NaN
2022-03-31   0.019000     3%      3    0.019     NaN
2022-04-30   0.018000     2%      4    0.018     NaN
2022-05-31   0.017000     9%      5    0.017     NaN
2022-06-30   0.016000     7%      6    0.016     NaN
2022-07-31   0.015000     7%      7    0.015     NaN
2022-08-31   0.014000     5%      8    0.014     NaN
2022-09-30   0.013000    -6%      9    0.013     NaN
2022-10-31   0.018000     4%     10    0.018     NaN
2022-11-30   0.020000    -8%     11    0.020     NaN
2022-12-31   0.015000     7%     12    0.015     NaN
2023-01-31   0.022050     5%      1    0.021  1.0500
2023-02-28   0.020800     4%      2    0.020  1.0400
2023-03-31   0.019570     3%      3    0.019  1.0300
2023-04-30   0.018720     4%      4    0.018  1.0400
2023-05-31   0.018530     9%      5    0.017  1.0900
2023-06-30   0.015040    -6%      6    0.016  0.9400
2023-07-31   0.016050     7%      7    0.015  1.0700
2023-08-31   0.014700     5%      8    0.014  1.0500
2023-09-30   0.013780     6%      9    0.013  1.0600
2023-10-31   0.017280    -4%     10    0.018  0.9600
2023-11-30   0.020400     2%     11    0.020  1.0200
2023-12-31   0.015150     1%     12    0.015  1.0100
2024-01-31   0.023153     5%      1    0.021  1.1025
2024-02-28   0.021632     4%      2    0.020  1.0816
2024-03-31   0.020744     6%      3    0.019  1.0918
2024-04-30   0.019094     2%      4    0.018  1.0608
2024-05-31   0.016862    -9%      5    0.017  0.9919
2024-06-30   0.016243     8%      6    0.016  1.0152
2024-07-31   0.017013     6%      7    0.015  1.1342
2024-08-31   0.013671    -7%      8    0.014  0.9765
2024-09-30   0.014607     6%      9    0.013  1.1236
2024-10-31   0.017971     4%     10    0.018  0.9984
2024-11-30   0.020808     2%     11    0.020  1.0404
2024-12-31   0.014998    -1%     12    0.015  0.9999
```
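
One caveat that may matter on real data: the sample index uses 2024-02-28, but a true month-end index would contain 2024-02-29. Adding `pd.DateOffset(years=1)` to 2023-02-28 yields 2024-02-28, so a one-year shift would probably leave the leap-year February unfilled, whereas grouping by calendar month does not depend on the day. The sketch below illustrates this under that assumption. The three-row frame and the names `growth`, `missing`, `shifted`, `leap_day_matched`, `cumulative`, and `filled` are hypothetical, not part of the original answer.

```python
import pandas as pd

# Hypothetical February-only data with true month-end dates (values invented).
df = pd.DataFrame(
    {"Value Col": [0.020, None, None], "Factor": ["4%", "4%", "5%"]},
    index=pd.to_datetime(["2023-02-28", "2024-02-29", "2025-02-28"]),
)

# Growth multiplier per row: 4% -> 1.04
growth = df["Factor"].str.rstrip("%").astype(float).div(100).add(1)
missing = df["Value Col"].isna()

# A one-year shift maps 2023-02-28 to 2024-02-28, which is not in the index,
# so the leap-day row has no counterpart after shifting.
shifted = df["Value Col"].shift(freq=pd.DateOffset(years=1))
leap_day_matched = pd.Timestamp("2024-02-29") in shifted.index

# Grouping by calendar month ignores the day of the month.
month = df.index.month
cumulative = growth.where(missing).groupby(month).cumprod()
filled = df["Value Col"].groupby(month).ffill().mul(cumulative)
df["Value Col"] = df["Value Col"].fillna(filled)

print(leap_day_matched)
print(df)
```

If the index is guaranteed to use the same day of the month every year, as in the question's sample, the shift-based version is unaffected.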



# Answer 2
**Score**: 1

You can use a loop to fill the missing values of each following year:

```python
import pandas as pd

# convert the factor to a fraction: 4% -> 0.04
df['f'] = df['Factor'].str.extract(r'(-*\d+)').astype(int).div(100)

years = df.index.year

for y in years.unique():
    # original year
    m = years == y
    # next year
    m1 = years == y + 1
    # move the original year's DatetimeIndex forward by one year
    df1 = df[m].rename(lambda x: x + pd.offsets.DateOffset(years=1))
    # compute the new values and replace the NaNs
    s = df1['Value Col'].add(df.loc[m1, 'f'].mul(df1['Value Col']))
    df.loc[m1, 'Value Col'] = df.loc[m1, 'Value Col'].fillna(s)

print(df)
```

Output:

```
            Value Col Factor     f
2022-01-31   0.021000     5%  0.05
2022-02-28   0.020000     4%  0.04
2022-03-31   0.019000     3%  0.03
2022-04-30   0.018000     2%  0.02
2022-05-31   0.017000     9%  0.09
2022-06-30   0.016000     7%  0.07
2022-07-31   0.015000     7%  0.07
2022-08-31   0.014000     5%  0.05
2022-09-30   0.013000    -6% -0.06
2022-10-31   0.018000     4%  0.04
2022-11-30   0.020000    -8% -0.08
2022-12-31   0.015000     7%  0.07
2023-01-31   0.022050     5%  0.05
2023-02-28   0.020800     4%  0.04
2023-03-31   0.019570     3%  0.03
2023-04-30   0.018720     4%  0.04
2023-05-31   0.018530     9%  0.09
2023-06-30   0.015040    -6% -0.06
2023-07-31   0.016050     7%  0.07
2023-08-31   0.014700     5%  0.05
2023-09-30   0.013780     6%  0.06
2023-10-31   0.017280    -4% -0.04
2023-11-30   0.020400     2%  0.02
2023-12-31   0.015150     1%  0.01
2024-01-31   0.023152     5%  0.05
2024-02-28   0.021632     4%  0.04
2024-03-31   0.020744     6%  0.06
2024-04-30   0.019094     2%  0.02
2024-05-31   0.016862    -9% -0.09
2024-06-30   0.016243     8%  0.08
2024-07-31   0.017013     6%  0.06
2024-08-31   0.013671    -7% -0.07
2024-09-30   0.014607     6%  0.06
2024-10-31   0.017971     4%  0.04
2024-11-30   0.020808     2%  0.02
2024-12-31   0.014998    -1% -0.01
```
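
If the rows are known to be sorted, monthly, and gap-free, the same year-by-year fill can also be written positionally, without any date arithmetic: each missing entry takes the value `period` rows earlier, scaled by `1 + factor`. This is a plain-loop sketch of that idea, not the method from either answer. The function name `fill_year_over_year` and its parameters are hypothetical.

```python
import numpy as np

def fill_year_over_year(values, factors, period=12):
    """Fill NaNs from the entry `period` rows earlier, scaled by 1 + factor."""
    out = np.asarray(values, dtype=float).copy()
    rate = np.asarray(factors, dtype=float)
    for i in range(period, len(out)):
        if np.isnan(out[i]):
            # Earlier rows are already filled, so later years chain correctly.
            out[i] = out[i - period] * (1 + rate[i])
    return out

# January to April values for 2022 and factors for 2022 to 2024, taken from
# the question; period=4 is used only to keep the example short.
values = [0.021, 0.020, 0.019, 0.018] + [np.nan] * 8
factors = [0.05, 0.04, 0.03, 0.02,   # 2022 (unused: values are known)
           0.05, 0.04, 0.03, 0.04,   # 2023
           0.05, 0.04, 0.06, 0.02]   # 2024
filled = fill_year_over_year(values, factors, period=4)
print(filled)
```

With the full 36-row table, `period` would stay at its default of 12 and the factors would first need converting from strings such as `'4%'` to fractions such as `0.04`.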

huangapple
  • Posted on April 13, 2023, 18:56:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76004593.html