Convert monthly cumulative values to current month values in Pandas

Question

For the following data df1 with missing January data, cumul_val1 and cumul_val2 are the monthly cumulative values of value1 and value2 respectively.

          date  cumul_val1  cumul_val2
0   2020-05-31    48702.97    45919.59
1   2020-06-30    69403.68    62780.21
2   2020-07-31    83631.36    75324.61
3   2020-08-31    98485.95    88454.14
4   2020-09-30   117072.67   103484.20
5   2020-10-31   133293.80   116555.76
6   2020-11-30   150834.45   129492.36
7   2020-12-31   176086.22   141442.95
8   2021-02-28    17363.14    13985.87
9   2021-03-31    36007.05    27575.82
10  2021-04-30    50305.00    40239.76
11  2021-05-31    66383.32    54318.08
12  2021-06-30    88635.35    72179.07
13  2021-07-31   101648.18    84895.41
14  2021-08-31   114192.81    98059.73
15  2021-09-30   130331.78   112568.07
16  2021-10-31   143040.71   124933.62
17  2021-11-30   158130.73   137313.96
18  2021-12-31   179433.41   147602.08
19  2022-02-28    15702.61    14499.38
20  2022-03-31    31045.96    27764.95
21  2022-04-30    39768.15    39154.31
22  2022-05-31    50738.38    52133.62

I now want to convert them into monthly values. For example, the value of value1 for 2021-04-30 is calculated as 50305.00 - 36007.05. Because January is missing, the monthly value for February is the cumulative value itself, and the monthly value for March is the March cumulative value minus the February cumulative value.

How can I achieve this?

The expected result:

          date  cumul_val1  cumul_val2  month_val1  month_val2
0   2020-05-31    48702.97    45919.59         NaN         NaN
1   2020-06-30    69403.68    62780.21    20700.71    16860.62
2   2020-07-31    83631.36    75324.61    14227.68    12544.40
3   2020-08-31    98485.95    88454.14    14854.59    13129.53
4   2020-09-30   117072.67   103484.20    18586.72    15030.06
5   2020-10-31   133293.80   116555.76    16221.13    13071.56
6   2020-11-30   150834.45   129492.36    17540.65    12936.60
7   2020-12-31   176086.22   141442.95    25251.77    11950.59
8   2021-02-28    17363.14    13985.87    17363.14    13985.87
9   2021-03-31    36007.05    27575.82    18643.91    13589.95
10  2021-04-30    50305.00    40239.76    14297.96    12663.94
11  2021-05-31    66383.32    54318.08    16078.32    14078.32
12  2021-06-30    88635.35    72179.07    22252.03    17860.99
13  2021-07-31   101648.18    84895.41    13012.83    12716.34
14  2021-08-31   114192.81    98059.73    12544.63    13164.32
15  2021-09-30   130331.78   112568.07    16138.97    14508.34
16  2021-10-31   143040.71   124933.62    12708.94    12365.55
17  2021-11-30   158130.73   137313.96    15090.02    12380.34
18  2021-12-31   179433.41   147602.08    21302.68    10288.12
19  2022-02-28    15702.61    14499.38    15702.61    14499.38
20  2022-03-31    31045.96    27764.95    15343.35    13265.57
21  2022-04-30    39768.15    39154.31     8722.19    11389.36
22  2022-05-31    50738.38    52133.62    10970.22    12979.31

Note: to simplify the question, I have added an alternative sample dataset, df2, with no missing months:

          date    cumul_val  monthly_val
0   2020-09-30  32144142.46          NaN
1   2020-10-31  36061223.45   3917080.99
2   2020-11-30  40354684.50   4293461.05
3   2020-12-31  44360036.58   4005352.08
4   2021-01-31   4130729.28   4130729.28
5   2021-02-28   7985781.64   3855052.36
6   2021-03-31  12306556.74   4320775.10
7   2021-04-30  16873032.10   4566475.36
8   2021-05-31  21730065.01   4857032.91
9   2021-06-30  26816787.85   5086722.84
10  2021-07-31  31785276.80   4968488.95
11  2021-08-31  37030178.38   5244901.58
12  2021-09-30  42879767.13   5849588.75
13  2021-10-31  48392250.79   5512483.66
14  2021-11-30  53655448.65   5263197.86
15  2021-12-31  59965790.04   6310341.39
16  2022-01-31   5226910.15   5226910.15
17  2022-02-28   9481147.06   4254236.91
18  2022-03-31  14205738.71   4724591.65
19  2022-04-30  19096746.32   4891007.61
20  2022-05-31  24033460.77   4936714.45
21  2022-06-30  28913566.31   4880105.54
22  2022-07-31  34099663.15   5186096.84
23  2022-08-31  39082926.81   4983263.66
24  2022-09-30  44406354.61   5323427.80
25  2022-10-31  48889431.89   4483077.28
26  2022-11-30  52956747.09   4067315.20
27  2022-12-31  57184652.60   4227905.51

Answer 1

Score: 1

Had there been no gaps in the data, the problem would have been an easy .diff(). However, since there are gaps, we need to fill those gaps with 0, calculate the diff, then keep only the original months.

import pandas as pd

idx = pd.to_datetime(df["date"])
month_val = (
    df[["cumul_val1", "cumul_val2"]]
    # Fill the gap months with 0
    .set_index(idx)
    # "M" is the month-end alias; pandas 2.2+ deprecates it in favour of "ME"
    .reindex(pd.date_range(idx.min(), idx.max(), freq="M"), fill_value=0)
    # Take the diff
    .diff()
    # Keep only the original months
    .loc[idx]
    # Beat into shape for the subsequent concat
    .set_axis(["month_val1", "month_val2"], axis=1)
    .set_index(df.index)
)

result = pd.concat([df, month_val], axis=1)
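
For readers who want to try the fill-with-zero idea in isolation, here is a minimal, self-contained sketch. It uses four rows copied from df1 and a single column, and it swaps the month-end timestamps for monthly periods (`to_period` / `period_range`), which is my substitution rather than what the answer does. Note that the trick relies on the only missing month being January: with the gap-free df2, or with a gap in the middle of a year, the zero-filled diff would not give the monthly value.

```python
import pandas as pd

# Four rows copied from df1, spanning the missing January 2021.
df = pd.DataFrame(
    {
        "date": ["2020-11-30", "2020-12-31", "2021-02-28", "2021-03-31"],
        "cumul_val1": [150834.45, 176086.22, 17363.14, 36007.05],
    }
)

# One monthly period per row, plus the full unbroken range of months.
months = pd.DatetimeIndex(pd.to_datetime(df["date"])).to_period("M")
full_range = pd.period_range(months.min(), months.max(), freq="M")

monthly = (
    df.set_index(months)["cumul_val1"]
    .reindex(full_range, fill_value=0)  # January 2021 is inserted with value 0
    .diff()                             # month-over-month difference
    .loc[months]                        # keep only the months that were in df
    .to_numpy()
)

df["month_val1"] = monthly
print(df)
```

The first row stays NaN because it has no predecessor, and February 2021 ends up equal to its cumulative value because the inserted January is 0.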

Edit: the OP clarified that for the first entry of each year, be it January or February, the monthly value is the same as the cumulative value. In that case, use this:

import numpy as np
import pandas as pd

cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = [f"month_val{i+1}" for i in range(len(cumul_cols))]

# Make sure `date` is of type Timestamp and the dataframe is sorted. Your data
# may have satisfied both conditions already.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# Return True if current row is in the same year as the previous row.
# Repeat the result for each cumul_val column.
is_same_year = np.tile(
    df["date"].dt.year.diff().eq(0).to_numpy()[:, None],
    (1, len(cumul_cols)),
)

month_val = np.where(
    is_same_year,
    df[cumul_cols].diff(),
    df[cumul_cols],
)
month_val[0, :] = np.nan

df[monthly_cols] = month_val
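
The same rule (diff within a year, cumulative value for the first row of a year) can probably also be written with a per-year `groupby`. The sketch below is an alternative formulation, not the code above, and it assumes the cumulative column itself has no NaN values. It uses four rows copied from df2, around the 2020/2021 boundary.

```python
import numpy as np
import pandas as pd

# Four rows copied from df2 (no missing months).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-11-30", "2020-12-31", "2021-01-31", "2021-02-28"]
        ),
        "cumul_val": [40354684.50, 44360036.58, 4130729.28, 7985781.64],
    }
).sort_values("date", ignore_index=True)

# Difference within each calendar year; the first row of each year is NaN.
within_year = df.groupby(df["date"].dt.year)["cumul_val"].diff()

# For the first row of a year, the monthly value is the cumulative value.
df["monthly_val"] = within_year.fillna(df["cumul_val"])

# The very first row has no known predecessor, so mark it as unknown,
# as in the expected result of the question.
df.loc[df.index[0], "monthly_val"] = np.nan

print(df)
```

Because the grouping key is the calendar year, this works whether the first available month of a year is January (df2) or February (df1).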

Answer 2

Score: 1

It would be much easier to use your date as a PeriodIndex with a monthly frequency:

import pandas as pd

# set up the date as a monthly period Index
df2 = df.assign(date=pd.to_datetime(df['date']).dt.to_period('M')).set_index('date')

# subtract the previous month
df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2)

Output:

         cumul_val1  cumul_val2
date                           
2020-05    48702.97    45919.59
2020-06    20700.71    16860.62
2020-07    14227.68    12544.40
2020-08    14854.59    13129.53
2020-09    18586.72    15030.06
2020-10    16221.13    13071.56
2020-11    17540.65    12936.60
2020-12    25251.77    11950.59
2021-02    17363.14    13985.87
2021-03    18643.91    13589.95
2021-04    14297.95    12663.94
2021-05    16078.32    14078.32
2021-06    22252.03    17860.99
2021-07    13012.83    12716.34
2021-08    12544.63    13164.32
2021-09    16138.97    14508.34
2021-10    12708.93    12365.55
2021-11    15090.02    12380.34
2021-12    21302.68    10288.12
2022-02    15702.61    14499.38
2022-03    15343.35    13265.57
2022-04     8722.19    11389.36
2022-05    10970.23    12979.31

If you want to assign back to the original DataFrame:

df[['month_val1', 'month_val2']] = df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2).to_numpy()

Updated df:

          date  cumul_val1  cumul_val2  month_val1  month_val2
0   2020-05-31    48702.97    45919.59    48702.97    45919.59
1   2020-06-30    69403.68    62780.21    20700.71    16860.62
2   2020-07-31    83631.36    75324.61    14227.68    12544.40
3   2020-08-31    98485.95    88454.14    14854.59    13129.53
4   2020-09-30   117072.67   103484.20    18586.72    15030.06
5   2020-10-31   133293.80   116555.76    16221.13    13071.56
6   2020-11-30   150834.45   129492.36    17540.65    12936.60
7   2020-12-31   176086.22   141442.95    25251.77    11950.59
8   2021-02-28    17363.14    13985.87    17363.14    13985.87
9   2021-03-31    36007.05    27575.82    18643.91    13589.95
10  2021-04-30    50305.00    40239.76    14297.95    12663.94
11  2021-05-31    66383.32    54318.08    16078.32    14078.32
12  2021-06-30    88635.35    72179.07    22252.03    17860.99
13  2021-07-31   101648.18    84895.41    13012.83    12716.34
14  2021-08-31   114192.81    98059.73    12544.63    13164.32
15  2021-09-30   130331.78   112568.07    16138.97    14508.34
16  2021-10-31   143040.71   124933.62    12708.93    12365.55
17  2021-11-30   158130.73   137313.96    15090.02    12380.34
18  2021-12-31   179433.41   147602.08    21302.68    10288.12
19  2022-02-28    15702.61    14499.38    15702.61    14499.38
20  2022-03-31    31045.96    27764.95    15343.35    13265.57
21  2022-04-30    39768.15    39154.31     8722.19    11389.36
22  2022-05-31    50738.38    52133.62    10970.23    12979.31
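
In the updated df above, the first row's monthly value equals its cumulative value, whereas the expected result in the question has NaN there. Below is a minimal, self-contained sketch of the same previous-month subtraction that also blanks the first row. It is a variant under my own assumptions: the previous month is looked up with `index - 1` on the PeriodIndex instead of `shift(freq='1M')`, and only four rows copied from df1 are used. Like the code above, it treats any absent previous month as 0, so it suits df1 (January missing) but not the gap-free df2.

```python
import numpy as np
import pandas as pd

cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = ["month_val1", "month_val2"]

# Four rows copied from df1, spanning the missing January 2021.
df = pd.DataFrame(
    {
        "date": ["2020-11-30", "2020-12-31", "2021-02-28", "2021-03-31"],
        "cumul_val1": [150834.45, 176086.22, 17363.14, 36007.05],
        "cumul_val2": [129492.36, 141442.95, 13985.87, 27575.82],
    }
)

# Index the cumulative columns by monthly period.
months = pd.DatetimeIndex(pd.to_datetime(df["date"])).to_period("M")
by_month = df.set_index(months)[cumul_cols]

# Look up each row's previous month; months absent from the data give NaN.
prev = by_month.reindex(by_month.index - 1)
prev.index = by_month.index

# Treat an absent previous month as 0, then subtract.
monthly = by_month - prev.fillna(0)

df[monthly_cols] = monthly.to_numpy()

# The first row has no known predecessor; mark it as unknown.
df.loc[df.index[0], monthly_cols] = np.nan

print(df)
```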

Answer 3

Score: 0

Fortunately pandas offers a diff function for this:

import pandas as pd

df = pd.DataFrame([['2020-05-31', 48702.97, 45919.59], ['2020-06-30', 69403.68, 62780.21], ['2020-07-31', 83631.36, 75324.61]], columns=['date', 'cumul_val1', 'cumul_val2'])
df['val1'] = df['cumul_val1'].diff()
df['val2'] = df['cumul_val2'].diff()
print(df)
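
One caveat worth checking against the question's data: the cumulative columns restart every year, so a plain diff across a year boundary gives a negative number rather than the monthly value. A minimal sketch with four rows copied from df1 (the column name `val1` follows the snippet above):

```python
import pandas as pd

# Four rows copied from df1, crossing from December 2020 into 2021.
df = pd.DataFrame(
    {
        "date": ["2020-11-30", "2020-12-31", "2021-02-28", "2021-03-31"],
        "cumul_val1": [150834.45, 176086.22, 17363.14, 36007.05],
    }
)

# Plain row-to-row difference, ignoring the yearly restart.
df["val1"] = df["cumul_val1"].diff()

print(df)
```

The 2021-02-28 row comes out negative here, while the question expects 17363.14, so the yearly restart has to be handled separately, as the other answers do.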

</details>


