Convert monthly cumulative values to current month values in Pandas
Question
In the following data df1, in which the January data is missing, cumul_val1 and cumul_val2 are the monthly cumulative values of value1 and value2 respectively.
date cumul_val1 cumul_val2
0 2020-05-31 48702.97 45919.59
1 2020-06-30 69403.68 62780.21
2 2020-07-31 83631.36 75324.61
3 2020-08-31 98485.95 88454.14
4 2020-09-30 117072.67 103484.20
5 2020-10-31 133293.80 116555.76
6 2020-11-30 150834.45 129492.36
7 2020-12-31 176086.22 141442.95
8 2021-02-28 17363.14 13985.87
9 2021-03-31 36007.05 27575.82
10 2021-04-30 50305.00 40239.76
11 2021-05-31 66383.32 54318.08
12 2021-06-30 88635.35 72179.07
13 2021-07-31 101648.18 84895.41
14 2021-08-31 114192.81 98059.73
15 2021-09-30 130331.78 112568.07
16 2021-10-31 143040.71 124933.62
17 2021-11-30 158130.73 137313.96
18 2021-12-31 179433.41 147602.08
19 2022-02-28 15702.61 14499.38
20 2022-03-31 31045.96 27764.95
21 2022-04-30 39768.15 39154.31
22 2022-05-31 50738.38 52133.62
I now want to convert them into monthly values. For example, the value of value1 on 2021-04-30 is calculated as 50305.00 - 36007.05. January is missing, so the February value is the cumulative value itself, and the March value is the March cumulative value minus the February cumulative value.
How can I achieve this?
The expected result:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 NaN NaN
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.96 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.94 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.22 12979.31
Note: to simplify the question, I added an alternative sample dataset, df2, without missing months:
date cumul_val monthly_val
0 2020-09-30 32144142.46 NaN
1 2020-10-31 36061223.45 3917080.99
2 2020-11-30 40354684.50 4293461.05
3 2020-12-31 44360036.58 4005352.08
4 2021-01-31 4130729.28 4130729.28
5 2021-02-28 7985781.64 3855052.36
6 2021-03-31 12306556.74 4320775.10
7 2021-04-30 16873032.10 4566475.36
8 2021-05-31 21730065.01 4857032.91
9 2021-06-30 26816787.85 5086722.84
10 2021-07-31 31785276.80 4968488.95
11 2021-08-31 37030178.38 5244901.58
12 2021-09-30 42879767.13 5849588.75
13 2021-10-31 48392250.79 5512483.66
14 2021-11-30 53655448.65 5263197.86
15 2021-12-31 59965790.04 6310341.39
16 2022-01-31 5226910.15 5226910.15
17 2022-02-28 9481147.06 4254236.91
18 2022-03-31 14205738.71 4724591.65
19 2022-04-30 19096746.32 4891007.61
20 2022-05-31 24033460.77 4936714.45
21 2022-06-30 28913566.31 4880105.54
22 2022-07-31 34099663.15 5186096.84
23 2022-08-31 39082926.81 4983263.66
24 2022-09-30 44406354.61 5323427.80
25 2022-10-31 48889431.89 4483077.28
26 2022-11-30 52956747.09 4067315.20
27 2022-12-31 57184652.60 4227905.51
Answer 1
Score: 1
Had there been no gaps in the data, this would have been a simple .diff(). However, since there are gaps, we need to fill the gap months with 0, calculate the diff, then keep only the original months.
import pandas as pd

idx = pd.to_datetime(df["date"])
month_val = (
    df[["cumul_val1", "cumul_val2"]]
    # Fill the gap months with 0
    .set_index(idx)
    # freq="M" is the month-end alias; pandas 2.2+ spells it "ME"
    .reindex(pd.date_range(idx.min(), idx.max(), freq="M"), fill_value=0)
    # Take the diff
    .diff()
    # Keep only the original months
    .loc[idx]
    # Beat into shape for the subsequent concat
    .set_axis(["month_val1", "month_val2"], axis=1)
    .set_index(df.index)
)
result = pd.concat([df, month_val], axis=1)
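As a quick check of the fill-with-zero idea, here is a minimal sketch on a four-row excerpt of df1 around the missing January 2021. The excerpt and the names full_range and monthly are illustrative and not part of the original answer, and pd.offsets.MonthEnd() is passed as the frequency only to sidestep the "M" versus "ME" alias difference between pandas versions.

```python
import pandas as pd

# Four rows of df1 around the missing January 2021 (illustrative excerpt).
df = pd.DataFrame(
    {
        "date": ["2020-11-30", "2020-12-31", "2021-02-28", "2021-03-31"],
        "cumul_val1": [150834.45, 176086.22, 17363.14, 36007.05],
        "cumul_val2": [129492.36, 141442.95, 13985.87, 27575.82],
    }
)

idx = pd.to_datetime(df["date"])
# Gap-free month-end calendar covering the data, including 2021-01-31.
full_range = pd.date_range(idx.min(), idx.max(), freq=pd.offsets.MonthEnd())

monthly = (
    df[["cumul_val1", "cumul_val2"]]
    .set_index(idx)
    .reindex(full_range, fill_value=0)  # the missing January becomes a row of zeros
    .diff()                             # February minus the zero-filled January
    .loc[idx]                           # drop the synthetic January row again
    .round(2)
)
print(monthly)
```

February comes out equal to its cumulative value only because the January row was absent and therefore filled with 0; if January were present, this approach would subtract December from it.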
Edit: the OP clarified that for the first entry of each year, be it January or February, the monthly value equals the cumulative value. In that case, use this:
import numpy as np
import pandas as pd

cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = [f"month_val{i+1}" for i in range(len(cumul_cols))]

# Make sure `date` is of type Timestamp and the dataframe is sorted. Your data
# may have satisfied both conditions already.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# True if the current row is in the same year as the previous row.
# Repeat the result for each cumul_val column.
is_same_year = np.tile(
    df["date"].dt.year.diff().eq(0).to_numpy()[:, None],
    (1, len(cumul_cols)),
)
# Same year: difference from the previous row. New year: the cumulative value itself.
month_val = np.where(
    is_same_year,
    df[cumul_cols].diff(),
    df[cumul_cols],
)
# The very first row has no earlier month to compare against.
month_val[0, :] = np.nan
df[monthly_cols] = month_val
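The same rule (difference within a year, cumulative value for the first entry of a year) can also be written with a per-year groupby. This is an alternative sketch rather than the answer's method; it uses four rows taken from df2, and it assumes, as in the expected output, that the very first row of the data should be NaN.

```python
import pandas as pd

# Rows 14-17 of df2, spanning the 2021 -> 2022 boundary.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-11-30", "2021-12-31", "2022-01-31", "2022-02-28"]
        ),
        "cumul_val": [53655448.65, 59965790.04, 5226910.15, 9481147.06],
    }
).sort_values("date")

year = df["date"].dt.year
# Within each year take the month-over-month difference; the first row of
# each year has no predecessor in its group, so diff() leaves NaN there.
monthly = df.groupby(year)["cumul_val"].diff()
# For those first-of-year rows, fall back to the cumulative value itself.
monthly = monthly.fillna(df["cumul_val"])
# Assumption: the first row of the whole dataset has no baseline, so blank it.
monthly.iloc[0] = float("nan")

df["monthly_val"] = monthly.round(2)
print(df)
```

Grouping by year keeps the reset logic in one place and extends to more cumulative columns by selecting a list of columns instead of a single one.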
Answer 2
Score: 1
It would be much easier to use your date as a PeriodIndex with monthly frequency:
# set up the date as a monthly period Index
df2 = df.assign(date=pd.to_datetime(df['date']).dt.to_period('M')).set_index('date')
# subtract the previous month
df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2)
Output:
cumul_val1 cumul_val2
date
2020-05 48702.97 45919.59
2020-06 20700.71 16860.62
2020-07 14227.68 12544.40
2020-08 14854.59 13129.53
2020-09 18586.72 15030.06
2020-10 16221.13 13071.56
2020-11 17540.65 12936.60
2020-12 25251.77 11950.59
2021-02 17363.14 13985.87
2021-03 18643.91 13589.95
2021-04 14297.95 12663.94
2021-05 16078.32 14078.32
2021-06 22252.03 17860.99
2021-07 13012.83 12716.34
2021-08 12544.63 13164.32
2021-09 16138.97 14508.34
2021-10 12708.93 12365.55
2021-11 15090.02 12380.34
2021-12 21302.68 10288.12
2022-02 15702.61 14499.38
2022-03 15343.35 13265.57
2022-04 8722.19 11389.36
2022-05 10970.23 12979.31
If you want to assign back to the original DataFrame:
df[['month_val1', 'month_val2']] = df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2).to_numpy()
Updated df:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 48702.97 45919.59
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.95 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.93 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.23 12979.31
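In the output above the first row carries the cumulative value, whereas the question's expected result shows NaN there. A minimal sketch of the same PeriodIndex approach with that first row blanked out, on a three-row excerpt of df1 (the excerpt and the name monthly are illustrative):

```python
import numpy as np
import pandas as pd

# Three rows of df1 around the missing January 2021 (illustrative excerpt).
df = pd.DataFrame(
    {
        "date": ["2020-12-31", "2021-02-28", "2021-03-31"],
        "cumul_val1": [176086.22, 17363.14, 36007.05],
        "cumul_val2": [141442.95, 13985.87, 27575.82],
    }
)

# Monthly PeriodIndex, as in the answer.
df2 = df.assign(date=pd.to_datetime(df["date"]).dt.to_period("M")).set_index("date")

# Subtract the previous month; months with no predecessor subtract 0.
monthly = df2.sub(df2.shift(freq="1M"), fill_value=0).reindex_like(df2)
# Assumption: the first row has no baseline, so mark it as missing
# to match the expected result in the question.
monthly.iloc[0] = np.nan

df[["month_val1", "month_val2"]] = monthly.round(2).to_numpy()
print(df)
```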
Answer 3
Score: 0
Fortunately, pandas offers a diff function for this:
import pandas as pd

# Three sample rows from the question's data
df = pd.DataFrame(
    [
        ["2020-05-31", 48702.97, 45919.59],
        ["2020-06-30", 69403.68, 62780.21],
        ["2020-07-31", 83631.36, 75324.61],
    ],
    columns=["date", "cumul_val1", "cumul_val2"],
)
# Difference between each row and the previous one
df["val1"] = df["cumul_val1"].diff()
df["val2"] = df["cumul_val2"].diff()
print(df)
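One caveat worth checking: a plain diff() knows nothing about the cumulative total restarting each year. The following sketch applies it to three rows of df1 that span the 2021 to 2022 boundary (the excerpt is illustrative):

```python
import pandas as pd

# Rows of df1 around the year boundary (illustrative excerpt).
df = pd.DataFrame(
    [
        ["2021-11-30", 158130.73],
        ["2021-12-31", 179433.41],
        ["2022-02-28", 15702.61],
    ],
    columns=["date", "cumul_val1"],
)
# Row-over-row difference with no awareness of the yearly reset.
df["val1"] = df["cumul_val1"].diff().round(2)
print(df)
```

The February 2022 row comes out negative (the new year's total minus the previous December's), so this works only for rows within a single year.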