February 8, 2023 09:42:09

Convert monthly cumulative values to current month values in Pandas

Question


For the following data df1 with missing January data, cumul_val1 and cumul_val2 are the monthly cumulative values of value1 and value2 respectively.

          date  cumul_val1  cumul_val2
0   2020-05-31    48702.97    45919.59
1   2020-06-30    69403.68    62780.21
2   2020-07-31    83631.36    75324.61
3   2020-08-31    98485.95    88454.14
4   2020-09-30   117072.67   103484.20
5   2020-10-31   133293.80   116555.76
6   2020-11-30   150834.45   129492.36
7   2020-12-31   176086.22   141442.95
8   2021-02-28    17363.14    13985.87
9   2021-03-31    36007.05    27575.82
10  2021-04-30    50305.00    40239.76
11  2021-05-31    66383.32    54318.08
12  2021-06-30    88635.35    72179.07
13  2021-07-31   101648.18    84895.41
14  2021-08-31   114192.81    98059.73
15  2021-09-30   130331.78   112568.07
16  2021-10-31   143040.71   124933.62
17  2021-11-30   158130.73   137313.96
18  2021-12-31   179433.41   147602.08
19  2022-02-28    15702.61    14499.38
20  2022-03-31    31045.96    27764.95
21  2022-04-30    39768.15    39154.31
22  2022-05-31    50738.38    52133.62

I now want to convert them into monthly values. For example, the monthly value of value1 on 2021-04-30 is calculated as 50305.00 - 36007.05. The January value is missing, so the monthly value for February is the cumulative value itself, and the monthly value for March is the March cumulative value minus the February cumulative value.
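To make the rule concrete, here is a minimal plain-Python sketch of the arithmetic described above, using four value1 figures copied from df1. The names `cumul` and `monthly` are illustrative only, and the sketch assumes the rows are already in date order.

```python
# Cumulative (year-to-date) value1 figures copied from df1.
cumul = {
    "2020-12-31": 176086.22,
    "2021-02-28": 17363.14,
    "2021-03-31": 36007.05,
    "2021-04-30": 50305.00,
}

monthly = {}
prev_date, prev_value = None, None
for date, value in cumul.items():
    if prev_date is None:
        # No earlier row to compare with.
        monthly[date] = None
    elif date[:4] != prev_date[:4]:
        # First row of a new year: the cumulative value restarts,
        # so it is already the monthly value.
        monthly[date] = value
    else:
        # Same year: subtract the previous cumulative value.
        monthly[date] = round(value - prev_value, 2)
    prev_date, prev_value = date, value

print(monthly)
```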

May I ask how to achieve it?

The expected result:

          date  cumul_val1  cumul_val2  month_val1  month_val2
0   2020-05-31    48702.97    45919.59         NaN         NaN
1   2020-06-30    69403.68    62780.21    20700.71    16860.62
2   2020-07-31    83631.36    75324.61    14227.68    12544.40
3   2020-08-31    98485.95    88454.14    14854.59    13129.53
4   2020-09-30   117072.67   103484.20    18586.72    15030.06
5   2020-10-31   133293.80   116555.76    16221.13    13071.56
6   2020-11-30   150834.45   129492.36    17540.65    12936.60
7   2020-12-31   176086.22   141442.95    25251.77    11950.59
8   2021-02-28    17363.14    13985.87    17363.14    13985.87
9   2021-03-31    36007.05    27575.82    18643.91    13589.95
10  2021-04-30    50305.00    40239.76    14297.96    12663.94
11  2021-05-31    66383.32    54318.08    16078.32    14078.32
12  2021-06-30    88635.35    72179.07    22252.03    17860.99
13  2021-07-31   101648.18    84895.41    13012.83    12716.34
14  2021-08-31   114192.81    98059.73    12544.63    13164.32
15  2021-09-30   130331.78   112568.07    16138.97    14508.34
16  2021-10-31   143040.71   124933.62    12708.94    12365.55
17  2021-11-30   158130.73   137313.96    15090.02    12380.34
18  2021-12-31   179433.41   147602.08    21302.68    10288.12
19  2022-02-28    15702.61    14499.38    15702.61    14499.38
20  2022-03-31    31045.96    27764.95    15343.35    13265.57
21  2022-04-30    39768.15    39154.31     8722.19    11389.36
22  2022-05-31    50738.38    52133.62    10970.22    12979.31

Note: to simplify the question, I added an alternative sample, df2, with no missing months:

          date    cumul_val  monthly_val
0   2020-09-30  32144142.46          NaN
1   2020-10-31  36061223.45   3917080.99
2   2020-11-30  40354684.50   4293461.05
3   2020-12-31  44360036.58   4005352.08
4   2021-01-31   4130729.28   4130729.28
5   2021-02-28   7985781.64   3855052.36
6   2021-03-31  12306556.74   4320775.10
7   2021-04-30  16873032.10   4566475.36
8   2021-05-31  21730065.01   4857032.91
9   2021-06-30  26816787.85   5086722.84
10  2021-07-31  31785276.80   4968488.95
11  2021-08-31  37030178.38   5244901.58
12  2021-09-30  42879767.13   5849588.75
13  2021-10-31  48392250.79   5512483.66
14  2021-11-30  53655448.65   5263197.86
15  2021-12-31  59965790.04   6310341.39
16  2022-01-31   5226910.15   5226910.15
17  2022-02-28   9481147.06   4254236.91
18  2022-03-31  14205738.71   4724591.65
19  2022-04-30  19096746.32   4891007.61
20  2022-05-31  24033460.77   4936714.45
21  2022-06-30  28913566.31   4880105.54
22  2022-07-31  34099663.15   5186096.84
23  2022-08-31  39082926.81   4983263.66
24  2022-09-30  44406354.61   5323427.80
25  2022-10-31  48889431.89   4483077.28
26  2022-11-30  52956747.09   4067315.20
27  2022-12-31  57184652.60   4227905.51

Answer 1

Score: 1


Had there been no gap in the data, the problem would have been an easy .diff(). However, since there are gaps, we need to fill those gaps with 0, calculate the diff, then keep only the original months.

import pandas as pd

idx = pd.to_datetime(df["date"])
month_val = (
    df[["cumul_val1", "cumul_val2"]]
    # Fill the gap months with 0
    .set_index(idx)
    # Note: pandas 2.2+ deprecates the "M" alias in favor of "ME" for month-end
    .reindex(pd.date_range(idx.min(), idx.max(), freq="M"), fill_value=0)
    # Take the diff
    .diff()
    # Keep only the original months
    .loc[idx]
    # Rename the columns and restore df's index for the subsequent concat
    .set_axis(["month_val1", "month_val2"], axis=1)
    .set_index(df.index)
)

result = pd.concat([df, month_val], axis=1)
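As a small runnable illustration of the same fill-diff-filter idea, the sketch below uses four rows copied from df1 and a monthly PeriodIndex rather than month-end timestamps. That swap is my own variation, not part of the answer; the names `periods`, `full_range`, and `filled` are illustrative.

```python
import pandas as pd

# Four rows copied from df1; January 2021 is missing.
df = pd.DataFrame({
    "date": ["2020-11-30", "2020-12-31", "2021-02-28", "2021-03-31"],
    "cumul_val1": [150834.45, 176086.22, 17363.14, 36007.05],
})

# One monthly period per row, plus the gap-free range that spans them.
periods = pd.DatetimeIndex(df["date"]).to_period("M")
full_range = pd.period_range(periods.min(), periods.max(), freq="M")

# Insert the missing month (2021-01) with a cumulative value of 0.
filled = df.set_index(periods)["cumul_val1"].reindex(full_range, fill_value=0)

# Diff over the gap-free series, then keep only the months that were in df.
df["month_val1"] = filled.diff().loc[periods].to_numpy()

print(df)
```

Because the inserted January row is 0, the February difference is the February cumulative value itself, which is the behaviour the question asks for when January is absent.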

Edit: the OP clarified that for the first entry of each year, be it January or February, the monthly value equals the cumulative value. In that case, use this:

import numpy as np
import pandas as pd

cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = [f"month_val{i+1}" for i in range(len(cumul_cols))]

# Make sure `date` is of type Timestamp and the dataframe is sorted. Your data
# may already satisfy both conditions.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# True if the current row is in the same year as the previous row.
# Repeat the result for each cumul_val column.
is_same_year = np.tile(
    df["date"].dt.year.diff().eq(0).to_numpy()[:, None],
    (1, len(cumul_cols)),
)

# Same year: take the diff. New year: the cumulative value is the monthly value.
month_val = np.where(
    is_same_year,
    df[cumul_cols].diff(),
    df[cumul_cols],
)
# The very first row has no predecessor, so its monthly value is unknown.
month_val[0, :] = np.nan

df[monthly_cols] = month_val
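An alternative way to express the same "restart at each year" rule is a per-year groupby. This is only a sketch and not the answer's own method: it uses four rows copied from df2, and the variable names `year` and `within_year` are illustrative.

```python
import pandas as pd

# Four rows copied from df2 (January is present here).
df = pd.DataFrame({
    "date": ["2020-11-30", "2020-12-31", "2021-01-31", "2021-02-28"],
    "cumul_val": [40354684.50, 44360036.58, 4130729.28, 7985781.64],
})
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

year = df["date"].dt.year

# Difference within each calendar year; the first row of every year is NaN.
within_year = df.groupby(year)["cumul_val"].diff()

# For those first rows, the cumulative value is itself the monthly value.
df["monthly_val"] = within_year.fillna(df["cumul_val"])

# The very first row has no known predecessor, so leave it as NaN,
# matching the expected output in the question.
df.loc[df.index[0], "monthly_val"] = float("nan")

print(df)
```

One caveat with this sketch: `fillna` would also overwrite a NaN produced by a genuinely missing cumulative value, so it assumes the cumulative column has no gaps of its own.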

Answer 2

Score: 1


It would be much easier to use your date as a PeriodIndex with monthly frequencies:

# set up the date as a monthly PeriodIndex
df2 = df.assign(date=pd.to_datetime(df['date']).dt.to_period('M')).set_index('date')

# subtract the previous month
df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2)

Output:

         cumul_val1  cumul_val2
date                           
2020-05    48702.97    45919.59
2020-06    20700.71    16860.62
2020-07    14227.68    12544.40
2020-08    14854.59    13129.53
2020-09    18586.72    15030.06
2020-10    16221.13    13071.56
2020-11    17540.65    12936.60
2020-12    25251.77    11950.59
2021-02    17363.14    13985.87
2021-03    18643.91    13589.95
2021-04    14297.95    12663.94
2021-05    16078.32    14078.32
2021-06    22252.03    17860.99
2021-07    13012.83    12716.34
2021-08    12544.63    13164.32
2021-09    16138.97    14508.34
2021-10    12708.93    12365.55
2021-11    15090.02    12380.34
2021-12    21302.68    10288.12
2022-02    15702.61    14499.38
2022-03    15343.35    13265.57
2022-04     8722.19    11389.36
2022-05    10970.23    12979.31

If you want to assign back to the original DataFrame:

df[['month_val1', 'month_val2']] = df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2).to_numpy()

Updated df:

          date  cumul_val1  cumul_val2  month_val1  month_val2
0   2020-05-31    48702.97    45919.59    48702.97    45919.59
1   2020-06-30    69403.68    62780.21    20700.71    16860.62
2   2020-07-31    83631.36    75324.61    14227.68    12544.40
3   2020-08-31    98485.95    88454.14    14854.59    13129.53
4   2020-09-30   117072.67   103484.20    18586.72    15030.06
5   2020-10-31   133293.80   116555.76    16221.13    13071.56
6   2020-11-30   150834.45   129492.36    17540.65    12936.60
7   2020-12-31   176086.22   141442.95    25251.77    11950.59
8   2021-02-28    17363.14    13985.87    17363.14    13985.87
9   2021-03-31    36007.05    27575.82    18643.91    13589.95
10  2021-04-30    50305.00    40239.76    14297.95    12663.94
11  2021-05-31    66383.32    54318.08    16078.32    14078.32
12  2021-06-30    88635.35    72179.07    22252.03    17860.99
13  2021-07-31   101648.18    84895.41    13012.83    12716.34
14  2021-08-31   114192.81    98059.73    12544.63    13164.32
15  2021-09-30   130331.78   112568.07    16138.97    14508.34
16  2021-10-31   143040.71   124933.62    12708.93    12365.55
17  2021-11-30   158130.73   137313.96    15090.02    12380.34
18  2021-12-31   179433.41   147602.08    21302.68    10288.12
19  2022-02-28    15702.61    14499.38    15702.61    14499.38
20  2022-03-31    31045.96    27764.95    15343.35    13265.57
21  2022-04-30    39768.15    39154.31     8722.19    11389.36
22  2022-05-31    50738.38    52133.62    10970.23    12979.31

Answer 3

Score: 0


Fortunately pandas offers a diff function for this:

import pandas as pd

df = pd.DataFrame([['2020-05-31', 48702.97, 45919.59], ['2020-06-30', 69403.68, 62780.21], ['2020-07-31', 83631.36, 75324.61]], columns=['date', 'cumul_val1', 'cumul_val2'])
# Row-to-row difference of each cumulative column; the first row is NaN
df['val1'] = df['cumul_val1'].diff()
df['val2'] = df['cumul_val2'].diff()
print(df)
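Note that a plain .diff() only matches the expected result within a single year. The short sketch below, using four rows copied from df1, shows what happens at the point where the cumulative value restarts.

```python
import pandas as pd

# Four rows copied from df1, spanning the 2020/2021 boundary.
df = pd.DataFrame({
    "date": ["2020-11-30", "2020-12-31", "2021-02-28", "2021-03-31"],
    "cumul_val1": [150834.45, 176086.22, 17363.14, 36007.05],
})

# Plain row-to-row difference, ignoring the yearly restart.
df["val1"] = df["cumul_val1"].diff()

print(df)
```

The 2021-02-28 row comes out negative (the February cumulative value minus the December one) instead of 17363.14, so the year boundary needs separate handling.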

