2023年6月12日 16:13:30go评论97阅读模式

英文:

How to find how the values have changed from the begining of their period?

问题

你可以尝试以下代码来实现你的目标：

# 转换日期列为日期时间类型
df['date'] = pd.to_datetime(df['date'])
# 找到每个周期的开始日期（每年的8月1日）
cycle_start_dates = df[df.index.month == 8]
# 使用reindex方法将周期开始日期应用到整个数据框
df_reindexed = df.reindex(cycle_start_dates.index, method='ffill')
# 计算每个月与其对应的周期开始日期的差异
df_difference = df - df_reindexed
# 打印结果
print(df_difference)

这段代码将计算每个月与其对应的周期开始日期的差异，以实现你想要的目标。

英文:

I have a dataframe with timeindex. There is a period (cycle) that starts every august. I want to calculate the difference between the value of each month and the value that was on the previous august (the beginning of its period).
The aim is to know how much the values have changed during each cycle.

This is an small example of the data:

import pandas as pd
import numpy as np
df = pd.DataFrame({&#39;date&#39;: [&#39;2022-01-01 00:00:00&#39;, &#39;2022-02-01 00:00:00&#39;, 
                          &#39;2021-03-01 00:00:00&#39;, &#39;2021-04-01 00:00:00&#39;,
                          &#39;2021-05-01 00:00:00&#39;, &#39;2021-06-01 00:00:00&#39;,
                          &#39;2021-07-01 00:00:00&#39;, &#39;2021-08-01 00:00:00&#39;,
                          &#39;2021-09-01 00:00:00&#39;, &#39;2021-10-01 00:00:00&#39;,
                          &#39;2021-11-01 00:00:00&#39;, &#39;2021-12-01 00:00:00&#39;,
                          &#39;2022-01-01 00:00:00&#39;, &#39;2022-02-01 00:00:00&#39;,
                          &#39;2022-03-01 00:00:00&#39;, &#39;2022-04-01 00:00:00&#39;,
                          &#39;2022-05-01 00:00:00&#39;, &#39;2022-06-01 00:00:00&#39;,
                          &#39;2022-07-01 00:00:00&#39;, &#39;2022-08-01 00:00:00&#39;,
                          &#39;2022-09-01 00:00:00&#39;, &#39;2022-10-01 00:00:00&#39;,
                          &#39;2022-11-01 00:00:00&#39;, &#39;2023-12-01 00:00:00&#39;,
                          &#39;2023-01-01 00:00:00&#39;, &#39;2023-02-01 00:00:00&#39;,
                          &#39;2023-03-01 00:00:00&#39;, &#39;2023-04-01 00:00:00&#39;,
                          &#39;2023-05-01 00:00:00&#39;, &#39;2023-06-01 00:00:00&#39;,
                          ], 
               &#39;value1&#39;: [0.452762281,0.372262281,0.513928948,0.447762281,
                            0.377095615,0.355095615,0.271428948,0.291762281,
                            0.476762281,0.335928948,0.280428948,0.283762281,
                            0.322928948,0.287262281,0.316928948,0.209262281,
                            0.407928948,0.254262281,0.232095615,0.264262281,
                            0.076095615,-0.025237719,-0.042237719,-0.094904385,
                            0.017428948,-0.036071052,-0.094071052,-0.071404385,
                            0.008095615,-0.141571052],
               &#39;value2&#39;: [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                               6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                               1.6,0.4,1.5,1.4,2.2,1.2]})
df = df.set_index(&#39;date&#39;)

I have tried to get a dataframe with the same structure but all the values have the value that was on its corresponding august, and then I just need to subtract one dataframe from the other.
But I couldn't find a way to do it neither.

Thank you vary much in advance.

答案1

得分: 0

如果我理解您的意思正确的话，您可以使用.groupby()然后计算每个组的差异：

df.index = pd.to_datetime(df.index)
tmp = (df.index.month == 8).cumsum()
out = (
    df[tmp != 0]
    .groupby(tmp[tmp > 0], group_keys=False)
    .apply(lambda x: x[["value1", "value2"]] - x[["value1", "value2"]].iloc[0])
).add_prefix('calculated_')
x = df.join(out)
print(x)

打印结果：

              value1  value2  calculated_value1  calculated_value2
date                                                              
2021-03-01  0.513929     8.4                NaN                NaN
2021-04-01  0.447762     6.2                NaN                NaN
2021-05-01  0.377096     6.2                NaN                NaN
2021-06-01  0.355096     6.0                NaN                NaN
2021-07-01  0.271429     3.9                NaN                NaN
2021-08-01  0.291762     8.5           0.000000                0.0
2021-09-01  0.476762     8.3           0.185000               -0.2
2021-10-01  0.335929     5.3           0.044167               -3.2
2021-11-01  0.280429     5.6          -0.011333               -2.9
2021-12-01  0.283762     5.3          -0.008000               -3.2
2022-01-01  0.452762     9.6           0.031167               -2.3
2022-01-01  0.322929     6.2           0.031167               -2.3
2022-02-01  0.372262     8.0          -0.004500               -2.2
2022-02-01  0.287262     6.3          -0.004500               -2.2
2022-03-01  0.316929     6.9           0.025167               -1.6
2022-04-01  0.209262     4.8          -0.082500               -3.7
2022-05-01  0.407929     6.7           0.116167               -1.8
2022-06-01  0.254262     3.6          -0.037500               -4.9
2022-07-01  0.232096     3.0          -0.059667               -5.5
2022-08-01  0.264262     4.6           0.000000                0.0
2022-09-01  0.076096     2.3          -0.188167               -2.3
2022-10-01 -0.025238     1.3          -0.289500               -3.3
2022-11-01 -0.042238     1.0          -0.306500               -3.6
2023-01-01  0.017429     1.6          -0.246833               -3.0
2023-02-01 -0.036071     0.4          -0.300333               -4.2
2023-03-01 -0.094071     1.5          -0.358333               -3.1
2023-04-01 -0.071404     1.4          -0.335667               -3.2
2023-05-01  0.008096     2.2          -0.256167               -2.4
2023-06-01 -0.141571     1.2          -0.405833               -3.4
2023-12-01 -0.094904     0.3          -0.359167               -4.3

英文:

If I understand you correctly, you can use .groupby() and then calculate the difference in each group:

df.index = pd.to_datetime(df.index)
tmp = (df.index.month == 8).cumsum()
out = (
    df[tmp != 0]
    .groupby(tmp[tmp &gt; 0], group_keys=False)
    .apply(lambda x: x[[&quot;value1&quot;, &quot;value2&quot;]] - x[[&quot;value1&quot;, &quot;value2&quot;]].iloc[0])
).add_prefix(&#39;calculated_&#39;)
x = df.join(out)
print(x)

Prints:

              value1  value2  calculated_value1  calculated_value2
date                                                              
2021-03-01  0.513929     8.4                NaN                NaN
2021-04-01  0.447762     6.2                NaN                NaN
2021-05-01  0.377096     6.2                NaN                NaN
2021-06-01  0.355096     6.0                NaN                NaN
2021-07-01  0.271429     3.9                NaN                NaN
2021-08-01  0.291762     8.5           0.000000                0.0
2021-09-01  0.476762     8.3           0.185000               -0.2
2021-10-01  0.335929     5.3           0.044167               -3.2
2021-11-01  0.280429     5.6          -0.011333               -2.9
2021-12-01  0.283762     5.3          -0.008000               -3.2
2022-01-01  0.452762     9.6           0.031167               -2.3
2022-01-01  0.322929     6.2           0.031167               -2.3
2022-02-01  0.372262     8.0          -0.004500               -2.2
2022-02-01  0.287262     6.3          -0.004500               -2.2
2022-03-01  0.316929     6.9           0.025167               -1.6
2022-04-01  0.209262     4.8          -0.082500               -3.7
2022-05-01  0.407929     6.7           0.116167               -1.8
2022-06-01  0.254262     3.6          -0.037500               -4.9
2022-07-01  0.232096     3.0          -0.059667               -5.5
2022-08-01  0.264262     4.6           0.000000                0.0
2022-09-01  0.076096     2.3          -0.188167               -2.3
2022-10-01 -0.025238     1.3          -0.289500               -3.3
2022-11-01 -0.042238     1.0          -0.306500               -3.6
2023-01-01  0.017429     1.6          -0.246833               -3.0
2023-02-01 -0.036071     0.4          -0.300333               -4.2
2023-03-01 -0.094071     1.5          -0.358333               -3.1
2023-04-01 -0.071404     1.4          -0.335667               -3.2
2023-05-01  0.008096     2.2          -0.256167               -2.4
2023-06-01 -0.141571     1.2          -0.405833               -3.4
2023-12-01 -0.094904     0.3          -0.359167               -4.3

答案2

得分: 0

以下是您要翻译的内容：

There are several ways to solve your issues. The solution depends on how your data might looks like

There are some flaws in your code too. Here is my suggestion

You don't need to set_index('date') here.

Here is:

df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)

The first line is to convert your date column to datetime format, and the second is to sort your data in ascending order.

The most common way is to extract subsample with august only then merge those value back to your main dataframe and perform the calculation. This might be long, but it will shows any potential error if there is in your data

The complete solution:

import pandas as pd
import numpy as np
from dateutil.relativedelta import *
from pandas.tseries.offsets import *
df = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-02-01 00:00:00', 
                            '2021-03-01 00:00:00', '2021-04-01 00:00:00',
                            '2021-05-01 00:00:00', '2021-06-01 00:00:00',
                            '2021-07-01 00:00:00', '2021-08-01 00:00:00',
                            '2021-09-01 00:00:00', '2021-10-01 00:00:00',
                            '2021-11-01 00:00:00', '2021-12-01 00:00:00',
                            '2022-01-01 00:00:00', '2022-02-01 00:00:00',
                            '2022-03-01 00:00:00', '2022-04-01 00:00:00',
                            '2022-05-01 00:00:00', '2022-06-01 00:00:00',
                            '2022-07-01 00:00:00', '2022-08-01 00:00:00',
                            '2022-09-01 00:00:00', '2022-10-01 00:00:00',
                            '2022-11-01 00:00:00', '2023-12-01 00:00:00',
                            '2023-01-01 00:00:00', '2023-02-01 00:00:00',
                            '2023-03-01 00:00:00', '2023-04-01 00:00:00',
                            '2023-05-01 00:00:00', '2023-06-01 00:00:00', ], 
                   'value1': [0.452762281,0.372262281,0.513928948,0.447762281,
                              0.377095615,0.355095615,0.271428948,0.291762281,
                              0.476762281,0.335928948,0.280428948,0.283762281,
                              0.322928948,0.287262281,0.316928948,0.209262281,
                              0.407928948,0.254262281,0.232095615,0.264262281,
                              0.076095615,-0.025237719,-0.042237719,-0.094904385,
                              0.017428948,-0.036071052,-0.094071052,-0.071404385,
                              0.008095615,-0.141571052],
                   'value2': [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                              6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                              1.6,0.4,1.5,1.4,2.2,1.2]})
# cleaning data
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)
# creat sub dataframe with august value only
aug = df.loc[df['date'].dt.month==8].copy()
aug.rename(columns={'value1': 'augval1', 'value2': 'augval2', 'date':'date_aug'}, inplace=True)
# create equivalent august date for merge
df['date_aug'] = np.where(df['date'].dt.month < 8, df['date'] + YearBegin(-1) + MonthBegin(7), df['date'] + YearBegin(1) + MonthBegin(7))
# merge aug dataframe to df dataframe
new = pd.merge(df, aug, on='date_aug', how='left')
# perform calculation
new['chg_val1'] = new['value1'].diff()
new['chg_aug'] = new['value1'] - new['augval1']

希望这有助于您的项目！

英文:

There are several ways to solve your issues. The solution depends on how your data might looks like

There are some flaws in your code too. Here is my suggestion

You don't need to set_index('date') here.

Here is:

df[&#39;date&#39;] = pd.to_datetime(df[&#39;date&#39;])
df.sort_values(by=[&#39;date&#39;], ignore_index=True, inplace=True)

The first line is to convert your date column to datetime format, and the second is to sort your data in ascending order.

The most common way is to extract subsample with august only then merge those value back to your main dataframe and perform the calculation. This might be long, but it will shows any potential error if there is in your data

The complete solution:

import pandas as pd
import numpy as np
from dateutil.relativedelta import *
from pandas.tseries.offsets import *
df = pd.DataFrame({&#39;date&#39;: [&#39;2022-01-01 00:00:00&#39;, &#39;2022-02-01 00:00:00&#39;, 
&#39;2021-03-01 00:00:00&#39;, &#39;2021-04-01 00:00:00&#39;,
&#39;2021-05-01 00:00:00&#39;, &#39;2021-06-01 00:00:00&#39;,
&#39;2021-07-01 00:00:00&#39;, &#39;2021-08-01 00:00:00&#39;,
&#39;2021-09-01 00:00:00&#39;, &#39;2021-10-01 00:00:00&#39;,
&#39;2021-11-01 00:00:00&#39;, &#39;2021-12-01 00:00:00&#39;,
&#39;2022-01-01 00:00:00&#39;, &#39;2022-02-01 00:00:00&#39;,
&#39;2022-03-01 00:00:00&#39;, &#39;2022-04-01 00:00:00&#39;,
&#39;2022-05-01 00:00:00&#39;, &#39;2022-06-01 00:00:00&#39;,
&#39;2022-07-01 00:00:00&#39;, &#39;2022-08-01 00:00:00&#39;,
&#39;2022-09-01 00:00:00&#39;, &#39;2022-10-01 00:00:00&#39;,
&#39;2022-11-01 00:00:00&#39;, &#39;2023-12-01 00:00:00&#39;,
&#39;2023-01-01 00:00:00&#39;, &#39;2023-02-01 00:00:00&#39;,
&#39;2023-03-01 00:00:00&#39;, &#39;2023-04-01 00:00:00&#39;,
&#39;2023-05-01 00:00:00&#39;, &#39;2023-06-01 00:00:00&#39;,
], 
&#39;value1&#39;: [0.452762281,0.372262281,0.513928948,0.447762281,
0.377095615,0.355095615,0.271428948,0.291762281,
0.476762281,0.335928948,0.280428948,0.283762281,
0.322928948,0.287262281,0.316928948,0.209262281,
0.407928948,0.254262281,0.232095615,0.264262281,
0.076095615,-0.025237719,-0.042237719,-0.094904385,
0.017428948,-0.036071052,-0.094071052,-0.071404385,
0.008095615,-0.141571052],
&#39;value2&#39;: [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
1.6,0.4,1.5,1.4,2.2,1.2]})
# cleaning data
df[&#39;date&#39;] = pd.to_datetime(df[&#39;date&#39;])
df.sort_values(by=[&#39;date&#39;], ignore_index=True, inplace=True)
# creat sub dataframe with august value only
aug = df.loc[df[&#39;date&#39;].dt.month==8].copy()
aug.rename(columns={&#39;value1&#39;: &#39;augval1&#39;, &#39;value2&#39;: &#39;augval2&#39;, &#39;date&#39;:&#39;date_aug&#39;}, inplace=True)
# create equivalent august date for merge
df[&#39;date_aug&#39;] = np.where(df[&#39;date&#39;].dt.month &lt; 8, df[&#39;date&#39;] + YearBegin(-1) + MonthBegin(7), df[&#39;date&#39;] + YearBegin(1) + MonthBegin(7))
# merge aug dataframe to df dataframe
new = pd.merge(df, aug, on=&#39;date_aug&#39;, how=&#39;left&#39;)
# perform calculation
new[&#39;chg_val1&#39;] = new[&#39;value1&#39;].diff()
new[&#39;chg_aug&#39;] = new[&#39;value1&#39;] - new[&#39;augval1&#39;]

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何找出值从它们的周期开始时发生了怎样的变化？

问题

答案1

答案2

Python：网络爬虫 Pandas 数据框在数据之间返回多个空行

My code在从输入文本框获取数据到Python Flask时似乎有问题。

为什么 tkinter 没有显示任何内容？

How to sort array elements based on the closest occurrence of a sublist with a numeric value of 1.0? And then combine this sorted matrix with another

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。