如何找出值从它们的周期开始时发生了怎样的变化?

huangapple go评论61阅读模式
英文:

How to find how the values have changed from the begining of their period?

问题

你可以尝试以下代码来实现你的目标:

# 转换日期列为日期时间类型
df['date'] = pd.to_datetime(df['date'])

# 找到每个周期的开始日期(每年的8月1日)
cycle_start_dates = df[df.index.month == 8]

# 使用reindex方法将周期开始日期应用到整个数据框
df_reindexed = df.reindex(cycle_start_dates.index, method='ffill')

# 计算每个月与其对应的周期开始日期的差异
df_difference = df - df_reindexed

# 打印结果
print(df_difference)

这段代码将计算每个月与其对应的周期开始日期的差异,以实现你想要的目标。

英文:

I have a dataframe with timeindex. There is a period (cycle) that starts every august. I want to calculate the difference between the value of each month and the value that was on the previous august (the beginning of its period).
The aim is to know how much the values have changed during each cycle.

This is an small example of the data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-02-01 00:00:00', 
                          '2021-03-01 00:00:00', '2021-04-01 00:00:00',
                          '2021-05-01 00:00:00', '2021-06-01 00:00:00',
                          '2021-07-01 00:00:00', '2021-08-01 00:00:00',
                          '2021-09-01 00:00:00', '2021-10-01 00:00:00',
                          '2021-11-01 00:00:00', '2021-12-01 00:00:00',
                          '2022-01-01 00:00:00', '2022-02-01 00:00:00',
                          '2022-03-01 00:00:00', '2022-04-01 00:00:00',
                          '2022-05-01 00:00:00', '2022-06-01 00:00:00',
                          '2022-07-01 00:00:00', '2022-08-01 00:00:00',
                          '2022-09-01 00:00:00', '2022-10-01 00:00:00',
                          '2022-11-01 00:00:00', '2023-12-01 00:00:00',
                          '2023-01-01 00:00:00', '2023-02-01 00:00:00',
                          '2023-03-01 00:00:00', '2023-04-01 00:00:00',
                          '2023-05-01 00:00:00', '2023-06-01 00:00:00',
                          ], 
               'value1': [0.452762281,0.372262281,0.513928948,0.447762281,
                            0.377095615,0.355095615,0.271428948,0.291762281,
                            0.476762281,0.335928948,0.280428948,0.283762281,
                            0.322928948,0.287262281,0.316928948,0.209262281,
                            0.407928948,0.254262281,0.232095615,0.264262281,
                            0.076095615,-0.025237719,-0.042237719,-0.094904385,
                            0.017428948,-0.036071052,-0.094071052,-0.071404385,
                            0.008095615,-0.141571052],
               'value2': [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                               6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                               1.6,0.4,1.5,1.4,2.2,1.2]})
df = df.set_index('date')

I have tried to get a dataframe with the same structure but all the values have the value that was on its corresponding august, and then I just need to subtract one dataframe from the other.
But I couldn't find a way to do it neither.

Thank you vary much in advance.

答案1

得分: 0

如果我理解您的意思正确的话,您可以使用.groupby()然后计算每个组的差异:

df.index = pd.to_datetime(df.index)

tmp = (df.index.month == 8).cumsum()
out = (
    df[tmp != 0]
    .groupby(tmp[tmp > 0], group_keys=False)
    .apply(lambda x: x[["value1", "value2"]] - x[["value1", "value2"]].iloc[0])
).add_prefix('calculated_')

x = df.join(out)
print(x)

打印结果:

              value1  value2  calculated_value1  calculated_value2
date                                                              
2021-03-01  0.513929     8.4                NaN                NaN
2021-04-01  0.447762     6.2                NaN                NaN
2021-05-01  0.377096     6.2                NaN                NaN
2021-06-01  0.355096     6.0                NaN                NaN
2021-07-01  0.271429     3.9                NaN                NaN
2021-08-01  0.291762     8.5           0.000000                0.0
2021-09-01  0.476762     8.3           0.185000               -0.2
2021-10-01  0.335929     5.3           0.044167               -3.2
2021-11-01  0.280429     5.6          -0.011333               -2.9
2021-12-01  0.283762     5.3          -0.008000               -3.2
2022-01-01  0.452762     9.6           0.031167               -2.3
2022-01-01  0.322929     6.2           0.031167               -2.3
2022-02-01  0.372262     8.0          -0.004500               -2.2
2022-02-01  0.287262     6.3          -0.004500               -2.2
2022-03-01  0.316929     6.9           0.025167               -1.6
2022-04-01  0.209262     4.8          -0.082500               -3.7
2022-05-01  0.407929     6.7           0.116167               -1.8
2022-06-01  0.254262     3.6          -0.037500               -4.9
2022-07-01  0.232096     3.0          -0.059667               -5.5
2022-08-01  0.264262     4.6           0.000000                0.0
2022-09-01  0.076096     2.3          -0.188167               -2.3
2022-10-01 -0.025238     1.3          -0.289500               -3.3
2022-11-01 -0.042238     1.0          -0.306500               -3.6
2023-01-01  0.017429     1.6          -0.246833               -3.0
2023-02-01 -0.036071     0.4          -0.300333               -4.2
2023-03-01 -0.094071     1.5          -0.358333               -3.1
2023-04-01 -0.071404     1.4          -0.335667               -3.2
2023-05-01  0.008096     2.2          -0.256167               -2.4
2023-06-01 -0.141571     1.2          -0.405833               -3.4
2023-12-01 -0.094904     0.3          -0.359167               -4.3
英文:

If I understand you correctly, you can use .groupby() and then calculate the difference in each group:

df.index = pd.to_datetime(df.index)

tmp = (df.index.month == 8).cumsum()
out = (
    df[tmp != 0]
    .groupby(tmp[tmp > 0], group_keys=False)
    .apply(lambda x: x[["value1", "value2"]] - x[["value1", "value2"]].iloc[0])
).add_prefix('calculated_')

x = df.join(out)
print(x)

Prints:

              value1  value2  calculated_value1  calculated_value2
date                                                              
2021-03-01  0.513929     8.4                NaN                NaN
2021-04-01  0.447762     6.2                NaN                NaN
2021-05-01  0.377096     6.2                NaN                NaN
2021-06-01  0.355096     6.0                NaN                NaN
2021-07-01  0.271429     3.9                NaN                NaN
2021-08-01  0.291762     8.5           0.000000                0.0
2021-09-01  0.476762     8.3           0.185000               -0.2
2021-10-01  0.335929     5.3           0.044167               -3.2
2021-11-01  0.280429     5.6          -0.011333               -2.9
2021-12-01  0.283762     5.3          -0.008000               -3.2
2022-01-01  0.452762     9.6           0.031167               -2.3
2022-01-01  0.322929     6.2           0.031167               -2.3
2022-02-01  0.372262     8.0          -0.004500               -2.2
2022-02-01  0.287262     6.3          -0.004500               -2.2
2022-03-01  0.316929     6.9           0.025167               -1.6
2022-04-01  0.209262     4.8          -0.082500               -3.7
2022-05-01  0.407929     6.7           0.116167               -1.8
2022-06-01  0.254262     3.6          -0.037500               -4.9
2022-07-01  0.232096     3.0          -0.059667               -5.5
2022-08-01  0.264262     4.6           0.000000                0.0
2022-09-01  0.076096     2.3          -0.188167               -2.3
2022-10-01 -0.025238     1.3          -0.289500               -3.3
2022-11-01 -0.042238     1.0          -0.306500               -3.6
2023-01-01  0.017429     1.6          -0.246833               -3.0
2023-02-01 -0.036071     0.4          -0.300333               -4.2
2023-03-01 -0.094071     1.5          -0.358333               -3.1
2023-04-01 -0.071404     1.4          -0.335667               -3.2
2023-05-01  0.008096     2.2          -0.256167               -2.4
2023-06-01 -0.141571     1.2          -0.405833               -3.4
2023-12-01 -0.094904     0.3          -0.359167               -4.3

答案2

得分: 0

以下是您要翻译的内容:

There are several ways to solve your issues. The solution depends on how your data might looks like

There are some flaws in your code too. Here is my suggestion

  1. You don't need to set_index('date') here.

Here is:

df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)

The first line is to convert your date column to datetime format, and the second is to sort your data in ascending order.

  1. The most common way is to extract subsample with august only then merge those value back to your main dataframe and perform the calculation. This might be long, but it will shows any potential error if there is in your data

The complete solution:

import pandas as pd
import numpy as np
from dateutil.relativedelta import *
from pandas.tseries.offsets import *

df = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-02-01 00:00:00', 
                            '2021-03-01 00:00:00', '2021-04-01 00:00:00',
                            '2021-05-01 00:00:00', '2021-06-01 00:00:00',
                            '2021-07-01 00:00:00', '2021-08-01 00:00:00',
                            '2021-09-01 00:00:00', '2021-10-01 00:00:00',
                            '2021-11-01 00:00:00', '2021-12-01 00:00:00',
                            '2022-01-01 00:00:00', '2022-02-01 00:00:00',
                            '2022-03-01 00:00:00', '2022-04-01 00:00:00',
                            '2022-05-01 00:00:00', '2022-06-01 00:00:00',
                            '2022-07-01 00:00:00', '2022-08-01 00:00:00',
                            '2022-09-01 00:00:00', '2022-10-01 00:00:00',
                            '2022-11-01 00:00:00', '2023-12-01 00:00:00',
                            '2023-01-01 00:00:00', '2023-02-01 00:00:00',
                            '2023-03-01 00:00:00', '2023-04-01 00:00:00',
                            '2023-05-01 00:00:00', '2023-06-01 00:00:00', ], 
                   'value1': [0.452762281,0.372262281,0.513928948,0.447762281,
                              0.377095615,0.355095615,0.271428948,0.291762281,
                              0.476762281,0.335928948,0.280428948,0.283762281,
                              0.322928948,0.287262281,0.316928948,0.209262281,
                              0.407928948,0.254262281,0.232095615,0.264262281,
                              0.076095615,-0.025237719,-0.042237719,-0.094904385,
                              0.017428948,-0.036071052,-0.094071052,-0.071404385,
                              0.008095615,-0.141571052],
                   'value2': [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                              6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                              1.6,0.4,1.5,1.4,2.2,1.2]})

# cleaning data
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)

# creat sub dataframe with august value only
aug = df.loc[df['date'].dt.month==8].copy()
aug.rename(columns={'value1': 'augval1', 'value2': 'augval2', 'date':'date_aug'}, inplace=True)

# create equivalent august date for merge
df['date_aug'] = np.where(df['date'].dt.month < 8, df['date'] + YearBegin(-1) + MonthBegin(7), df['date'] + YearBegin(1) + MonthBegin(7))

# merge aug dataframe to df dataframe
new = pd.merge(df, aug, on='date_aug', how='left')

# perform calculation
new['chg_val1'] = new['value1'].diff()
new['chg_aug'] = new['value1'] - new['augval1']

希望这有助于您的项目!

英文:

There are several ways to solve your issues. The solution depends on how your data might looks like

There are some flaws in your code too. Here is my suggestion

  1. You don't need to set_index(&#39;date&#39;) here.

Here is:

df[&#39;date&#39;] = pd.to_datetime(df[&#39;date&#39;])
df.sort_values(by=[&#39;date&#39;], ignore_index=True, inplace=True)

The first line is to convert your date column to datetime format, and the second is to sort your data in ascending order.

  1. The most common way is to extract subsample with august only then merge those value back to your main dataframe and perform the calculation. This might be long, but it will shows any potential error if there is in your data

The complete solution:

import pandas as pd
import numpy as np
from dateutil.relativedelta import *
from pandas.tseries.offsets import *
df = pd.DataFrame({&#39;date&#39;: [&#39;2022-01-01 00:00:00&#39;, &#39;2022-02-01 00:00:00&#39;, 
&#39;2021-03-01 00:00:00&#39;, &#39;2021-04-01 00:00:00&#39;,
&#39;2021-05-01 00:00:00&#39;, &#39;2021-06-01 00:00:00&#39;,
&#39;2021-07-01 00:00:00&#39;, &#39;2021-08-01 00:00:00&#39;,
&#39;2021-09-01 00:00:00&#39;, &#39;2021-10-01 00:00:00&#39;,
&#39;2021-11-01 00:00:00&#39;, &#39;2021-12-01 00:00:00&#39;,
&#39;2022-01-01 00:00:00&#39;, &#39;2022-02-01 00:00:00&#39;,
&#39;2022-03-01 00:00:00&#39;, &#39;2022-04-01 00:00:00&#39;,
&#39;2022-05-01 00:00:00&#39;, &#39;2022-06-01 00:00:00&#39;,
&#39;2022-07-01 00:00:00&#39;, &#39;2022-08-01 00:00:00&#39;,
&#39;2022-09-01 00:00:00&#39;, &#39;2022-10-01 00:00:00&#39;,
&#39;2022-11-01 00:00:00&#39;, &#39;2023-12-01 00:00:00&#39;,
&#39;2023-01-01 00:00:00&#39;, &#39;2023-02-01 00:00:00&#39;,
&#39;2023-03-01 00:00:00&#39;, &#39;2023-04-01 00:00:00&#39;,
&#39;2023-05-01 00:00:00&#39;, &#39;2023-06-01 00:00:00&#39;,
], 
&#39;value1&#39;: [0.452762281,0.372262281,0.513928948,0.447762281,
0.377095615,0.355095615,0.271428948,0.291762281,
0.476762281,0.335928948,0.280428948,0.283762281,
0.322928948,0.287262281,0.316928948,0.209262281,
0.407928948,0.254262281,0.232095615,0.264262281,
0.076095615,-0.025237719,-0.042237719,-0.094904385,
0.017428948,-0.036071052,-0.094071052,-0.071404385,
0.008095615,-0.141571052],
&#39;value2&#39;: [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
1.6,0.4,1.5,1.4,2.2,1.2]})
# cleaning data
df[&#39;date&#39;] = pd.to_datetime(df[&#39;date&#39;])
df.sort_values(by=[&#39;date&#39;], ignore_index=True, inplace=True)
# creat sub dataframe with august value only
aug = df.loc[df[&#39;date&#39;].dt.month==8].copy()
aug.rename(columns={&#39;value1&#39;: &#39;augval1&#39;, &#39;value2&#39;: &#39;augval2&#39;, &#39;date&#39;:&#39;date_aug&#39;}, inplace=True)
# create equivalent august date for merge
df[&#39;date_aug&#39;] = np.where(df[&#39;date&#39;].dt.month &lt; 8, df[&#39;date&#39;] + YearBegin(-1) + MonthBegin(7), df[&#39;date&#39;] + YearBegin(1) + MonthBegin(7))
# merge aug dataframe to df dataframe
new = pd.merge(df, aug, on=&#39;date_aug&#39;, how=&#39;left&#39;)
# perform calculation
new[&#39;chg_val1&#39;] = new[&#39;value1&#39;].diff()
new[&#39;chg_aug&#39;] = new[&#39;value1&#39;] - new[&#39;augval1&#39;]

huangapple
  • 本文由 发表于 2023年6月12日 16:13:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/76454704.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定