Pandas DataFrame grouping by custom dates (quarters)
Question
I am working with DataFrames containing website hits and minutes per day. I need to group (and sum) them by quarter, with two caveats:
- The quarter start dates are non-standard (e.g. Q1 runs from 27 Feb to 26 May...).
- I have a wide range of dates for different websites, so I want to be able to specify a day and month (irrespective of year) and have the DataFrame grouped accordingly.
Below is a way to replicate the type of DataFrame I am working with:
import random
import pandas as pd
date_range = pd.date_range(start='2018-1-1', end='2022-10-03')
daily_views = [random.randint(1000,9999) for i in range(len(date_range))]
daily_minutes = [random.randint(1000,9999) for i in range(len(date_range))]
df = pd.DataFrame(
    {'DailyViews': daily_views, 'DailyMinutes': daily_minutes},
    index=date_range,
)
So far I have tried df_grouped = df.groupby(df.index.shift(freq='Q')).sum(), but I could not get the offset to work: it seems to shift by a whole number of quarters, and I am looking for finer control.
I have also tried df_resampled = df.resample('Q', convention='end', offset=datetime.timedelta(days=25)).sum(), but changing the offset does not seem to affect the output.
I am currently trying to compute the quarter for each row manually (with .apply()) before calling pd.to_datetime() and grouping. This feels very inefficient and long-winded; surely there is a simpler, more elegant way to get to the answer?
Any help would be greatly appreciated!
Many Thanks
EDIT, TEMPORARY SOLUTION:
I have implemented a temporary solution to allow me to change the starting date of quarters by increments of one month:
import numpy as np

quarter_start_month = 2
# Map each calendar month (1-12) to its shifted quarter number (1-4).
month_to_quarter_mapping = {
    i + 1: (12 + ((i - (quarter_start_month - 1) % 3) // 3)) % 4 + 1 for i in range(12)
}
df["QMap"] = df.index.month.map(month_to_quarter_mapping)
# Early-year months that fall in Q4 belong to the previous year's Q4.
df["YMap"] = np.where(
    (df.QMap == 4) & (df.index.month.to_series(index=df.index) < 5),
    df.index.year.map(lambda x: x - 1),
    df.index.year)
df["Quarter"] = df.YMap.astype(str) + "Q" + df.QMap.astype(str)
df.drop(columns=["QMap", "YMap"], inplace=True)
df = df.groupby("Quarter").sum()
This is basically a very manual way of working out which quarter a date falls in, storing it as a string, and then grouping on that column.
The caveat is that I am stuck with a string index (e.g. "2020Q1"): if I try to convert back with pd.to_datetime(df.index), it interprets the labels as standard quarters and uses the start date of the standard quarter, even when that date is not actually in the modified quarter.
BONUS QUESTION:
If anyone knows a specific command to make the DataFrame display "2022Q1" rather than "2022-01-01", that would be very helpful.
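On the bonus question, one option that may help is a quarterly PeriodIndex, which prints as "2022Q1" while still knowing its real boundaries. The sketch below uses a small made-up frame (df_demo is a hypothetical stand-in for the data above) and standard calendar quarters. As far as I know, quarterly periods can only be anchored to month ends (for example Q-JAN), so this does not by itself handle a quarter starting on the 27th.

```python
import pandas as pd

# Small daily index standing in for the question's data (hypothetical values).
idx = pd.date_range("2022-01-01", "2022-06-30", freq="D")
df_demo = pd.DataFrame({"DailyViews": 1}, index=idx)

# to_period("Q") turns each timestamp into a quarterly Period,
# which is displayed as "2022Q1" instead of "2022-01-01".
quarterly = df_demo.groupby(df_demo.index.to_period("Q")).sum()
print(quarterly)

# Each Period still exposes the dates it actually covers.
first = quarterly.index[0]
print(str(first), first.start_time.date(), first.end_time.date())
```

Because the index holds Period objects rather than strings, it can be sorted and sliced chronologically without converting back through pd.to_datetime().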
Answer 1
Score: 1
I finally came up with a solution, so my peace of mind is safe!
I am not sure I understood your second caveat, so I assumed you wanted to group by quarter independently of the year across your whole dataset. The hardest part was probably grouping on custom quarters anyway, since a lot of pandas methods round to the start/end of the month or quarter and the offset, as you said, does not seem to work the way we would like.
# Define the custom quarters as a range of intervals, closed on the left
# (the right edge of each bin is excluded).
# To guarantee full coverage, the range simply spans from the year before the
# earliest date to the latest year in the data. This is suboptimal, as some
# intervals may end up unused.
starting_year = df.index.min().year - 1
nb_periods = (df.index.max().year - starting_year) * 4 + 1
quarters_intervals = pd.interval_range(start=pd.Timestamp(f'{starting_year}-11-27'),
                                       freq=pd.offsets.DateOffset(months=3),
                                       periods=nb_periods, closed='left')
# Sort the dates into the custom intervals:
df['QuarterInterval'] = pd.cut(df.index.to_series(), bins=quarters_intervals)
# pd.cut() ignores its labels argument when bins is an IntervalIndex,
# so map each interval to its quarter name instead:
quarters_labels = [f'Q{(i + 3) % 4 + 1}' for i in range(len(quarters_intervals))]
mapper = dict(zip(quarters_intervals, quarters_labels))
df['Quarter'] = df['QuarterInterval'].map(mapper)
# Quick check:
print(df.sample(5))
Output:
            DailyViews  DailyMinutes           QuarterInterval Quarter
2020-12-09        3054          7496  [2020-11-27, 2021-02-27)      Q4
2022-04-30        9396          5273  [2022-02-27, 2022-05-27)      Q1
2022-02-07        2076          7088  [2021-11-27, 2022-02-27)      Q4
2019-10-25        9506          5835  [2019-08-27, 2019-11-27)      Q3
2018-09-16        2001          6334  [2018-08-27, 2018-11-27)      Q3
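To make the boundary behaviour concrete, here is a minimal sketch of how left-closed intervals treat the dates around a quarter start. The four dates and the names bins, dates and assigned are made up for illustration; only the 27th-of-the-month boundary comes from the question.

```python
import pandas as pd

# Three left-closed custom quarters starting on 27 Nov 2021.
bins = pd.interval_range(
    start=pd.Timestamp("2021-11-27"),
    freq=pd.DateOffset(months=3),
    periods=3,
    closed="left",
)

# Four consecutive days straddling the 27 Feb boundary.
dates = pd.Series(pd.date_range("2022-02-25", "2022-02-28", freq="D"))

# With closed="left", 26 Feb still falls in the interval that started
# on 27 Nov, while 27 Feb opens the next one.
assigned = pd.cut(dates, bins=bins)
print(pd.DataFrame({"date": dates, "interval": assigned}))
```

This matches the sample above, where 2022-02-07 lands in [2021-11-27, 2022-02-27).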
and finally...
# Let's get our quarterly grouped statistics:
print(df.groupby('Quarter')[['DailyViews', 'DailyMinutes']].sum())
Output:
         DailyViews  DailyMinutes
Quarter
Q1          2413595       2420882
Q2          2541627       2533811
Q3          2226040       2212354
Q4          2268091       2320683
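The grouping above pools each quarter across all years. If per-year quarters such as "2022Q1" are wanted instead, one possible variation is to compute the label arithmetically from a configurable start day and month. This is only a sketch: custom_quarter_labels, start_month and start_day are hypothetical names, start_day is assumed to be at most 28 so that it exists in every month, and the year in the label is assumed to be the year in which Q1 starts (the convention used in the question's temporary solution).

```python
import numpy as np
import pandas as pd

def custom_quarter_labels(index, start_month=2, start_day=27):
    """Return labels like '2022Q1' for quarters starting on start_day/start_month.

    Hypothetical helper; assumes start_day <= 28.
    """
    year = index.year.to_numpy()
    month = index.month.to_numpy()
    day = index.day.to_numpy()
    # Count months since year 0; days before start_day still belong
    # to the block that began in the previous month.
    months = year * 12 + (month - 1) - (day < start_day).astype(int)
    # Shift so that the first month of Q1 becomes month 0 of the custom year.
    shifted = months - (start_month - 1)
    custom_year = shifted // 12
    quarter = (shifted % 12) // 3 + 1
    return np.array([f"{y}Q{q}" for y, q in zip(custom_year, quarter)])

# Made-up data spanning two quarter boundaries (27 Feb and 27 May).
idx = pd.date_range("2022-02-20", "2022-06-05", freq="D")
demo = pd.DataFrame({"DailyViews": 1, "DailyMinutes": 2}, index=idx)

result = demo.groupby(custom_quarter_labels(demo.index)).sum()
print(result)
```

With one view per day, the DailyViews column simply counts the days in each custom quarter, which makes it easy to check that 26 Feb and 26 May fall on the expected side of each boundary.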
Answer 2
Score: -1
After a lot of searching through the Pandas documentation, I believe I have found a solution to your conundrum.
Here is my code:
from pandas.tseries.offsets import QuarterBegin
from pandas.tseries.frequencies import to_offset
month = 1
days = 15
df_resampled = df.resample(QuarterBegin(startingMonth=month)).sum()
df_resampled.index = df_resampled.index + to_offset(f"{days}D")
The first thing I changed was the rule used to resample the time series index. QuarterBegin is a Pandas offset that steps between quarter start dates (see its documentation). One of its parameters, startingMonth, sets the month in which the quarters start, so it can easily be driven by a variable. For the day offset, the most recent approach I found (at the very bottom of the page linked here) replaces the deprecated loffset parameter by adding the offset to the resampled index afterwards. The offset is built with the to_offset function, which takes a string; here the string is the days variable followed by a capital "D", which stands for days.
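One thing worth checking with this approach: as far as I can tell, adding an offset to the resampled index only moves the labels, while the sums are still computed over bins that start on the first of the month. The sketch below illustrates this on made-up data (demo and its single column are hypothetical), using one view per day so that each sum is just a day count.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset
from pandas.tseries.offsets import QuarterBegin

# One "view" per day for 2018, so each quarterly sum equals the number of days.
idx = pd.date_range("2018-01-01", "2018-12-31", freq="D")
demo = pd.DataFrame({"DailyViews": 1}, index=idx)

resampled = demo.resample(QuarterBegin(startingMonth=1)).sum()
before = resampled["DailyViews"].tolist()

# Shift the labels by 15 days, as in the code above.
resampled.index = resampled.index + to_offset("15D")
after = resampled["DailyViews"].tolist()

# The labels move to the 16th, but the sums still cover calendar quarters.
print(resampled)
print(before == after)
```

If the sums really need to cover the shifted date ranges, the rows have to be reassigned to bins before summing, rather than relabelled afterwards.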
I also found a less precise way of doing it, if you only care about the month:
from pandas.tseries.offsets import QuarterBegin
month = 3
df_grouped = df.groupby(df.index.shift(freq=QuarterBegin(startingMonth=month))).sum()
Hope this helps!