How to average values over date ranges defined in a different table in Pandas
Question
With two tables, Values and dates, I would like to get the average value within each date range.
Values looks like:
Date | Value |
---|---|
2023-01-01 10:00 | 1 |
2023-01-01 11:00 | 2 |
2023-01-02 10:00 | 4 |
2023-01-04 10:00 | 4 |
2023-01-07 10:00 | 4 |
and dates looks like:
Group | StartDay | EndDay |
---|---|---|
1 | 2023-01-01 | 2023-01-05 |
2 | 2023-01-03 | 2023-01-10 |
As you can see, the date ranges can overlap.
I am trying to calculate the averages over these ranges, so in this example the output should look something like this:
Group | StartDay | EndDay | Mean |
---|---|---|---|
1 | 2023-01-01 | 2023-01-05 | 2.75 |
2 | 2023-01-03 | 2023-01-10 | 4 |
Currently my code looks like (all one line):
Values.groupby(np.where(Values['Date'].between(Dates['StartDay'],Dates['EndDay']),'pre','post'))['value'].mean()
However, this results in:
ValueError: Can only compare identically-labeled Series objects
This was based on answers to similar questions, but it does not appear to apply here because it spans two tables and uses date ranges.
Answer 1
Score: 3
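Try (here df1 is the Values table and df2 is the dates table):
import pandas as pd

# Convert the columns to datetime (if they are not already):
df1['Date'] = pd.to_datetime(df1['Date'])
df2['StartDay'] = pd.to_datetime(df2['StartDay'])
df2['EndDay'] = pd.to_datetime(df2['EndDay'])

# Sort by date and use it as the index so rows can be sliced by date range:
df1 = df1.sort_values(by='Date').set_index('Date')

# For each range, slice the rows from StartDay to EndDay (both inclusive) and average 'Value':
df2['Mean'] = df2.apply(lambda x: df1.loc[x['StartDay']:x['EndDay'], 'Value'].mean(), axis=1)
print(df2)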
Prints:
Group StartDay EndDay Mean
0 1 2023-01-01 2023-01-05 2.75
1 2 2023-01-03 2023-01-10 4.00
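For anyone who wants to run this end to end, here is a minimal self-contained sketch on the question's sample data. The variable names `values` and `dates` are only illustrative, and the frames are built by hand rather than loaded from the asker's source. Instead of slicing an index, it calls `Series.between` with scalar bounds, one range at a time. The original attempt passed two whole Series to `between`, and pandas refuses to compare a 5-row Series against 2-row Series with different labels, which is where the ValueError comes from.

```python
import pandas as pd

# Sample data copied from the question's tables.
values = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 11:00", "2023-01-02 10:00",
        "2023-01-04 10:00", "2023-01-07 10:00",
    ]),
    "Value": [1, 2, 4, 4, 4],
})
dates = pd.DataFrame({
    "Group": [1, 2],
    "StartDay": pd.to_datetime(["2023-01-01", "2023-01-03"]),
    "EndDay": pd.to_datetime(["2023-01-05", "2023-01-10"]),
})

# One boolean mask per range: between() receives scalar Timestamps here,
# so no alignment between the two tables is needed.
dates["Mean"] = [
    values.loc[values["Date"].between(start, end), "Value"].mean()
    for start, end in zip(dates["StartDay"], dates["EndDay"])
]

print(dates)
```

Note that `EndDay` is a midnight timestamp, so a value recorded later on the end day itself (say 2023-01-05 10:00) would fall outside the range. If the end day should count in full, one option is to compare against `end + pd.Timedelta(days=1)` with `inclusive="left"`.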
Answer 2
Score: 0
This looks like a form of inequality join. If that is the case, you can use conditional_join from pyjanitor to get the matching rows before grouping. Depending on the data size, it should be faster than apply:
# pip install pyjanitor
import pandas as pd
import janitor

df1 = pd.read_clipboard(sep="\t", engine='python', parse_dates=['Date'])
df2 = pd.read_clipboard(sep="\t", engine='python', parse_dates=['StartDay', 'EndDay'])

(df1
 .conditional_join(
     df2,
     # column from left, column from right, comparator
     ('Date', 'StartDay', '>='),
     ('Date', 'EndDay', '<='),
     # depending on the data size,
     # you might get more performance with numba
     use_numba=False,
     df_columns='Value')
 .groupby(df2.columns.tolist())
 .Value
 .mean()
)
Group StartDay EndDay
1 2023-01-01 2023-01-05 2.75
2 2023-01-03 2023-01-10 4.00
Name: Value, dtype: float64
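If installing pyjanitor is not an option, the same inequality join can be approximated in plain pandas with a cross join followed by a filter. This is a different technique from `conditional_join`, not what it does internally, and it is only a sketch: it materialises every value/range pair, so memory grows with the product of the two table sizes. It also assumes pandas 1.2 or later for `how="cross"`. The names `values`, `dates`, `pairs`, and `means` are illustrative.

```python
import pandas as pd

# Sample data copied from the question's tables.
values = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 11:00", "2023-01-02 10:00",
        "2023-01-04 10:00", "2023-01-07 10:00",
    ]),
    "Value": [1, 2, 4, 4, 4],
})
dates = pd.DataFrame({
    "Group": [1, 2],
    "StartDay": pd.to_datetime(["2023-01-01", "2023-01-03"]),
    "EndDay": pd.to_datetime(["2023-01-05", "2023-01-10"]),
})

# Pair every value row with every range row (5 x 2 = 10 rows here).
pairs = values.merge(dates, how="cross")

# All three columns now share one index, so a row-wise between() is valid.
in_range = pairs["Date"].between(pairs["StartDay"], pairs["EndDay"])

# Keep only the matching pairs and average per range.
means = (
    pairs[in_range]
    .groupby(["Group", "StartDay", "EndDay"], as_index=False)["Value"]
    .mean()
    .rename(columns={"Value": "Mean"})
)

print(means)
```

As with an inner join, a range that contains no values drops out of the result. Merging `means` back onto `dates` with `how="left"` would keep such ranges with a NaN mean.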