2023-02-23 22:59:02

How to average values over date ranges from a different table in Pandas

Question

I have two tables, Values and dates, and I would like to get the average value within each date range.
Values looks like:

Date	Value
2023-01-01 10:00	1
2023-01-01 11:00	2
2023-01-02 10:00	4
2023-01-04 10:00	4
2023-01-07 10:00	4

and dates looks like:

Group	StartDay	EndDay
1	2023-01-01	2023-01-05
2	2023-01-03	2023-01-10

As you can see, the date ranges can overlap.

I am trying to calculate the averages over these ranges, so in this example the output should be something along the lines of:

Group	StartDay	EndDay	Mean
1	2023-01-01	2023-01-05	2.75
2	2023-01-03	2023-01-10	4

Currently my code looks like (all one line):

Values.groupby(np.where(Values['Date'].between(Dates['StartDay'],Dates['EndDay']),'pre','post'))['value'].mean()

However, this results in:

ValueError: Can only compare identically-labeled Series objects

This was based on other similar questions, but it does not appear to apply here because the comparison spans two tables and uses ranges.
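
For reference, the error can be reproduced in isolation. The sketch below rebuilds the two tables from the question (the lowercase names `values` and `dates` are assumptions for illustration) and calls `between` with whole columns as bounds. pandas compares Series element by element using their index labels, so a 5-row column cannot be compared against 2-row bounds.

```python
import pandas as pd

# Hypothetical reconstruction of the two tables shown in the question.
values = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 11:00", "2023-01-02 10:00",
        "2023-01-04 10:00", "2023-01-07 10:00",
    ]),
    "Value": [1, 2, 4, 4, 4],
})
dates = pd.DataFrame({
    "Group": [1, 2],
    "StartDay": pd.to_datetime(["2023-01-01", "2023-01-03"]),
    "EndDay": pd.to_datetime(["2023-01-05", "2023-01-10"]),
})

# The Date column has index labels 0..4, the bounds have labels 0..1,
# so the element-wise comparison inside between() cannot line them up.
try:
    values["Date"].between(dates["StartDay"], dates["EndDay"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Each date range therefore has to be applied to the Values table separately, or the two tables have to be joined first, which is what the answers do.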

Answer 1

Score: 3

Try:

import pandas as pd

# convert the columns to datetime (if not already):
df1['Date'] = pd.to_datetime(df1['Date'])
df2['StartDay'] = pd.to_datetime(df2['StartDay'])
df2['EndDay'] = pd.to_datetime(df2['EndDay'])

# sort by date and use it as the index so that date slicing works
df1 = df1.sort_values(by='Date').set_index('Date')

# for each range, average the values whose dates fall inside it
df2['Mean'] = df2.apply(lambda x: df1.loc[x['StartDay']:x['EndDay'], 'Value'].mean(), axis=1)
print(df2)

Prints:

   Group   StartDay     EndDay  Mean
0      1 2023-01-01 2023-01-05  2.75
1      2 2023-01-03 2023-01-10  4.00
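
One detail of the slice above may matter with real data: the bounds are Timestamps, so a range ending on 2023-01-05 stops at midnight and a reading taken later that day is left out. The sketch below illustrates this on a made-up two-row table (the `readings` frame and the "whole end day" variant are assumptions, not part of the answer).

```python
import pandas as pd

# Made-up readings: one inside the range, one on the end day itself.
readings = pd.DataFrame(
    {"Value": [4, 10]},
    index=pd.to_datetime(["2023-01-04 10:00", "2023-01-05 10:00"]),
)
start = pd.Timestamp("2023-01-01")
end = pd.Timestamp("2023-01-05")

# Timestamp bounds are exact: the slice stops at 2023-01-05 00:00.
exact = readings.loc[start:end, "Value"]

# Hypothetical variant: treat EndDay as covering the whole calendar day
# by using a half-open interval that ends at the start of the next day.
in_day = (readings.index >= start) & (readings.index < end + pd.Timedelta(days=1))
whole_day = readings.loc[in_day, "Value"]

print(exact.mean())      # only the 2023-01-04 reading is averaged
print(whole_day.mean())  # both readings are averaged
```

Which behaviour is wanted depends on whether EndDay is meant as an instant or as a full day; the expected output in the question is the same either way.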

Answer 2

Score: 0

This looks like some form of inequality join. If that is the case, you can use conditional_join from pyjanitor to get the matching rows and then group them. Depending on the data size, this should be faster than apply:

# pip install pyjanitor
import pandas as pd
import janitor

# copy each table to the clipboard before running the matching line
df1 = pd.read_clipboard(sep="\t", engine='python', parse_dates=['Date'])
df2 = pd.read_clipboard(sep="\t", engine='python', parse_dates=['StartDay', 'EndDay'])

(df1
.conditional_join(
    df2,
    # column from left, column from right, comparator
    ('Date', 'StartDay', '>='),
    ('Date', 'EndDay', '<='),
    # depending on the data size,
    # you might get more performance with numba
    use_numba=False,
    df_columns='Value')
.groupby(df2.columns.tolist())
.Value
.mean()
)

Prints:

Group  StartDay    EndDay    
1      2023-01-01  2023-01-05    2.75
2      2023-01-03  2023-01-10    4.00
Name: Value, dtype: float64
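
If installing pyjanitor is not an option, the same inequality join can be sketched in plain pandas with a cross merge followed by a filter. This is a swapped-in technique rather than what the answer uses; it assumes pandas 1.2 or later (for `how="cross"`) and tables small enough that the full row-by-range product fits in memory. The `values` and `dates` names are illustrative.

```python
import pandas as pd

# Hypothetical reconstruction of the two tables from the question.
values = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 11:00", "2023-01-02 10:00",
        "2023-01-04 10:00", "2023-01-07 10:00",
    ]),
    "Value": [1, 2, 4, 4, 4],
})
dates = pd.DataFrame({
    "Group": [1, 2],
    "StartDay": pd.to_datetime(["2023-01-01", "2023-01-03"]),
    "EndDay": pd.to_datetime(["2023-01-05", "2023-01-10"]),
})

# Pair every value row with every date range (5 x 2 = 10 rows).
pairs = values.merge(dates, how="cross")

# Keep only the pairs where the date falls inside the range. All three
# columns now share one index, so between() can compare them row by row.
in_range = pairs[pairs["Date"].between(pairs["StartDay"], pairs["EndDay"])]

# Average per range; overlapping ranges simply reuse the same values.
means = (
    in_range.groupby(["Group", "StartDay", "EndDay"])["Value"]
    .mean()
    .reset_index(name="Mean")
)
print(means)
```

A range that contains no values drops out of this result; a left merge of `dates` with `means` would bring such rows back with a missing mean.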
