2023年5月7日 17:30:12go评论68阅读模式

英文:

Pandas groupby time gaps

问题

我有一个名为df_merged的数据框，输出如下所示：

index	count_date	time_gap	device	count	average_speed
0	2018-12-05	0 days 00:00:00	CAT17	0	0
1	2018-12-05	0 days 00:00:00	CAT17	0	0
2	2018-12-05	0 days 00:00:00	CAT17	0	0
3	2018-12-05	0 days 00:00:00	CAT17	0	0
4	2018-12-05	0 days 01:00:00	CAT17	0	0
...	...	...	...	...	...
154747	2023-05-04	0 days 22:00:00	CAT17	0	0
154748	2023-05-04	0 days 23:00:00	CAT17	4	16
154749	2023-05-04	0 days 23:00:00	CAT17	0	0
154750	2023-05-04	0 days 23:00:00	CAT17	1	13
154751	2023-05-04	0 days 23:00:00	CAT17	3	17

这是关于这个数据框的信息：

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154752 entries, 0 to 154751
Data columns (total 5 columns):
Column Non-Null Count Dtype
0 count_date 154752 non-null datetime64[ns]
1 time_gap 0 non-null category
2 device 154752 non-null object
3 count 154752 non-null int32
4 average_speed 154752 non-null int32
dtypes: category(1), datetime64ns, int32(2), object(1)
memory usage: 3.7+ MB

time_gap是一个类别因为我使用了pd.cut()函数替换它。我不知道在这里是否最好更改dtype以进行分组。

我想要按time_gap分组，知道average_speed必须按照以下函数的计数加权：

def average_speed_mean(x):
    try: 
        return np.average(x["average_speed"], weights=x["count"])
    except ZeroDivisionError:
        return 0

我尝试过以下操作：

df_merged = df_merged.groupby("time_gap").agg({("count", sum), ("average_speed", average_speed_mean)})

但似乎不起作用，我不知道如何解决这个问题。

谢谢你的帮助。

英文:

I have a dataframe called df_merged which outputs like below:

index	count_date	time_gap	device	count	average_speed
0	2018-12-05	0 days 00:00:00	CAT17	0	0
1	2018-12-05	0 days 00:00:00	CAT17	0	0
2	2018-12-05	0 days 00:00:00	CAT17	0	0
3	2018-12-05	0 days 00:00:00	CAT17	0	0
4	2018-12-05	0 days 01:00:00	CAT17	0	0
...	...	...	...	...	...
154747	2023-05-04	0 days 22:00:00	CAT17	0	0
154748	2023-05-04	0 days 23:00:00	CAT17	4	16
154749	2023-05-04	0 days 23:00:00	CAT17	0	0
154750	2023-05-04	0 days 23:00:00	CAT17	1	13
154751	2023-05-04	0 days 23:00:00	CAT17	3	17

Here is the info on this df:

&lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
RangeIndex: 154752 entries, 0 to 154751
Data columns (total 5 columns):

     Column         Non-Null Count   Dtype                  
 0   count_date     154752 non-null  datetime64[ns]
 1   time_gap       0 non-null       category      
 2   device         154752 non-null  object        
 3   count          154752 non-null  int32         
 4   average_speed  154752 non-null  int32         
dtypes: category(1), datetime64[ns](1), int32(2), object(1)
memory usage: 3.7+ MB

time_gap is a category because I replaced it with a pd.cut() function. I don't know if it's best to change the dtype here to groupby.

I would like to groupby time_gap, knowing that the average_speed has to be weighted by count with this function:

def average_speed_mean(x):
    try: 
        return np.average(x[&quot;average_speed&quot;], weights=x[&quot;count&quot;])
    except ZeroDivisionError:
        return 0

I'm trying to have basically the same dataframe grouped like this:

index	count_date	time_gap	device	count	average_speed
0	2018-12-05	0 days 00:00:00	CAT17	0	0
1	2018-12-05	0 days 01:00:00	CAT17	0	0
...	...	...	...	...	...
38687	2023-05-04	0 days 22:00:00	CAT17	0	0
38688	2023-05-04	0 days 23:00:00	CAT17	8	16

I tried this:

df_merged = df_merged.groupby(&quot;time_gap&quot;).agg({(&quot;count&quot;, sum), (&quot;average_speed&quot;, average_speed_mean)})

But it doesn't seem to work out and I have no idea how I could solve this.

Thank you in advance for your help.

答案1

得分: 1

以下是您要翻译的代码部分：

groups = ["count_date", "time_gap", "device"]
result = (
    df_merged
    .assign(average_speed=df_merged["average_speed"] * df_merged["count"])
    .groupby(groups).agg({"count": "sum", "average_speed": "sum"})
    .assign(average_speed=lambda df: df["average_speed"].div(df["count"]).fillna(0))
    .reset_index()
)

结果的小样本如下所示：

   count_date        time_gap device  count  average_speed
0  2018-12-05 0 days 00:00:00  CAT17      0            0.0
1  2018-12-05 0 days 01:00:00  CAT17      0            0.0
2  2023-05-04 0 days 22:00:00  CAT17      0            0.0
3  2023-05-04 0 days 23:00:00  CAT17      8           16.0

请注意，我只提供了代码和结果的翻译，没有其他内容。

英文:

Does the following give you the result you want:

groups = [&quot;count_date&quot;, &quot;time_gap&quot;, &quot;device&quot;]
result = (
    df_merged
    .assign(average_speed=df_merged[&quot;average_speed&quot;] * df_merged[&quot;count&quot;])
    .groupby(groups).agg({&quot;count&quot;: &quot;sum&quot;, &quot;average_speed&quot;: &quot;sum&quot;})
    .assign(average_speed=lambda df: df[&quot;average_speed&quot;].div(df[&quot;count&quot;]).fillna(0))
    .reset_index()
)

groups is the list of columns I think you want to group over - adjust it, if that's not the case.
Multiply the average_speed by the count.
Now group over the groups and sum up the counts and average_speeds per group.
Then divide the summed up average_speed by the total count and fill then NaN with 0.
Finally reset the index to get the group columns back.

Result for the small sample in the question is:

   count_date        time_gap device  count  average_speed
0  2018-12-05 0 days 00:00:00  CAT17      0            0.0
1  2018-12-05 0 days 01:00:00  CAT17      0            0.0
2  2023-05-04 0 days 22:00:00  CAT17      0            0.0
3  2023-05-04 0 days 23:00:00  CAT17      8           16.0

答案2

得分: 0

谢谢，它运行得非常好，我觉得你的解决方案很简洁，可以逐步理解它。以下是我得到的输出：

索引	计数日期	时间间隔	设备	计数	平均速度
0	2018-12-05	0天00:00:00	CAT17	0	0.000000
1	2018-12-05	0天01:00:00	CAT17	0	0.000000
2	2018-12-05	0天02:00:00	CAT17	0	0.000000
3	2018-12-05	0天03:00:00	CAT17	0	0.000000
4	2018-12-05	0天04:00:00	CAT17	0	0.000000
...	...	...	...	...	...
38683	2023-05-04	0天19:00:00	CAT17	56	16.339286
38684	2023-05-04	0天20:00:00	CAT17	39	20.179487
38685	2023-05-04	0天21:00:00	CAT17	14	16.142857
38686	2023-05-04	0天22:00:00	CAT17	4	17.500000
38687	2023-05-04	0天23:00:00	CAT17	8	16.000000

英文:

Thank you, it's working very well and I find your solution pretty neat to understand it step by step.

Here is the output I get:

index	count_date	time_gap	device	count	average_speed
0	2018-12-05	0 days 00:00:00	CAT17	0	0.000000
1	2018-12-05	0 days 01:00:00	CAT17	0	0.000000
2	2018-12-05	0 days 02:00:00	CAT17	0	0.000000
3	2018-12-05	0 days 03:00:00	CAT17	0	0.000000
4	2018-12-05	0 days 04:00:00	CAT17	0	0.000000
...	...	...	...	...	...
38683	2023-05-04	0 days 19:00:00	CAT17	56	16.339286
38684	2023-05-04	0 days 20:00:00	CAT17	39	20.179487
38685	2023-05-04	0 days 21:00:00	CAT17	14	16.142857
38686	2023-05-04	0 days 22:00:00	CAT17	4	17.500000
38687	2023-05-04	0 days 23:00:00	CAT17	8	16.000000

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Pandas按时间间隔分组

问题

答案1

答案2

Pyparsing：如何从各个组收集所有命名结果？

如何获取每个单词的字母总数

比较通过套接字传递的UTF8字符串

我如何使用阈值条件来设置 Pandas 多级索引数据框列的样式？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论