英文:
Pandas groupby time gaps
问题
我有一个名为df_merged
的数据框,输出如下所示:
index | count_date | time_gap | device | count | average_speed |
---|---|---|---|---|---|
0 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
1 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
2 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
3 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
4 | 2018-12-05 | 0 days 01:00:00 | CAT17 | 0 | 0 |
... | ... | ... | ... | ... | ... |
154747 | 2023-05-04 | 0 days 22:00:00 | CAT17 | 0 | 0 |
154748 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 4 | 16 |
154749 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 0 | 0 |
154750 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 1 | 13 |
154751 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 3 | 17 |
这是关于这个数据框的信息:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154752 entries, 0 to 154751
Data columns (total 5 columns):
Column Non-Null Count Dtype
0 count_date 154752 non-null datetime64[ns]
1 time_gap 0 non-null category
2 device 154752 non-null object
3 count 154752 non-null int32
4 average_speed 154752 non-null int32
dtypes: category(1), datetime64ns, int32(2), object(1)
memory usage: 3.7+ MB
time_gap
是一个类别因为我使用了pd.cut()
函数替换它。我不知道在这里是否最好更改dtype
以进行分组。
我想要按time_gap
分组,知道average_speed
必须按照以下函数的计数加权:
def average_speed_mean(x):
try:
return np.average(x["average_speed"], weights=x["count"])
except ZeroDivisionError:
return 0
我尝试过以下操作:
df_merged = df_merged.groupby("time_gap").agg({("count", sum), ("average_speed", average_speed_mean)})
但似乎不起作用,我不知道如何解决这个问题。
谢谢你的帮助。
英文:
I have a dataframe called df_merged
which outputs like below:
index | count_date | time_gap | device | count | average_speed |
---|---|---|---|---|---|
0 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
1 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
2 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
3 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
4 | 2018-12-05 | 0 days 01:00:00 | CAT17 | 0 | 0 |
... | ... | ... | ... | ... | ... |
154747 | 2023-05-04 | 0 days 22:00:00 | CAT17 | 0 | 0 |
154748 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 4 | 16 |
154749 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 0 | 0 |
154750 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 1 | 13 |
154751 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 3 | 17 |
Here is the info on this df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154752 entries, 0 to 154751
Data columns (total 5 columns):
Column Non-Null Count Dtype
0 count_date 154752 non-null datetime64[ns]
1 time_gap 0 non-null category
2 device 154752 non-null object
3 count 154752 non-null int32
4 average_speed 154752 non-null int32
dtypes: category(1), datetime64[ns](1), int32(2), object(1)
memory usage: 3.7+ MB
time_gap is a category because I replaced it with a pd.cut()
function. I don't know if it's best to change the dtype
here to groupby.
I would like to groupby time_gap
, knowing that the average_speed
has to be weighted by count with this function:
def average_speed_mean(x):
try:
return np.average(x["average_speed"], weights=x["count"])
except ZeroDivisionError:
return 0
I'm trying to have basically the same dataframe grouped like this:
index | count_date | time_gap | device | count | average_speed |
---|---|---|---|---|---|
0 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0 |
1 | 2018-12-05 | 0 days 01:00:00 | CAT17 | 0 | 0 |
... | ... | ... | ... | ... | ... |
38687 | 2023-05-04 | 0 days 22:00:00 | CAT17 | 0 | 0 |
38688 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 8 | 16 |
I tried this:
df_merged = df_merged.groupby("time_gap").agg({("count", sum), ("average_speed", average_speed_mean)})
But it doesn't seem to work out and I have no idea how I could solve this.
Thank you in advance for your help.
答案1
得分: 1
以下是您要翻译的代码部分:
groups = ["count_date", "time_gap", "device"]
result = (
df_merged
.assign(average_speed=df_merged["average_speed"] * df_merged["count"])
.groupby(groups).agg({"count": "sum", "average_speed": "sum"})
.assign(average_speed=lambda df: df["average_speed"].div(df["count"]).fillna(0))
.reset_index()
)
结果的小样本如下所示:
count_date time_gap device count average_speed
0 2018-12-05 0 days 00:00:00 CAT17 0 0.0
1 2018-12-05 0 days 01:00:00 CAT17 0 0.0
2 2023-05-04 0 days 22:00:00 CAT17 0 0.0
3 2023-05-04 0 days 23:00:00 CAT17 8 16.0
请注意,我只提供了代码和结果的翻译,没有其他内容。
英文:
Does the following give you the result you want:
groups = ["count_date", "time_gap", "device"]
result = (
df_merged
.assign(average_speed=df_merged["average_speed"] * df_merged["count"])
.groupby(groups).agg({"count": "sum", "average_speed": "sum"})
.assign(average_speed=lambda df: df["average_speed"].div(df["count"]).fillna(0))
.reset_index()
)
groups
is the list of columns I think you want to group over - adjust it, if that's not the case.- Multiply the
average_speed
by thecount
. - Now group over the
groups
and sum up thecount
s andaverage_speed
s per group. - Then divide the summed up
average_speed
by the total count and fill thenNaN
with0
. - Finally reset the index to get the group columns back.
Result for the small sample in the question is:
count_date time_gap device count average_speed
0 2018-12-05 0 days 00:00:00 CAT17 0 0.0
1 2018-12-05 0 days 01:00:00 CAT17 0 0.0
2 2023-05-04 0 days 22:00:00 CAT17 0 0.0
3 2023-05-04 0 days 23:00:00 CAT17 8 16.0
答案2
得分: 0
谢谢,它运行得非常好,我觉得你的解决方案很简洁,可以逐步理解它。以下是我得到的输出:
索引 | 计数日期 | 时间间隔 | 设备 | 计数 | 平均速度 |
---|---|---|---|---|---|
0 | 2018-12-05 | 0天00:00:00 | CAT17 | 0 | 0.000000 |
1 | 2018-12-05 | 0天01:00:00 | CAT17 | 0 | 0.000000 |
2 | 2018-12-05 | 0天02:00:00 | CAT17 | 0 | 0.000000 |
3 | 2018-12-05 | 0天03:00:00 | CAT17 | 0 | 0.000000 |
4 | 2018-12-05 | 0天04:00:00 | CAT17 | 0 | 0.000000 |
... | ... | ... | ... | ... | ... |
38683 | 2023-05-04 | 0天19:00:00 | CAT17 | 56 | 16.339286 |
38684 | 2023-05-04 | 0天20:00:00 | CAT17 | 39 | 20.179487 |
38685 | 2023-05-04 | 0天21:00:00 | CAT17 | 14 | 16.142857 |
38686 | 2023-05-04 | 0天22:00:00 | CAT17 | 4 | 17.500000 |
38687 | 2023-05-04 | 0天23:00:00 | CAT17 | 8 | 16.000000 |
英文:
Thank you, it's working very well and I find your solution pretty neat to understand it step by step.
Here is the output I get:
index | count_date | time_gap | device | count | average_speed |
---|---|---|---|---|---|
0 | 2018-12-05 | 0 days 00:00:00 | CAT17 | 0 | 0.000000 |
1 | 2018-12-05 | 0 days 01:00:00 | CAT17 | 0 | 0.000000 |
2 | 2018-12-05 | 0 days 02:00:00 | CAT17 | 0 | 0.000000 |
3 | 2018-12-05 | 0 days 03:00:00 | CAT17 | 0 | 0.000000 |
4 | 2018-12-05 | 0 days 04:00:00 | CAT17 | 0 | 0.000000 |
... | ... | ... | ... | ... | ... |
38683 | 2023-05-04 | 0 days 19:00:00 | CAT17 | 56 | 16.339286 |
38684 | 2023-05-04 | 0 days 20:00:00 | CAT17 | 39 | 20.179487 |
38685 | 2023-05-04 | 0 days 21:00:00 | CAT17 | 14 | 16.142857 |
38686 | 2023-05-04 | 0 days 22:00:00 | CAT17 | 4 | 17.500000 |
38687 | 2023-05-04 | 0 days 23:00:00 | CAT17 | 8 | 16.000000 |
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论