Pandas按时间间隔分组

huangapple go评论68阅读模式
英文:

Pandas groupby time gaps

问题

我有一个名为df_merged的数据框,输出如下所示:

index count_date time_gap device count average_speed
0 2018-12-05 0 days 00:00:00 CAT17 0 0
1 2018-12-05 0 days 00:00:00 CAT17 0 0
2 2018-12-05 0 days 00:00:00 CAT17 0 0
3 2018-12-05 0 days 00:00:00 CAT17 0 0
4 2018-12-05 0 days 01:00:00 CAT17 0 0
... ... ... ... ... ...
154747 2023-05-04 0 days 22:00:00 CAT17 0 0
154748 2023-05-04 0 days 23:00:00 CAT17 4 16
154749 2023-05-04 0 days 23:00:00 CAT17 0 0
154750 2023-05-04 0 days 23:00:00 CAT17 1 13
154751 2023-05-04 0 days 23:00:00 CAT17 3 17

这是关于这个数据框的信息:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154752 entries, 0 to 154751
Data columns (total 5 columns):
Column Non-Null Count Dtype
0 count_date 154752 non-null datetime64[ns]
1 time_gap 0 non-null category
2 device 154752 non-null object
3 count 154752 non-null int32
4 average_speed 154752 non-null int32
dtypes: category(1), datetime64ns, int32(2), object(1)
memory usage: 3.7+ MB

time_gap是一个类别因为我使用了pd.cut()函数替换它。我不知道在这里是否最好更改dtype以进行分组。

我想要按time_gap分组,知道average_speed必须按照以下函数的计数加权:

def average_speed_mean(x):
    try: 
        return np.average(x["average_speed"], weights=x["count"])
    except ZeroDivisionError:
        return 0

我尝试过以下操作:

df_merged = df_merged.groupby("time_gap").agg({("count", sum), ("average_speed", average_speed_mean)})

但似乎不起作用,我不知道如何解决这个问题。

谢谢你的帮助。

英文:

I have a dataframe called df_merged which outputs like below:

index count_date time_gap device count average_speed
0 2018-12-05 0 days 00:00:00 CAT17 0 0
1 2018-12-05 0 days 00:00:00 CAT17 0 0
2 2018-12-05 0 days 00:00:00 CAT17 0 0
3 2018-12-05 0 days 00:00:00 CAT17 0 0
4 2018-12-05 0 days 01:00:00 CAT17 0 0
... ... ... ... ... ...
154747 2023-05-04 0 days 22:00:00 CAT17 0 0
154748 2023-05-04 0 days 23:00:00 CAT17 4 16
154749 2023-05-04 0 days 23:00:00 CAT17 0 0
154750 2023-05-04 0 days 23:00:00 CAT17 1 13
154751 2023-05-04 0 days 23:00:00 CAT17 3 17

Here is the info on this df:

&lt;class &#39;pandas.core.frame.DataFrame&#39;&gt;
RangeIndex: 154752 entries, 0 to 154751
Data columns (total 5 columns):

     Column         Non-Null Count   Dtype                  
 0   count_date     154752 non-null  datetime64[ns]
 1   time_gap       0 non-null       category      
 2   device         154752 non-null  object        
 3   count          154752 non-null  int32         
 4   average_speed  154752 non-null  int32         
dtypes: category(1), datetime64[ns](1), int32(2), object(1)
memory usage: 3.7+ MB

time_gap is a category because I replaced it with a pd.cut() function. I don't know if it's best to change the dtype here to groupby.

I would like to groupby time_gap, knowing that the average_speed has to be weighted by count with this function:

def average_speed_mean(x):
    try: 
        return np.average(x[&quot;average_speed&quot;], weights=x[&quot;count&quot;])
    except ZeroDivisionError:
        return 0

I'm trying to have basically the same dataframe grouped like this:

index count_date time_gap device count average_speed
0 2018-12-05 0 days 00:00:00 CAT17 0 0
1 2018-12-05 0 days 01:00:00 CAT17 0 0
... ... ... ... ... ...
38687 2023-05-04 0 days 22:00:00 CAT17 0 0
38688 2023-05-04 0 days 23:00:00 CAT17 8 16

I tried this:

df_merged = df_merged.groupby(&quot;time_gap&quot;).agg({(&quot;count&quot;, sum), (&quot;average_speed&quot;, average_speed_mean)})

But it doesn't seem to work out and I have no idea how I could solve this.

Thank you in advance for your help.

答案1

得分: 1

以下是您要翻译的代码部分:

groups = ["count_date", "time_gap", "device"]
result = (
    df_merged
    .assign(average_speed=df_merged["average_speed"] * df_merged["count"])
    .groupby(groups).agg({"count": "sum", "average_speed": "sum"})
    .assign(average_speed=lambda df: df["average_speed"].div(df["count"]).fillna(0))
    .reset_index()
)

结果的小样本如下所示:

   count_date        time_gap device  count  average_speed
0  2018-12-05 0 days 00:00:00  CAT17      0            0.0
1  2018-12-05 0 days 01:00:00  CAT17      0            0.0
2  2023-05-04 0 days 22:00:00  CAT17      0            0.0
3  2023-05-04 0 days 23:00:00  CAT17      8           16.0

请注意,我只提供了代码和结果的翻译,没有其他内容。

英文:

Does the following give you the result you want:

groups = [&quot;count_date&quot;, &quot;time_gap&quot;, &quot;device&quot;]
result = (
    df_merged
    .assign(average_speed=df_merged[&quot;average_speed&quot;] * df_merged[&quot;count&quot;])
    .groupby(groups).agg({&quot;count&quot;: &quot;sum&quot;, &quot;average_speed&quot;: &quot;sum&quot;})
    .assign(average_speed=lambda df: df[&quot;average_speed&quot;].div(df[&quot;count&quot;]).fillna(0))
    .reset_index()
)
  • groups is the list of columns I think you want to group over - adjust it, if that's not the case.
  • Multiply the average_speed by the count.
  • Now group over the groups and sum up the counts and average_speeds per group.
  • Then divide the summed up average_speed by the total count and fill then NaN with 0.
  • Finally reset the index to get the group columns back.

Result for the small sample in the question is:

   count_date        time_gap device  count  average_speed
0  2018-12-05 0 days 00:00:00  CAT17      0            0.0
1  2018-12-05 0 days 01:00:00  CAT17      0            0.0
2  2023-05-04 0 days 22:00:00  CAT17      0            0.0
3  2023-05-04 0 days 23:00:00  CAT17      8           16.0

答案2

得分: 0

谢谢,它运行得非常好,我觉得你的解决方案很简洁,可以逐步理解它。以下是我得到的输出:

索引 计数日期 时间间隔 设备 计数 平均速度
0 2018-12-05 0天00:00:00 CAT17 0 0.000000
1 2018-12-05 0天01:00:00 CAT17 0 0.000000
2 2018-12-05 0天02:00:00 CAT17 0 0.000000
3 2018-12-05 0天03:00:00 CAT17 0 0.000000
4 2018-12-05 0天04:00:00 CAT17 0 0.000000
... ... ... ... ... ...
38683 2023-05-04 0天19:00:00 CAT17 56 16.339286
38684 2023-05-04 0天20:00:00 CAT17 39 20.179487
38685 2023-05-04 0天21:00:00 CAT17 14 16.142857
38686 2023-05-04 0天22:00:00 CAT17 4 17.500000
38687 2023-05-04 0天23:00:00 CAT17 8 16.000000
英文:

Thank you, it's working very well and I find your solution pretty neat to understand it step by step.

Here is the output I get:

index count_date time_gap device count average_speed
0 2018-12-05 0 days 00:00:00 CAT17 0 0.000000
1 2018-12-05 0 days 01:00:00 CAT17 0 0.000000
2 2018-12-05 0 days 02:00:00 CAT17 0 0.000000
3 2018-12-05 0 days 03:00:00 CAT17 0 0.000000
4 2018-12-05 0 days 04:00:00 CAT17 0 0.000000
... ... ... ... ... ...
38683 2023-05-04 0 days 19:00:00 CAT17 56 16.339286
38684 2023-05-04 0 days 20:00:00 CAT17 39 20.179487
38685 2023-05-04 0 days 21:00:00 CAT17 14 16.142857
38686 2023-05-04 0 days 22:00:00 CAT17 4 17.500000
38687 2023-05-04 0 days 23:00:00 CAT17 8 16.000000

huangapple
  • 本文由 发表于 2023年5月7日 17:30:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/76193096.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定