Pandas Group by id column and hourly datetime with missing hours

Question

I have a DataFrame with the following columns: 'Id', 'Date', and 'Number'. I need to group the data by Id and aggregate Number hourly. I also need the hours with no data between the min and max datetimes of each Id to be present in the result.

So for the toy dataset:

Id    Date                    Number
1     01-01-2022 00:00:00     1
1     01-01-2022 00:25:00     3
1     01-01-2022 01:00:10     1
2     01-01-2022 00:00:01     4
2     01-01-2022 03:01:01     2

I would expect to get:

Id    Date                    Number
1     01-01-2022 00:00:00     4
1     01-01-2022 01:00:00     1
2     01-01-2022 00:00:00     4
2     01-01-2022 01:00:00     NaN
2     01-01-2022 02:00:00     NaN
2     01-01-2022 03:00:00     2

I tried it with groupby and Grouper (as shown below), but it results in missing hours.

agg = {'Number': 'sum'} #example
data = data.groupby(['Id', pd.Grouper(key='Date', freq='1H')]).agg(agg)

It seems simple, but I cannot get it to work. What am I missing?

Answer 1

Score: 1

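You can chain .groupby().resample() if you set the Date column (which must have a datetime dtype) as the index. A plain sum() reports the empty hours as 0, as in the output below; sum(min_count=1) yields NaN for them instead: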

df.set_index('Date').groupby('Id').resample('1h')['Number'].sum()

Output:

Id  Date               
1   2022-01-01 00:00:00    4
    2022-01-01 01:00:00    1
2   2022-01-01 00:00:00    4
    2022-01-01 01:00:00    0
    2022-01-01 02:00:00    0
    2022-01-01 03:00:00    2
Name: Number, dtype: int64
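
For completeness, here is a self-contained sketch of the same approach built on the toy data from the question. It is not the answerer's exact code: the day-month-year date format and the name `hourly` are assumptions made for illustration, and `min_count=1` is added so that hours with no rows come out as NaN (as the question asks) rather than 0.

```python
import pandas as pd

# Toy data from the question.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Date": [
        "01-01-2022 00:00:00",
        "01-01-2022 00:25:00",
        "01-01-2022 01:00:10",
        "01-01-2022 00:00:01",
        "01-01-2022 03:01:01",
    ],
    "Number": [1, 3, 1, 4, 2],
})

# Assumption: the strings are day-month-year; adjust the format if not.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y %H:%M:%S")

hourly = (
    df.set_index("Date")      # resample needs a DatetimeIndex
      .groupby("Id")          # one hourly range per Id (its own min..max)
      .resample("1h")["Number"]
      .sum(min_count=1)       # empty hourly bins become NaN instead of 0
      .reset_index()          # back to flat Id / Date / Number columns
)

print(hourly)
```

Because a NaN has to be representable, the Number column becomes float here. If 0 is an acceptable filler for empty hours, dropping `min_count=1` keeps the integer dtype shown in the output above.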

huangapple
  • Posted on June 22, 2023, 08:19:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76527901.html