英文:
How to calculate the maximum occurance in a rolling window?
问题
以下是翻译好的内容:
假设我有一个如下的数据框:
--------------------------------------------------
| 类型 | 事件ID | 事件日期 |
--------------------------------------------------
| A | 1 | 2022-02-12 |
| A | 2 | 2022-02-14 |
| A | 3 | 2022-02-14 |
| A | 4 | 2022-02-14 |
| A | 5 | 2022-02-16 |
| A | 6 | 2022-02-17 |
| A | 7 | 2022-02-19 |
| A | 8 | 2022-02-19 |
| A | 7 | 2022-02-19 |
| A | 8 | 2022-02-19 |
... ... ...
| B | 1 | 2022-02-12 |
| B | 2 | 2022-02-12 |
| B | 3 | 2022-02-13 |
... ... ...
--------------------------------------------------
这是一个不同类型的事件列表。每个事件都有一个类型、一个ID和发生日期。这只是一个示例,以帮助理解我的目标。
我想要的是 - 在给定的时间范围内,例如5天 - 滚动累积这些事件的最大值会变成多少:
所以我将从落入前5天的所有元素开始,累积事件发生次数:6。
2022-02-12 - 2022-02-17: 6
通过从一天开始滚动窗口,第一天的所有元素都从总数中消除,这种情况下是-1,下一天也没有添加任何元素。下一个值将是5。
2022-02-13 - 2022-02-18: 5
6 > 5。因此,6仍然是5天窗口内事件发生的最大次数。
继续进行整个时间范围。
这并不难实现,但如何以非常高效的方式处理数百万个元素呢?简而言之:我想创建一个固定日期范围(例如5天)的移动窗口,计算此窗口内的所有事件发生次数,并输出达到的最大值。
英文:
Say I have a data frame as follows:
--------------------------------------------------
| Type | Incident ID | Date of incident|
--------------------------------------------------
| A | 1 | 2022-02-12 |
| A | 2 | 2022-02-14 |
| A | 3 | 2022-02-14 |
| A | 4 | 2022-02-14 |
| A | 5 | 2022-02-16 |
| A | 6 | 2022-02-17 |
| A | 7 | 2022-02-19 |
| A | 8 | 2022-02-19 |
| A | 7 | 2022-02-19 |
| A | 8 | 2022-02-19 |
... ... ...
| B | 1 | 2022-02-12 |
| B | 2 | 2022-02-12 |
| B | 3 | 2022-02-13 |
... ... ...
--------------------------------------------------
This is a list of different types of incidents. Every incident has a type, an id and a date, at which it occurred. This is just an example to help understand my goal.
What I want is - for a given range, e.g. 5 days - the maximum value that a rolling sum over these incidents would become:
So I would start with all elements that fall into the first 5 days and accumulate the occurences: 6.
2022-02-12 - 2022-02-17: 6
By starting to roll the window by one day, all elements of the first day get eliminated from the sum, in this case -1 and no element for the next day in line gets added. The next value would be 5.
2022-02-13 - 2022-02-18: 5
6 > 5. So 6 is still the maximum occurence of incidents in a 5 day window.
Continue for the complete time range.
This is not that hard to achieve but how would I do this in a very efficient manner for millions of elements? In short: I want to create a moving window of a fixed date range (e.g. 5 days), count all occurances for this window and give out the maximum value that was reached for any window.
答案1
得分: 1
我已经进行了一些研究,似乎pd.rolling(window=5)
在处理大型数据集,特别是多个列的情况下成本相对较高。
然而,我认为pd.Grouper()
是你所需要的。
这是我编写的代码片段:
import pandas as pd
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'type': ['A','A','A','A','B','B','B','B','C','C','C','C','C','C'],
'time': [
'2022-02-12', '2022-02-13',
'2022-02-14', '2022-02-14',
'2022-02-14', '2022-02-14',
'2022-02-17', '2022-02-17',
'2022-02-18', '2022-02-19',
'2022-02-21', '2022-02-21',
'2022-02-22', '2022-02-22']
}
test = pd.DataFrame(data).astype({'time': 'datetime64[ns]'})
#rollingg = test.rolling(window=5)
#(
# test.assign(result= rollingg.ID.count())
#)
# 在这里,我们将每5天的数据分组在一起,然后计数,如果需要考虑类型,请添加到groupby中
(
test
.groupby([pd.Grouper(key='time', freq='5D', closed='left')])
.agg(counted=pd.NamedAgg(column='ID', aggfunc='count'))
)
希望这对你有帮助!
英文:
I have done some research and it seems that pd.rolling(window=5)
is quite costly when it comes to big datasets and especially on multiple columns.
However, I believe that pd.Grouper()
is what you need.
here is the snippet code I did
import pandas as pd
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'type': ['A','A','A','A','B','B','B','B','C','C','C','C','C','C'],
'time': [
'2022-02-12', '2022-02-13',
'2022-02-14', '2022-02-14',
'2022-02-14', '2022-02-14',
'2022-02-17', '2022-02-17',
'2022-02-18', '2022-02-19',
'2022-02-21', '2022-02-21',
'2022-02-22', '2022-02-22']
}
test = pd.DataFrame(data).astype({'time': 'datetime64[ns]'})
#rollingg = test.rolling(window=5)
#(
# test.assign(result= rollingg.ID.count())
#)
# Here, we group every 5 days together and we count, if you need the type
# into account, just add it to the groupby
(
test
.groupby([pd.Grouper(key='time', freq='5D', closed='left')])
.agg(counted=pd.NamedAgg(column='ID', aggfunc='count'))
)
I hope this helps!
答案2
得分: 1
你可以使用 pivot_table
计算密集矩阵(日期,类型),然后使用 resample
填充缺失的日期。最后,沿着索引轴应用滚动求和:
df['Date of incident'] = pd.to_datetime(df['Date of incident'])
out = (df.pivot_table(index='Date of incident', columns='Type',
values='Incident ID', aggfunc='count')
.resample('D').sum().rolling('5D', closed='both').sum().astype(int))
编辑: 我认为 pd.crosstab
可能比 pd.pivot_table
更快:
out = (pd.crosstab(df['Date of incident'], df['Type'])
.resample('D').sum().rolling('5D', closed='both').sum().astype(int))
输出:
>>> out
Type A B
Date of incident
2022-02-12 1 2
2022-02-13 1 3
2022-02-14 4 3
2022-02-15 4 3
2022-02-16 5 3
2022-02-17 6 3 # A: 在 2022-02-12 到 2022-02-17 之间的总数为 6
2022-02-18 5 1 # A: 在 2022-02-13 到 2022-02-18 之间的总数为 5
2022-02-19 9 0
注意: 计算所有组合可能会是一个耗时的过程。
有了这个形状,你可以轻松地绘制你的数据:
out.plot(figsize=(6, 4), title='Rolling count (5 days)',
ylabel='Number of incident', xlabel='Date')
plt.tight_layout()
plt.show()
英文:
You can use pivot_table
to compute the dense matrix (Date, Type) then resample
to fill missing dates. Finally apply a rolling sum along index axis:
df['Date of incident'] = pd.to_datetime(df['Date of incident'])
out = (df.pivot_table(index='Date of incident', columns='Type',
values='Incident ID', aggfunc='count')
.resample('D').sum().rolling('5D', closed='both').sum().astype(int))
EDIT: I think pd.crosstab
may be faster than pd.pivot_table
:
out = (pd.crosstab(df['Date of incident'], df['Type'])
.resample('D').sum().rolling('5D', closed='both').sum().astype(int))
Output:
>>> out
Type A B
Date of incident
2022-02-12 1 2
2022-02-13 1 3
2022-02-14 4 3
2022-02-15 4 3
2022-02-16 5 3
2022-02-17 6 3 # A: 6 between 2022-02-12 and 2022-02-17 included
2022-02-18 5 1 # A: 5 between 2022-02-13 and 2022-02-18 included
2022-02-19 9 0
Note: compute all combinations can be a heavy process.
With this shape, you can plot your data easily:
out.plot(figsize=(6, 4), title='Rolling count (5 days)',
ylabel='Number of incident', xlabel='Date')
plt.tight_layout()
plt.show()
答案3
得分: 1
以下是翻译好的代码部分:
# 示例数据
data = {"Type": list("A" * 10) + list("B" * 10),
"Incident_ID": np.arange(20),
"Date": ['2022-02-12', '2022-02-13', '2022-02-14', '2022-02-14', '2022-02-14', '2022-02-15', '2022-02-15',
'2022-02-16', '2022-02-17', '2022-02-18', '2022-02-19', '2022-02-19', '2022-02-19',
'2022-02-20', '2022-02-21', '2022-02-22', '2022-02-23', '2022-02-24', '2022-02-25', '2022-02-26']}
df = pd.DataFrame(data)
print(df.head())
# 步骤1
df1 = df.groupby(["Type", "Date"], as_index=False).size().rename(columns={"size": "No_of_incidents"})
print(df1.head())
# 仅针对 Type-A:
df2 = df1[df1['Type'] == "A"].assign(rolling_ID1=df1['No_of_incidents'].rolling(5).sum(),
rolling_ID2=df1.iloc[1:, :]['No_of_incidents'].rolling(5).sum().reset_index(drop=True))
# 最后,我添加了 max_incidents 列,显示 rolling_ID1 和 ID2 之间的最大值。
df2['max_incidents'] = df2[['rolling_ID1', 'rolling_ID2']].max(axis=1)
print(df2.head())
请注意,此翻译保留了代码的结构和格式,并将代码中的标识符翻译成了中文。
英文:
Here's my approach: Firstly, I have grouped the data on the basis of incident type and date and then add the incidents.
# Sample data
data={"Type":list("A"*10)+list("B"*10),
"Incident_ID":np.arange(20),
"Date":['2022-02-12', '2022-02-13', '2022-02-14','2022-02-14','2022-02-14','2022-02-15','2022-02-15',
'2022-02-16', '2022-02-17', '2022-02-18', '2022-02-19','2022-02-19','2022-02-19',
'2022-02-20', '2022-02-21', '2022-02-22', '2022-02-23',
'2022-02-24', '2022-02-25', '2022-02-26']}
df=pd.DataFrame(data)
print(df.head())
Type Incident_ID Date
0 A 0 2022-02-12
1 A 1 2022-02-13
2 A 2 2022-02-14
3 A 3 2022-02-14
4 A 4 2022-02-14
# Step 1
df1=df.groupby(["Type","Date"],as_index=False).size().rename(columns={"size":"No_of_incidents"})
print(df1.head())
Type Date No_of_incidents
0 A 2022-02-12 1
1 A 2022-02-13 1
2 A 2022-02-14 3
3 A 2022-02-15 2
Now I have created two columns rolling_ID1(sum of first 5) and rolling_ID2(sum of next 5). The rolling_ID2 is shifted 1 step up in order to match rolling_ID1.
# Only for Type-A:
df2=df1[df1['Type']=="A"].assign(rolling_ID1=df1['No_of_incidents'].rolling(5).sum(),
rolling_ID2=df1.iloc[1:,:]['No_of_incidents'].rolling(5).sum().reset_index(drop=True))
Finally, I'm adding max_incidents column that shows max value between rolling_ID1 & ID2.
df2['max_incidents']=df2[['rolling_ID1','rolling_ID2']].max(axis=1)
print(df2.head())
Type Date No_of_incidents rolling_ID1 rolling_ID2 max_incidents
0 A 2022-02-12 1 NaN NaN NaN
1 A 2022-02-13 1 NaN NaN NaN
2 A 2022-02-14 3 NaN NaN NaN
3 A 2022-02-15 2 NaN NaN NaN
4 A 2022-02-16 1 8.0 8.0 8.0
5 A 2022-02-17 1 8.0 8.0 8.0
6 A 2022-02-18 1 8.0 8.0 8.0
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论