June 16, 2023, 03:09:07

How to calculate the maximum occurrence in a rolling window?

Question

Say I have a data frame as follows:

--------------------------------------------------
| Type       | Incident ID     | Date of incident|
--------------------------------------------------
| A          | 1               | 2022-02-12      |
| A          | 2               | 2022-02-14      |
| A          | 3               | 2022-02-14      |
| A          | 4               | 2022-02-14      |
| A          | 5               | 2022-02-16      |
| A          | 6               | 2022-02-17      |
| A          | 7               | 2022-02-19      |
| A          | 8               | 2022-02-19      |
| A          | 7               | 2022-02-19      |
| A          | 8               | 2022-02-19      |
 ...          ...               ...             
| B          | 1               | 2022-02-12      |
| B          | 2               | 2022-02-12      |
| B          | 3               | 2022-02-13      |
 ...          ...               ...             
--------------------------------------------------

This is a list of incidents of different types. Every incident has a type, an ID, and the date on which it occurred. This is just an example to help explain my goal.

What I want is, for a given range (e.g. 5 days), the maximum value that a rolling sum over these incidents reaches.

So I would start with all elements that fall into the first 5 days and count the occurrences: 6.

2022-02-12 - 2022-02-17:    6

Rolling the window forward by one day removes all elements of the first day from the sum (in this case -1), and no element is added for the next day in line. The next value would be 5.

2022-02-13 - 2022-02-18:    5

6 > 5, so 6 is still the maximum number of incidents in a 5-day window.

Continue for the complete time range.

This is not that hard to achieve, but how would I do it efficiently for millions of elements? In short: I want to create a moving window over a fixed date range (e.g. 5 days), count all occurrences in this window, and output the maximum value reached by any window.

Answer 1

Score: 1

I have done some research, and it seems that DataFrame.rolling(window=5) is quite costly on big datasets, especially across multiple columns.

However, I believe that pd.Grouper() is what you need.

Here is the snippet I wrote:

import pandas as pd

data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
        'type': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
        'time': [
                '2022-02-12', '2022-02-13',
                '2022-02-14', '2022-02-14',
                '2022-02-14', '2022-02-14',
                '2022-02-17', '2022-02-17',
                '2022-02-18', '2022-02-19',
                '2022-02-21', '2022-02-21',
                '2022-02-22', '2022-02-22']
        }
test = pd.DataFrame(data).astype({'time': 'datetime64[ns]'})

# Row-based alternative, left commented out:
# rollingg = test.rolling(window=5)
# test.assign(result=rollingg.ID.count())

# Group every 5 days together and count the rows in each bin.
# To take the type into account, add 'type' to the groupby keys.
(
   test
   .groupby([pd.Grouper(key='time', freq='5D', closed='left')])
   .agg(counted=pd.NamedAgg(column='ID', aggfunc='count'))
)

I hope this helps!
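One caveat may be worth checking: pd.Grouper with freq='5D' builds fixed, non-overlapping bins, whereas the question describes a window that moves one day at a time. The sketch below is not part of the answer. It reuses the answer's sample dates and compares the two, assuming a time-based rolling sum over per-day counts is an acceptable stand-in for the moving window. The names events, binned, daily, and rolling are illustrative only.

```python
import pandas as pd

# Same dates as the answer's sample data (one row per incident).
times = pd.to_datetime([
    '2022-02-12', '2022-02-13', '2022-02-14', '2022-02-14',
    '2022-02-14', '2022-02-14', '2022-02-17', '2022-02-17',
    '2022-02-18', '2022-02-19', '2022-02-21', '2022-02-21',
    '2022-02-22', '2022-02-22',
])
events = pd.DataFrame({'ID': range(1, 15), 'time': times})

# Fixed, non-overlapping 5-day bins (what pd.Grouper produces).
binned = events.groupby(pd.Grouper(key='time', freq='5D'))['ID'].count()

# Overlapping windows: count incidents per day, then take a time-based
# rolling sum. With the default closing, '5D' covers (t - 5 days, t].
daily = events.groupby('time')['ID'].count()
rolling = daily.rolling('5D').sum()

max_binned = int(binned.max())
max_rolling = int(rolling.max())
print(max_binned, max_rolling)
```

On this sample the fixed bins peak at 6 while the moving window peaks at 7, so a bin-based maximum can undercount when a busy stretch straddles a bin boundary.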

Answer 2

Score: 1

You can use pivot_table to compute the dense (Date, Type) matrix, then resample to fill in missing dates. Finally, apply a rolling sum along the index axis:

import pandas as pd

df['Date of incident'] = pd.to_datetime(df['Date of incident'])
out = (df.pivot_table(index='Date of incident', columns='Type',
                      values='Incident ID', aggfunc='count')
         .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

EDIT: I think pd.crosstab may be faster than pd.pivot_table:

out = (pd.crosstab(df['Date of incident'], df['Type'])
         .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

Output:

>>> out
Type              A  B
Date of incident
2022-02-12        1  2
2022-02-13        1  3
2022-02-14        4  3
2022-02-15        4  3
2022-02-16        5  3
2022-02-17        6  3  # A: 6 between 2022-02-12 and 2022-02-17 inclusive
2022-02-18        5  1  # A: 5 between 2022-02-13 and 2022-02-18 inclusive
2022-02-19        9  0

Note: computing all combinations can be a heavy process.

With this shape, you can plot your data easily:

import matplotlib.pyplot as plt

out.plot(figsize=(6, 4), title='Rolling count (5 days)',
         ylabel='Number of incident', xlabel='Date')
plt.tight_layout()
plt.show()
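The question ultimately asks for the single largest windowed count, which the rolling table above does not report directly. A minimal sketch of that last step follows, assuming only the rows visible in the question's table (the elided "..." rows are left out) and using the crosstab variant. max_per_type and peak_day are illustrative names, not part of the answer.

```python
import pandas as pd

# Visible rows of the question's table; the elided "..." rows are omitted.
df = pd.DataFrame({
    'Type': ['A'] * 10 + ['B'] * 3,
    'Date of incident': pd.to_datetime(
        ['2022-02-12'] + ['2022-02-14'] * 3 + ['2022-02-16', '2022-02-17']
        + ['2022-02-19'] * 4 + ['2022-02-12', '2022-02-12', '2022-02-13']),
})

# Same pipeline as the answer: daily counts per type, then a rolling sum.
out = (pd.crosstab(df['Date of incident'], df['Type'])
         .resample('D').sum()
         .rolling('5D', closed='both').sum()
         .astype(int))

max_per_type = out.max()   # highest windowed count for each type
peak_day = out.idxmax()    # end date of the first window reaching that peak
print(max_per_type)
print(peak_day)
```

Because the reduction runs column-wise over the already-computed table, it adds little cost on top of the rolling sum itself.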

Answer 3

Score: 1

Here's my approach. First, I group the data by incident type and date and count the incidents in each group.

import numpy as np
import pandas as pd

# Sample data
data = {"Type": list("A" * 10) + list("B" * 10),
        "Incident_ID": np.arange(20),
        "Date": ['2022-02-12', '2022-02-13', '2022-02-14', '2022-02-14', '2022-02-14', '2022-02-15', '2022-02-15',
                 '2022-02-16', '2022-02-17', '2022-02-18', '2022-02-19', '2022-02-19', '2022-02-19',
                 '2022-02-20', '2022-02-21', '2022-02-22', '2022-02-23',
                 '2022-02-24', '2022-02-25', '2022-02-26']}
df = pd.DataFrame(data)
print(df.head())
  Type  Incident_ID        Date
0    A            0  2022-02-12
1    A            1  2022-02-13
2    A            2  2022-02-14
3    A            3  2022-02-14
4    A            4  2022-02-14
# Step 1: count incidents per (Type, Date)
df1 = df.groupby(["Type", "Date"], as_index=False).size().rename(columns={"size": "No_of_incidents"})
print(df1.head())
  Type        Date  No_of_incidents
0    A  2022-02-12                1
1    A  2022-02-13                1
2    A  2022-02-14                3
3    A  2022-02-15                2

Next, I create two columns: rolling_ID1 (the sum of the first 5 rows) and rolling_ID2 (the sum of the next 5 rows). rolling_ID2 is shifted one step up so that it lines up with rolling_ID1.

# Only for Type A. Note that rolling(5) spans 5 rows, not 5 calendar days.
df2 = df1[df1['Type'] == "A"].assign(
    rolling_ID1=df1['No_of_incidents'].rolling(5).sum(),
    rolling_ID2=df1.iloc[1:, :]['No_of_incidents'].rolling(5).sum().reset_index(drop=True))

Finally, I add a max_incidents column that shows the larger of rolling_ID1 and rolling_ID2.

df2['max_incidents'] = df2[['rolling_ID1', 'rolling_ID2']].max(axis=1)
print(df2.head())
  Type        Date  No_of_incidents  rolling_ID1  rolling_ID2  max_incidents
0    A  2022-02-12                1          NaN          NaN            NaN
1    A  2022-02-13                1          NaN          NaN            NaN
2    A  2022-02-14                3          NaN          NaN            NaN
3    A  2022-02-15                2          NaN          NaN            NaN
4    A  2022-02-16                1          8.0          8.0            8.0
5    A  2022-02-17                1          8.0          8.0            8.0
6    A  2022-02-18                1          8.0          8.0            8.0
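The row-based rolling(5) above counts five rows rather than five calendar days, and it is computed before filtering to one type. As a hedged variation, and not the answer author's method, the sketch below runs a time-based rolling sum inside each type with groupby, so windows follow the calendar and never mix types. It reuses the answer's sample data. The names daily, rolled, and max_per_type are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as above: 10 incidents of type A, then 10 of type B.
dates = ['2022-02-12', '2022-02-13', '2022-02-14', '2022-02-14', '2022-02-14',
         '2022-02-15', '2022-02-15', '2022-02-16', '2022-02-17', '2022-02-18',
         '2022-02-19', '2022-02-19', '2022-02-19', '2022-02-20', '2022-02-21',
         '2022-02-22', '2022-02-23', '2022-02-24', '2022-02-25', '2022-02-26']
df = pd.DataFrame({'Type': list('A' * 10) + list('B' * 10),
                   'Incident_ID': np.arange(20),
                   'Date': pd.to_datetime(dates)})

# Incidents per (Type, Date); groupby sorts the keys, so dates are ordered.
daily = (df.groupby(['Type', 'Date']).size()
           .rename('No_of_incidents').reset_index())

# Time-based rolling sum within each type; '5D' covers (t - 5 days, t].
rolled = (daily.set_index('Date')
               .groupby('Type')['No_of_incidents']
               .rolling('5D').sum())

# Largest windowed count reached by each type.
max_per_type = rolled.groupby(level='Type').max()
print(max_per_type)
```

The window here is right-closed and spans five calendar days. Passing closed='both' to rolling would instead match the six-day inclusive ranges used in the question's example.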
