如何计算滚动窗口中的最大出现次数?

huangapple go评论109阅读模式
英文:

How to calculate the maximum occurance in a rolling window?

问题

以下是翻译好的内容:

假设我有一个如下的数据框:

  1. --------------------------------------------------
  2. | 类型 | 事件ID | 事件日期 |
  3. --------------------------------------------------
  4. | A | 1 | 2022-02-12 |
  5. | A | 2 | 2022-02-14 |
  6. | A | 3 | 2022-02-14 |
  7. | A | 4 | 2022-02-14 |
  8. | A | 5 | 2022-02-16 |
  9. | A | 6 | 2022-02-17 |
  10. | A | 7 | 2022-02-19 |
  11. | A | 8 | 2022-02-19 |
  12. | A | 7 | 2022-02-19 |
  13. | A | 8 | 2022-02-19 |
  14. ... ... ...
  15. | B | 1 | 2022-02-12 |
  16. | B | 2 | 2022-02-12 |
  17. | B | 3 | 2022-02-13 |
  18. ... ... ...
  19. --------------------------------------------------

这是一个不同类型的事件列表。每个事件都有一个类型、一个ID和发生日期。这只是一个示例,以帮助理解我的目标。

我想要的是 - 在给定的时间范围内,例如5天 - 滚动累积这些事件的最大值会变成多少:

所以我将从落入前5天的所有元素开始,累积事件发生次数:6。

2022-02-12 - 2022-02-17: 6

通过从一天开始滚动窗口,第一天的所有元素都从总数中消除,这种情况下是-1,下一天也没有添加任何元素。下一个值将是5。

2022-02-13 - 2022-02-18: 5

6 > 5。因此,6仍然是5天窗口内事件发生的最大次数。

继续进行整个时间范围。

这并不难实现,但如何以非常高效的方式处理数百万个元素呢?简而言之:我想创建一个固定日期范围(例如5天)的移动窗口,计算此窗口内的所有事件发生次数,并输出达到的最大值。

英文:

Say I have a data frame as follows:

  1. --------------------------------------------------
  2. | Type | Incident ID | Date of incident|
  3. --------------------------------------------------
  4. | A | 1 | 2022-02-12 |
  5. | A | 2 | 2022-02-14 |
  6. | A | 3 | 2022-02-14 |
  7. | A | 4 | 2022-02-14 |
  8. | A | 5 | 2022-02-16 |
  9. | A | 6 | 2022-02-17 |
  10. | A | 7 | 2022-02-19 |
  11. | A | 8 | 2022-02-19 |
  12. | A | 7 | 2022-02-19 |
  13. | A | 8 | 2022-02-19 |
  14. ... ... ...
  15. | B | 1 | 2022-02-12 |
  16. | B | 2 | 2022-02-12 |
  17. | B | 3 | 2022-02-13 |
  18. ... ... ...
  19. --------------------------------------------------

This is a list of different types of incidents. Every incident has a type, an id and a date, at which it occurred. This is just an example to help understand my goal.

What I want is - for a given range, e.g. 5 days - the maximum value that a rolling sum over these incidents would become:

So I would start with all elements that fall into the first 5 days and accumulate the occurences: 6.

  1. 2022-02-12 - 2022-02-17: 6

By starting to roll the window by one day, all elements of the first day get eliminated from the sum, in this case -1 and no element for the next day in line gets added. The next value would be 5.

  1. 2022-02-13 - 2022-02-18: 5

6 > 5. So 6 is still the maximum occurence of incidents in a 5 day window.

Continue for the complete time range.

This is not that hard to achieve but how would I do this in a very efficient manner for millions of elements? In short: I want to create a moving window of a fixed date range (e.g. 5 days), count all occurances for this window and give out the maximum value that was reached for any window.

答案1

得分: 1

我已经进行了一些研究,似乎pd.rolling(window=5)在处理大型数据集,特别是多个列的情况下成本相对较高。

然而,我认为pd.Grouper()是你所需要的。

这是我编写的代码片段:

  1. import pandas as pd
  2. data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
  3. 'type': ['A','A','A','A','B','B','B','B','C','C','C','C','C','C'],
  4. 'time': [
  5. '2022-02-12', '2022-02-13',
  6. '2022-02-14', '2022-02-14',
  7. '2022-02-14', '2022-02-14',
  8. '2022-02-17', '2022-02-17',
  9. '2022-02-18', '2022-02-19',
  10. '2022-02-21', '2022-02-21',
  11. '2022-02-22', '2022-02-22']
  12. }
  13. test = pd.DataFrame(data).astype({'time': 'datetime64[ns]'})
  14. #rollingg = test.rolling(window=5)
  15. #(
  16. # test.assign(result= rollingg.ID.count())
  17. #)
  18. # 在这里,我们将每5天的数据分组在一起,然后计数,如果需要考虑类型,请添加到groupby中
  19. (
  20. test
  21. .groupby([pd.Grouper(key='time', freq='5D', closed='left')])
  22. .agg(counted=pd.NamedAgg(column='ID', aggfunc='count'))
  23. )

希望这对你有帮助!

英文:

I have done some research and it seems that pd.rolling(window=5) is quite costly when it comes to big datasets and especially on multiple columns.

However, I believe that pd.Grouper() is what you need.

here is the snippet code I did

  1. import pandas as pd
  2. data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
  3. 'type': ['A','A','A','A','B','B','B','B','C','C','C','C','C','C'],
  4. 'time': [
  5. '2022-02-12', '2022-02-13',
  6. '2022-02-14', '2022-02-14',
  7. '2022-02-14', '2022-02-14',
  8. '2022-02-17', '2022-02-17',
  9. '2022-02-18', '2022-02-19',
  10. '2022-02-21', '2022-02-21',
  11. '2022-02-22', '2022-02-22']
  12. }
  13. test = pd.DataFrame(data).astype({'time': 'datetime64[ns]'})
  14. #rollingg = test.rolling(window=5)
  15. #(
  16. # test.assign(result= rollingg.ID.count())
  17. #)
  18. # Here, we group every 5 days together and we count, if you need the type
  19. # into account, just add it to the groupby
  20. (
  21. test
  22. .groupby([pd.Grouper(key='time', freq='5D', closed='left')])
  23. .agg(counted=pd.NamedAgg(column='ID', aggfunc='count'))
  24. )

如何计算滚动窗口中的最大出现次数?

I hope this helps!

答案2

得分: 1

你可以使用 pivot_table 计算密集矩阵(日期,类型),然后使用 resample 填充缺失的日期。最后,沿着索引轴应用滚动求和:

  1. df['Date of incident'] = pd.to_datetime(df['Date of incident'])
  2. out = (df.pivot_table(index='Date of incident', columns='Type',
  3. values='Incident ID', aggfunc='count')
  4. .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

编辑: 我认为 pd.crosstab 可能比 pd.pivot_table 更快:

  1. out = (pd.crosstab(df['Date of incident'], df['Type'])
  2. .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

输出:

  1. >>> out
  2. Type A B
  3. Date of incident
  4. 2022-02-12 1 2
  5. 2022-02-13 1 3
  6. 2022-02-14 4 3
  7. 2022-02-15 4 3
  8. 2022-02-16 5 3
  9. 2022-02-17 6 3 # A: 在 2022-02-12 到 2022-02-17 之间的总数为 6
  10. 2022-02-18 5 1 # A: 在 2022-02-13 到 2022-02-18 之间的总数为 5
  11. 2022-02-19 9 0

注意: 计算所有组合可能会是一个耗时的过程。

有了这个形状,你可以轻松地绘制你的数据:

  1. out.plot(figsize=(6, 4), title='Rolling count (5 days)',
  2. ylabel='Number of incident', xlabel='Date')
  3. plt.tight_layout()
  4. plt.show()

如何计算滚动窗口中的最大出现次数?

英文:

You can use pivot_table to compute the dense matrix (Date, Type) then resample to fill missing dates. Finally apply a rolling sum along index axis:

  1. df['Date of incident'] = pd.to_datetime(df['Date of incident'])
  2. out = (df.pivot_table(index='Date of incident', columns='Type',
  3. values='Incident ID', aggfunc='count')
  4. .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

EDIT: I think pd.crosstab may be faster than pd.pivot_table:

  1. out = (pd.crosstab(df['Date of incident'], df['Type'])
  2. .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

Output:

  1. >>> out
  2. Type A B
  3. Date of incident
  4. 2022-02-12 1 2
  5. 2022-02-13 1 3
  6. 2022-02-14 4 3
  7. 2022-02-15 4 3
  8. 2022-02-16 5 3
  9. 2022-02-17 6 3 # A: 6 between 2022-02-12 and 2022-02-17 included
  10. 2022-02-18 5 1 # A: 5 between 2022-02-13 and 2022-02-18 included
  11. 2022-02-19 9 0

Note: compute all combinations can be a heavy process.

With this shape, you can plot your data easily:

  1. out.plot(figsize=(6, 4), title='Rolling count (5 days)',
  2. ylabel='Number of incident', xlabel='Date')
  3. plt.tight_layout()
  4. plt.show()

如何计算滚动窗口中的最大出现次数?

答案3

得分: 1

以下是翻译好的代码部分:

  1. # 示例数据
  2. data = {"Type": list("A" * 10) + list("B" * 10),
  3. "Incident_ID": np.arange(20),
  4. "Date": ['2022-02-12', '2022-02-13', '2022-02-14', '2022-02-14', '2022-02-14', '2022-02-15', '2022-02-15',
  5. '2022-02-16', '2022-02-17', '2022-02-18', '2022-02-19', '2022-02-19', '2022-02-19',
  6. '2022-02-20', '2022-02-21', '2022-02-22', '2022-02-23', '2022-02-24', '2022-02-25', '2022-02-26']}
  7. df = pd.DataFrame(data)
  8. print(df.head())
  9. # 步骤1
  10. df1 = df.groupby(["Type", "Date"], as_index=False).size().rename(columns={"size": "No_of_incidents"})
  11. print(df1.head())
  12. # 仅针对 Type-A:
  13. df2 = df1[df1['Type'] == "A"].assign(rolling_ID1=df1['No_of_incidents'].rolling(5).sum(),
  14. rolling_ID2=df1.iloc[1:, :]['No_of_incidents'].rolling(5).sum().reset_index(drop=True))
  15. # 最后,我添加了 max_incidents 列,显示 rolling_ID1 和 ID2 之间的最大值。
  16. df2['max_incidents'] = df2[['rolling_ID1', 'rolling_ID2']].max(axis=1)
  17. print(df2.head())

请注意,此翻译保留了代码的结构和格式,并将代码中的标识符翻译成了中文。

英文:

Here's my approach: Firstly, I have grouped the data on the basis of incident type and date and then add the incidents.

  1. # Sample data
  2. data={"Type":list("A"*10)+list("B"*10),
  3. "Incident_ID":np.arange(20),
  4. "Date":['2022-02-12', '2022-02-13', '2022-02-14','2022-02-14','2022-02-14','2022-02-15','2022-02-15',
  5. '2022-02-16', '2022-02-17', '2022-02-18', '2022-02-19','2022-02-19','2022-02-19',
  6. '2022-02-20', '2022-02-21', '2022-02-22', '2022-02-23',
  7. '2022-02-24', '2022-02-25', '2022-02-26']}
  8. df=pd.DataFrame(data)
  9. print(df.head())
  10. Type Incident_ID Date
  11. 0 A 0 2022-02-12
  12. 1 A 1 2022-02-13
  13. 2 A 2 2022-02-14
  14. 3 A 3 2022-02-14
  15. 4 A 4 2022-02-14
  16. # Step 1
  17. df1=df.groupby(["Type","Date"],as_index=False).size().rename(columns={"size":"No_of_incidents"})
  18. print(df1.head())
  19. Type Date No_of_incidents
  20. 0 A 2022-02-12 1
  21. 1 A 2022-02-13 1
  22. 2 A 2022-02-14 3
  23. 3 A 2022-02-15 2

Now I have created two columns rolling_ID1(sum of first 5) and rolling_ID2(sum of next 5). The rolling_ID2 is shifted 1 step up in order to match rolling_ID1.

  1. # Only for Type-A:
  2. df2=df1[df1['Type']=="A"].assign(rolling_ID1=df1['No_of_incidents'].rolling(5).sum(),
  3. rolling_ID2=df1.iloc[1:,:]['No_of_incidents'].rolling(5).sum().reset_index(drop=True))

Finally, I'm adding max_incidents column that shows max value between rolling_ID1 & ID2.

  1. df2['max_incidents']=df2[['rolling_ID1','rolling_ID2']].max(axis=1)
  2. print(df2.head())
  3. Type Date No_of_incidents rolling_ID1 rolling_ID2 max_incidents
  4. 0 A 2022-02-12 1 NaN NaN NaN
  5. 1 A 2022-02-13 1 NaN NaN NaN
  6. 2 A 2022-02-14 3 NaN NaN NaN
  7. 3 A 2022-02-15 2 NaN NaN NaN
  8. 4 A 2022-02-16 1 8.0 8.0 8.0
  9. 5 A 2022-02-17 1 8.0 8.0 8.0
  10. 6 A 2022-02-18 1 8.0 8.0 8.0

huangapple
  • 本文由 发表于 2023年6月16日 03:09:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76484819.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定