如何计算滚动窗口中的最大出现次数?

huangapple go评论70阅读模式
英文:

How to calculate the maximum occurance in a rolling window?

问题

以下是翻译好的内容:

假设我有一个如下的数据框:

--------------------------------------------------
| 类型        | 事件ID          | 事件日期        |
--------------------------------------------------
| A          | 1               | 2022-02-12      |
| A          | 2               | 2022-02-14      |
| A          | 3               | 2022-02-14      |
| A          | 4               | 2022-02-14      |
| A          | 5               | 2022-02-16      |
| A          | 6               | 2022-02-17      |
| A          | 7               | 2022-02-19      |
| A          | 8               | 2022-02-19      |
| A          | 7               | 2022-02-19      |
| A          | 8               | 2022-02-19      |

 ...          ...               ...             

| B          | 1               | 2022-02-12      |
| B          | 2               | 2022-02-12      |
| B          | 3               | 2022-02-13      |

 ...          ...               ...             

--------------------------------------------------

这是一个不同类型的事件列表。每个事件都有一个类型、一个ID和发生日期。这只是一个示例,以帮助理解我的目标。

我想要的是 - 在给定的时间范围内,例如5天 - 滚动累积这些事件的最大值会变成多少:

所以我将从落入前5天的所有元素开始,累积事件发生次数:6。

2022-02-12 - 2022-02-17: 6

通过从一天开始滚动窗口,第一天的所有元素都从总数中消除,这种情况下是-1,下一天也没有添加任何元素。下一个值将是5。

2022-02-13 - 2022-02-18: 5

6 > 5。因此,6仍然是5天窗口内事件发生的最大次数。

继续进行整个时间范围。

这并不难实现,但如何以非常高效的方式处理数百万个元素呢?简而言之:我想创建一个固定日期范围(例如5天)的移动窗口,计算此窗口内的所有事件发生次数,并输出达到的最大值。

英文:

Say I have a data frame as follows:

--------------------------------------------------
| Type       | Incident ID     | Date of incident|
--------------------------------------------------
| A          | 1               | 2022-02-12      |
| A          | 2               | 2022-02-14      |
| A          | 3               | 2022-02-14      |
| A          | 4               | 2022-02-14      |
| A          | 5               | 2022-02-16      |
| A          | 6               | 2022-02-17      |
| A          | 7               | 2022-02-19      |
| A          | 8               | 2022-02-19      |
| A          | 7               | 2022-02-19      |
| A          | 8               | 2022-02-19      |

 ...          ...               ...             

| B          | 1               | 2022-02-12      |
| B          | 2               | 2022-02-12      |
| B          | 3               | 2022-02-13      |

 ...          ...               ...             

--------------------------------------------------

This is a list of different types of incidents. Every incident has a type, an id and a date, at which it occurred. This is just an example to help understand my goal.

What I want is - for a given range, e.g. 5 days - the maximum value that a rolling sum over these incidents would become:

So I would start with all elements that fall into the first 5 days and accumulate the occurences: 6.

2022-02-12 - 2022-02-17:    6

By starting to roll the window by one day, all elements of the first day get eliminated from the sum, in this case -1 and no element for the next day in line gets added. The next value would be 5.

2022-02-13 - 2022-02-18:    5

6 > 5. So 6 is still the maximum occurence of incidents in a 5 day window.

Continue for the complete time range.

This is not that hard to achieve but how would I do this in a very efficient manner for millions of elements? In short: I want to create a moving window of a fixed date range (e.g. 5 days), count all occurances for this window and give out the maximum value that was reached for any window.

答案1

得分: 1

我已经进行了一些研究,似乎pd.rolling(window=5)在处理大型数据集,特别是多个列的情况下成本相对较高。

然而,我认为pd.Grouper()是你所需要的。

这是我编写的代码片段:

import pandas as pd

data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 
        'type': ['A','A','A','A','B','B','B','B','C','C','C','C','C','C'],
        'time': [
                '2022-02-12', '2022-02-13',
                '2022-02-14', '2022-02-14',
                '2022-02-14', '2022-02-14',
                '2022-02-17', '2022-02-17',
                '2022-02-18', '2022-02-19',
                '2022-02-21', '2022-02-21',
                '2022-02-22', '2022-02-22']
}

test = pd.DataFrame(data).astype({'time': 'datetime64[ns]'})

#rollingg = test.rolling(window=5)
#(
#   test.assign(result= rollingg.ID.count())
#)

# 在这里,我们将每5天的数据分组在一起,然后计数,如果需要考虑类型,请添加到groupby中
(
   test
   .groupby([pd.Grouper(key='time', freq='5D', closed='left')])
   .agg(counted=pd.NamedAgg(column='ID', aggfunc='count'))
) 

希望这对你有帮助!

英文:

I have done some research and it seems that pd.rolling(window=5) is quite costly when it comes to big datasets and especially on multiple columns.

However, I believe that pd.Grouper() is what you need.

here is the snippet code I did

import pandas as pd

data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 
    'type': ['A','A','A','A','B','B','B','B','C','C','C','C','C','C'],
    'time': [
            '2022-02-12', '2022-02-13',
            '2022-02-14', '2022-02-14',
            '2022-02-14', '2022-02-14',
            '2022-02-17', '2022-02-17',
            '2022-02-18', '2022-02-19',
            '2022-02-21', '2022-02-21',
            '2022-02-22', '2022-02-22']
            }
test = pd.DataFrame(data).astype({'time': 'datetime64[ns]'})

#rollingg = test.rolling(window=5)
#(
#   test.assign(result= rollingg.ID.count())
#)

# Here, we group every 5 days together and we count, if you need the type 
#  into account, just add it to the groupby
(
   test
   .groupby([pd.Grouper(key='time', freq='5D', closed='left')])
   .agg(counted=pd.NamedAgg(column='ID', aggfunc='count'))
)    

如何计算滚动窗口中的最大出现次数?

I hope this helps!

答案2

得分: 1

你可以使用 pivot_table 计算密集矩阵(日期,类型),然后使用 resample 填充缺失的日期。最后,沿着索引轴应用滚动求和:

df['Date of incident'] = pd.to_datetime(df['Date of incident'])

out = (df.pivot_table(index='Date of incident', columns='Type',
                      values='Incident ID', aggfunc='count')
         .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

编辑: 我认为 pd.crosstab 可能比 pd.pivot_table 更快:

out = (pd.crosstab(df['Date of incident'], df['Type'])
         .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

输出:

>>> out
Type              A  B
Date of incident          
2022-02-12        1  2
2022-02-13        1  3
2022-02-14        4  3
2022-02-15        4  3
2022-02-16        5  3
2022-02-17        6  3  # A: 在 2022-02-12 到 2022-02-17 之间的总数为 6
2022-02-18        5  1  # A: 在 2022-02-13 到 2022-02-18 之间的总数为 5
2022-02-19        9  0

注意: 计算所有组合可能会是一个耗时的过程。

有了这个形状,你可以轻松地绘制你的数据:

out.plot(figsize=(6, 4), title='Rolling count (5 days)',
         ylabel='Number of incident', xlabel='Date')
plt.tight_layout()
plt.show()

如何计算滚动窗口中的最大出现次数?

英文:

You can use pivot_table to compute the dense matrix (Date, Type) then resample to fill missing dates. Finally apply a rolling sum along index axis:

df['Date of incident'] = pd.to_datetime(df['Date of incident'])

out = (df.pivot_table(index='Date of incident', columns='Type',
                      values='Incident ID', aggfunc='count')
         .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

EDIT: I think pd.crosstab may be faster than pd.pivot_table:

out = (pd.crosstab(df['Date of incident'], df['Type'])
         .resample('D').sum().rolling('5D', closed='both').sum().astype(int))

Output:

>>> out
Type              A  B
Date of incident          
2022-02-12        1  2
2022-02-13        1  3
2022-02-14        4  3
2022-02-15        4  3
2022-02-16        5  3
2022-02-17        6  3  # A: 6 between 2022-02-12 and 2022-02-17 included
2022-02-18        5  1  # A: 5 between 2022-02-13 and 2022-02-18 included
2022-02-19        9  0

Note: compute all combinations can be a heavy process.

With this shape, you can plot your data easily:

out.plot(figsize=(6, 4), title='Rolling count (5 days)',
         ylabel='Number of incident', xlabel='Date')
plt.tight_layout()
plt.show()

如何计算滚动窗口中的最大出现次数?

答案3

得分: 1

以下是翻译好的代码部分:

# 示例数据
data = {"Type": list("A" * 10) + list("B" * 10),
        "Incident_ID": np.arange(20),
        "Date": ['2022-02-12', '2022-02-13', '2022-02-14', '2022-02-14', '2022-02-14', '2022-02-15', '2022-02-15',
                 '2022-02-16', '2022-02-17', '2022-02-18', '2022-02-19', '2022-02-19', '2022-02-19',
                 '2022-02-20', '2022-02-21', '2022-02-22', '2022-02-23', '2022-02-24', '2022-02-25', '2022-02-26']}
df = pd.DataFrame(data)
print(df.head())

# 步骤1
df1 = df.groupby(["Type", "Date"], as_index=False).size().rename(columns={"size": "No_of_incidents"})
print(df1.head())

# 仅针对 Type-A:
df2 = df1[df1['Type'] == "A"].assign(rolling_ID1=df1['No_of_incidents'].rolling(5).sum(),
                                      rolling_ID2=df1.iloc[1:, :]['No_of_incidents'].rolling(5).sum().reset_index(drop=True))

# 最后,我添加了 max_incidents 列,显示 rolling_ID1 和 ID2 之间的最大值。
df2['max_incidents'] = df2[['rolling_ID1', 'rolling_ID2']].max(axis=1)
print(df2.head())

请注意,此翻译保留了代码的结构和格式,并将代码中的标识符翻译成了中文。

英文:

Here's my approach: Firstly, I have grouped the data on the basis of incident type and date and then add the incidents.

# Sample data
data={"Type":list("A"*10)+list("B"*10),
"Incident_ID":np.arange(20),
"Date":['2022-02-12', '2022-02-13', '2022-02-14','2022-02-14','2022-02-14','2022-02-15','2022-02-15',
                '2022-02-16', '2022-02-17', '2022-02-18', '2022-02-19','2022-02-19','2022-02-19',
                '2022-02-20', '2022-02-21', '2022-02-22', '2022-02-23',
                '2022-02-24', '2022-02-25', '2022-02-26']}
df=pd.DataFrame(data)
print(df.head())

   Type	Incident_ID	      Date
0	A	   0	    2022-02-12
1	A	   1	    2022-02-13
2	A	   2	    2022-02-14
3	A	   3	    2022-02-14
4	A	   4	    2022-02-14

# Step 1
df1=df.groupby(["Type","Date"],as_index=False).size().rename(columns={"size":"No_of_incidents"})
print(df1.head())

  Type	      Date	No_of_incidents
0	A	2022-02-12	     1
1	A	2022-02-13	     1
2	A	2022-02-14	     3
3	A	2022-02-15	     2

Now I have created two columns rolling_ID1(sum of first 5) and rolling_ID2(sum of next 5). The rolling_ID2 is shifted 1 step up in order to match rolling_ID1.

# Only for Type-A:
df2=df1[df1['Type']=="A"].assign(rolling_ID1=df1['No_of_incidents'].rolling(5).sum(),
          rolling_ID2=df1.iloc[1:,:]['No_of_incidents'].rolling(5).sum().reset_index(drop=True))

Finally, I'm adding max_incidents column that shows max value between rolling_ID1 & ID2.

df2['max_incidents']=df2[['rolling_ID1','rolling_ID2']].max(axis=1)
print(df2.head())

  Type	      Date	No_of_incidents	rolling_ID1	rolling_ID2	max_incidents
0	A	2022-02-12	       1	            NaN	       NaN	      NaN
1	A	2022-02-13	       1	            NaN        NaN	      NaN
2	A	2022-02-14	       3	            NaN	       NaN	      NaN
3	A	2022-02-15	       2	            NaN   	   NaN	      NaN
4	A	2022-02-16	       1	            8.0	       8.0	      8.0
5	A	2022-02-17	       1	            8.0	       8.0	      8.0
6	A	2022-02-18	       1	            8.0	       8.0	      8.0

huangapple
  • 本文由 发表于 2023年6月16日 03:09:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76484819.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定