如何更快地迭代这两个非常大的警报数据框?

huangapple go评论68阅读模式
英文:

How can I iterate over these two very large alarm dataframes faster?

问题

我有两个pandas数据框。一个是报警条件的起始事件,另一个是报警条件的结束事件。每个事件都有一个报警ID和消息时间,而停止数据框中有一个终止类型的值。对于起始事件中的每个事件,我想要添加下一个相同报警ID的停止事件以及该停止事件的终止类型。我还想计算起始事件和停止事件之间的报警活动秒数。一些起始事件可能没有相应的停止事件,一些起始事件可能有同时发生的停止事件(在这种情况下,报警活动秒数为零),并且多个起始事件可能共享相同的停止事件。

def alarmActiveTime(start, stop):
    i = 0
    while i < len(start):
        j = stop['Message Time'].searchsorted([start.iloc[i]['Message Time']], side='left')[0]
        match_found = False
        while (j < len(stop)) and (match_found == False):
            if ((stop.iloc[j]['Message Time'] >= start.iloc[i]['Message Time']) and (stop.iloc[j]['Alarm ID'] == start.iloc[i]['Alarm ID'])):
                start.at[i, 'Terminated Time'] = stop.iloc[j]['Message Time']
                start.at[i, 'Termination Type'] = stop.iloc[j]['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i]['Terminated Time'] - start.iloc[i]['Message Time']).total_seconds())
                match_found = True
                j = len(stop)
            else:
                j += 1
        if not match_found:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) or (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.')
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0))
            print('')
        i += 1 
    return start

我有大约57万个起始事件和48万个停止事件,所以运行时间很长。我正在寻找提高这个循环运行时间的方法。

首先,我尝试通过报警ID筛选停止数据框。

def alarmActiveTime(start, stop):
    print('CALCULATING ACTIVE TIME OF ALARM...')
    i = 0
    while i < len(start):
        thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i]['Alarm ID'])]
        j = thisStop['Message Time'].searchsorted([start.iloc[i]['Message Time']], side='left')[0]
        match_found = False
        while (j < len(thisStop)) and (match_found == False):
            if ((thisStop.iloc[j]['Message Time'] >= start.iloc[i]['Message Time']) and (thisStop.iloc[j]['Alarm ID'] == start.iloc[i]['Alarm ID'])):
                start.at[i, 'Terminated Time'] = thisStop.iloc[j]['Message Time']
                start.at[i, 'Termination Type'] = thisStop.iloc[j]['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i]['Terminated Time'] - start.iloc[i]['Message Time']).total_seconds())
                match_found = True
                j = len(thisStop)
            else:
                j += 1
        if not match_found:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) or (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.')
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0))
            print('')
        i += 1 
    print("%s HOURS ELAPSED." % round(((time.time() - start_time)/3600), 2))
    return start

我还尝试通过报警ID和消息时间筛选停止数据框,然后如果结果大于0,则返回第一个元素。

def alarmActiveTime(start, stop):
    print('CALCULATING ACTIVE TIME OF ALARM...')
    i = 0
    while i < len(start):
        thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i]['Alarm ID'])]
        thisStop = thisStop.loc[thisStop['Message Time'] >= start.iloc[i]['Message Time']]
        if (len(thisStop) > 0):
                start.at[i, 'Terminated Time'] = thisStop.iloc[0]['Message Time']
                start.at[i, 'Termination Type'] = thisStop.iloc[0]['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i]['Terminated Time'] - start.iloc[i]['Message Time']).total_seconds())
        else:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) or (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.')
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0))
            print('')
        i += 1 
    return start

这些方法都比在未筛选的停止日志中搜索要快,但当我计时前20000次循环左右时,它们的速度大致相同。我认为随着列表的处理,通过消息时间筛选可能会产生更大的影响。是否还有其他方法可以加快这种循环的运行速度,还是必须要花这么长时间进行这么多比较?

编辑:
我在起始数据框中有14274个唯一的报警ID值。
当我尝试合并数据框时,出现了数组内存错误。
我无法发布报警日志的屏幕截图,但我会在有了一些可以发布的遮蔽内容后发布另一个更新。

英文:

I have two pandas DataFrames. One is alarm condition starting events, and the other is alarm condition ending events. Each event has an Alarm ID and a Message Time, and the stop DataFrame has a value for Termination Type. For each event in start, I want to add the next time the same Alarm ID registers a stop and the Termination Type for that stop event. I also want to calculate the Alarm Active Seconds between the start event and the stop event. Some start events may not have a corresponding stop event, some start events may have a simultaneous stop event (in which case the Alarm Active Seconds is zero), and more than one start event may share the same stop event.

def alarmActiveTime(start, stop):
    i = 0
    while i &lt; len(start):
        j = stop[&#39;Message Time&#39;].searchsorted([start.iloc[i].loc[&#39;Message Time&#39;]], side=&#39;left&#39;)[0]
        match_found = False
        while (j &lt; len(stop)) &amp; (match_found == False):
            if ((stop.iloc[j].loc[&#39;Message Time&#39;] &gt;= start.iloc[i].loc[&#39;Message Time&#39;]) &amp; (stop.iloc[j].loc[&#39;Alarm ID&#39;] == start.iloc[i].loc[&#39;Alarm ID&#39;])):
                start.at[i, &#39;Terminated Time&#39;] = stop.iloc[j].loc[&#39;Message Time&#39;]
                start.at[i, &#39;Termination Type&#39;] = stop.iloc[j].loc[&#39;Termination Type&#39;]
                start.at[i, &#39;Alarm Active Seconds&#39;] = int((start.iloc[i].loc[&#39;Terminated Time&#39;] - start.iloc[i].loc[&#39;Message Time&#39;]).total_seconds())
                match_found = True
                j = len(stop)
            else:
                j += 1
        if not match_found:
            start.at[i, &#39;Terminated Time&#39;] = &#39;undefined&#39;
            start.at[i, &#39;Alarm Active Seconds&#39;] = &#39;undefined&#39;
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f&#39;{i} OF {len(start)} COMPLETE.&#39;) 
            print(&quot;%s SECONDS ELAPSED.&quot; % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print(&#39;&#39;)
        i += 1 
    return start

I have about 570,000 start events and 480,000 stop events, so it takes hours to run. I'm looking for ways to improve the run time of this loop.

First, I tried filtering the stop DataFrame by Alarm ID.

def alarmActiveTime(start, stop):
    print(&#39;CALCULATING ACTIVE TIME OF ALARM...&#39;)
    i = 0
    while i &lt; len(start):
        thisStop = stop.loc[(stop[&#39;Alarm ID&#39;] == start.iloc[i].loc[&#39;Alarm ID&#39;])]
        j = thisStop[&#39;Message Time&#39;].searchsorted([start.iloc[i].loc[&#39;Message Time&#39;]], side=&#39;left&#39;)[0]
        match_found = False # Flag variable to track if a match is found
        while (j &lt; len(thisStop)) &amp; (match_found == False):
            if ((thisStop.iloc[j].loc[&#39;Message Time&#39;] &gt;= start.iloc[i].loc[&#39;Message Time&#39;]) &amp; (thisStop.iloc[j].loc[&#39;Alarm ID&#39;] == start.iloc[i].loc[&#39;Alarm ID&#39;])):
                start.at[i, &#39;Terminated Time&#39;] = thisStop.iloc[j].loc[&#39;Message Time&#39;]
                start.at[i, &#39;Termination Type&#39;] = thisStop.iloc[j].loc[&#39;Termination Type&#39;]
                start.at[i, &#39;Alarm Active Seconds&#39;] = int((start.iloc[i].loc[&#39;Terminated Time&#39;] - start.iloc[i].loc[&#39;Message Time&#39;]).total_seconds())
                match_found = True
                j = len(thisStop)
            else:
                j += 1
        if not match_found:
            start.at[i, &#39;Terminated Time&#39;] = &#39;undefined&#39;
            start.at[i, &#39;Alarm Active Seconds&#39;] = &#39;undefined&#39;
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f&#39;{i} OF {len(start)} COMPLETE.&#39;) 
            print(&quot;%s SECONDS ELAPSED.&quot; % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print(&#39;&#39;)
        i += 1 
    print(&quot;%s HOURS ELAPSED.&quot; % round(((time.time() - start_time)/3600), 2))
    return start

I also tried filtering the stop DataFrame by Alarm ID and Message Time, and then returning the first element if the result was larger than 0.

def alarmActiveTime(start, stop):
    print(&#39;CALCULATING ACTIVE TIME OF ALARM...&#39;)
    i = 0
    while i &lt; len(start):
        thisStop = stop.loc[(stop[&#39;Alarm ID&#39;] == start.iloc[i].loc[&#39;Alarm ID&#39;])]
        thisStop = thisStop.loc[thisStop[&#39;Message Time&#39;] &gt;= start.iloc[i].loc[&#39;Message Time&#39;]]
        if (len(thisStop) &gt; 0):
                start.at[i, &#39;Terminated Time&#39;] = thisStop.iloc[0].loc[&#39;Message Time&#39;]
                start.at[i, &#39;Termination Type&#39;] = thisStop.iloc[0].loc[&#39;Termination Type&#39;]
                start.at[i, &#39;Alarm Active Seconds&#39;] = int((start.iloc[i].loc[&#39;Terminated Time&#39;] - start.iloc[i].loc[&#39;Message Time&#39;]).total_seconds())
        else:
            start.at[i, &#39;Terminated Time&#39;] = &#39;undefined&#39;
            start.at[i, &#39;Alarm Active Seconds&#39;] = &#39;undefined&#39;
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f&#39;{i} OF {len(start)} COMPLETE.&#39;) 
            print(&quot;%s SECONDS ELAPSED.&quot; % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print(&#39;&#39;)
        i += 1 
    return start

These methods were both faster than searching through an unfiltered stop log, but they were roughly as fast as each other when I timed the first 20,000 loops or so. I presume that filtering by Message Time could have a greater impact as I work through the list. Is there anything else I could do to run this kind of loop faster, or is it just going to take that long to do this many comparisons?

Edit:
I have 14274 unique alarm ID values in the start dataframe.
When I tried to merge the dataframes together, I got an array memory error.
I cannot post a screenshot of the alarm logs, but will post another update after I've got something masked that I can post.

答案1

得分: 0

正如@Pep_8_Guardiola所说,在大多数情况下,你不应该遍历pandas数据框。你说两者都有AlarmID。所以我的第一个建议是通过AlarmID将两个数据框合并合并在一起。这样,每一行都会包含开始时间、结束时间以及两个数据框中的所有其他列。

从那里,你应该更容易进行所需的计算。例如:

df['alarm_total_active_time'] = df['alarm_stop_time'] - df['alarm_start_time']

我不太确定之前的代码是否适用于你的数据,因为我不知道它是否以datetime格式存在或者是哪种DTYPE,但类似的方法应该能够工作。(也许可以使用Pandas.Timestamp的Timedelta方法来帮助)

要得到更详细的答案,你应该提供有关你的数据框的更多信息。搜索pandas文档以获取有关如何处理日期和时间戳以及如何在数据框中执行操作的更多信息。正如之前提到的,通常遍历数据框是解决你所面临问题的最差解决方案。

英文:

As @Pep_8_Guardiola said, in most cases you shouldn't iterate over pandas dataframes. You said that both have AlarmID. So my first suggestion would be to Merge both dataframes by AlarmID. By doing this, you would have in each row both the start time, stop time, and all other columns in both DataFrames.

From that it should be way easier to do the calculations you need. For example:

df[&#39;alarm_total_active_time&#39;] = df[&#39;alarm_stop_time&#39;] - df[&#39;alarm_start_time&#39;] 

I'm not entirely sure the previous code works with the data you have, because I don't know if it is in datetime format or which DTYPE, but something similar should work. (Perhaps using the Timedelta method from Pandas.Timestamp could help)

For a more complete answer, you should provide more information on the dataframe you have. Search the pandas documentation for more information on how to handle dates and timestamps and how to perform operations inside the dataframe. As mentioned already, usually iterating over the DF is the worst solution for the problem you have.

答案2

得分: 0

在起始数据框中,我有14274个唯一的告警ID值。当我尝试合并这些数据框时,出现了数组内存错误。由于数据属于机密信息,我无法发布告警日志的截图。然而,我找到了一个解决方案,将函数的运行时间从大约7小时减少到大约30分钟。我不指望需要经常运行这个函数,所以这个运行时间是可以接受的。

首先,我基于'Alarm ID'对数据框进行了筛选。我尝试合并筛选后的数据框,但这没有生成我想要的输出。我使用了apply.lambda()来生成我所需变量的元组。然后,我基于这个元组定义了两个新列,并将筛选后的数据框连接到输出中。

def alarmActiveTime(start, stop):
    output = pd.DataFrame()
    listIDs = start['Alarm ID'].unique().tolist()
    for x in listIDs:
        def termination(time): 
            thisDF = thisStop[(thisStop['Message Time'] >= time)]
            timeList = thisDF['Message Time'].tolist()
            typeList = thisDF['Termination Type'].tolist()
            if len(timeList) > 0:
                timeOutput = timeList[0]
                typeOutput = typeList[0]
            else:
                timeOutput = 'Undefined'
                typeOutput = 'Undefined'
            termTuple = (timeOutput, typeOutput)
            return termTuple
        thisStart = start.loc[start['Alarm ID'] == x]
        thisStop = stop.loc[stop['Alarm ID'] == x]
        thisStart['Termination Tuple'] = thisStart.apply(lambda x: termination(x['Message Time']), axis=1)
        thisStart['Termination Time'] = thisStart['Termination Tuple'].apply(lambda x: x[0])
        thisStart['Termination Type'] = thisStart['Termination Tuple'].apply(lambda x: x[1])
        thisStart.drop('Termination Tuple', axis=1, inplace=True)
        output = pd.concat([output, thisStart])
    return output

我相信这个代码可以进一步优化,但这个版本看起来已经可以处理了。感谢所有提供反馈的人。如果我漏掉了什么,请告诉我。

英文:

I have 14274 unique alarm ID values in the start dataframe. When I tried to merge the dataframes together, I got an array memory error. I couldn't post a screenshot of the alarm logs because the data is confidential. However, I was able to find a solution that cut the run time of the function from ~7 hours to ~30 minutes. I don't expect I will have to run this function very often, so this run time is acceptable.

First I filtered the dataframe based on 'Alarm ID'. I tried merging the filtered dataframes, but that didn't generate the output I was looking for. I used apply.lambda() to generate a tuple of the variables I was looking for. Then I defined two new columns based on the tuple and concatenated the filtered dataframe to the output.

def alarmActiveTime(start, stop):
    output = pd.DataFrame()
    listIDs = start[&#39;Alarm ID&#39;].unique().tolist()
    for x in listIDs:
        def termination(time): #This was broken out into two different functions at first, but by returning a tuple we can get both important values at once without having to run essentially the same loop twice
            thisDF = thisStop[(thisStop[&#39;Message Time&#39;]&gt;= time)]
            timeList = thisDF[&#39;Message Time&#39;].tolist()
            typeList = thisDF[&#39;Termination Type&#39;].tolist()
            if len(timeList) &gt; 0:
                timeOutput = timeList[0]
                typeOutput = typeList[0]
            else:
                timeOutput = &#39;Undefined&#39;
                typeOutput = &#39;Undefined&#39;
            termTuple = (timeOutput, typeOutput)
            return termTuple
        thisStart = start.loc[start[&#39;Alarm ID&#39;] == x]
        thisStop = stop.loc[stop[&#39;Alarm ID&#39;] == x]
        thisStart[&#39;Termination Tuple&#39;] = thisStart.apply(lambda x: termination(x[&#39;Message Time&#39;]), axis=1)
        thisStart[&#39;Termination Time&#39;] = thisStart[&#39;Termination Tuple&#39;].apply(lambda x: x[0])
        thisStart[&#39;Termination Type&#39;] = thisStart[&#39;Termination Tuple&#39;].apply(lambda x: x[1])
        thisStart.drop(&#39;Termination Tuple&#39;, axis=1, inplace=True)
        output = pd.concat([output, thisStart])
    return output

I'm sure this could be further optimized, but this seemed like it was manageable. Thank you everyone who gave feedback. Let me know if I'm missing something.

huangapple
  • 本文由 发表于 2023年6月12日 21:38:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76457231.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定