Check N number of columns where same condition met between consecutive dates and return number and ID of columns for each group

Question

              Date   MAC     SENS_1   SENS_2     SENS_3      SENS_4     SENS_N
    0   2023-06-08  MAC1  79.608346  64.2030  66.181849  149.590573  12.135441
    1   2023-06-09  MAC1  80.655564  64.6430  67.636943  109.034809  12.278676
    2   2023-06-08  MAC2   5.323341   4.7814  57.231430   51.422297   0.494853
    3   2023-06-09  MAC2   5.375561   4.7788  63.701169   88.638871   0.471265
    4   2023-06-08  MAC3   3.221366   3.0242  61.071023  113.812544   0.251471
    5   2023-06-09  MAC3   3.277577   3.0300  67.280632  133.127433   0.274765
    6   2023-06-08  MAC4   3.029806   2.6720  63.310941   72.074345   0.393294
    7   2023-06-09  MAC4   3.129871   2.6850  72.727142  100.133454   0.408941
    8   2023-06-08  MAC5  18.987204  18.1676  57.660920   98.760819   1.219735
    9   2023-06-09  MAC5  19.158136  18.1766  61.598102   93.546477   1.270471
    10  2023-06-08  MACN  38.190817  35.2416  68.589171  160.597307   2.531353
    11  2023-06-09  MACN  38.393878  34.3196  67.710298  139.854140   2.675000

For this dataframe, I would like to know how many SENSORS of each MACHINE show a HIGH-LOW situation on the latest date.

For MAC1, today's (2023-06-09) SENS_1 value is greater than yesterday's (2023-06-08) value. The same holds for SENS_2, SENS_3 and SENS_N.

So what I need to return is a list (df) of MACHINEs with the names and the number of SENSORS that meet the condition:

| MACHINE | SENSORS | N. of SENSORS |
|:---- |:------:| -----:|
| MAC1  | SENS_1,SENS_2,SENS_3,SENS_N    | 4 |
| MAC2  | SENS_1,SENS_3,SENS_4    | 3 |
| MAC3  | SENS_1,SENS_2,SENS_3,SENS_4,SENS_N    | 5 |
| MAC4 | SENS_1,SENS_2,SENS_3,SENS_4,SENS_N    | 5 |
| MAC5 | SENS_1,SENS_2,SENS_3,SENS_N    | 4 |
| MACN | SENS_1,SENS_N    | 2 |

I can use:

    high_lows = df.loc[(df['MAC'] == df['MAC'].shift(-1)) & (df['SENS_5'] < df['SENS_5'].shift(-1)), 'MAC']

to identify SENSORS with high-lows one by one. However, there are almost 500 machines and 150 sensors.

So I think I need to iterate over the columns, selecting the SENSOR columns after the MAC column:

    df.iloc[:, 2:]

and check the condition.

However, this function will run every 2000 ms (2 seconds), so it has to be as fast as possible.

What is the most efficient way to iterate over N columns for N groups and extract the groups that match the condition?

Example data with NaNs:

               Date    MAC     SENS_1   SENS_2     SENS_3      SENS_4     SENS_5
    82   2023-06-08  MAC41  79.608346      NaN  66.181849  149.590573  12.135441
    83   2023-06-09  MAC41  80.655564      NaN  67.636943  109.034809  12.278676

Answer 1

Score: 2

You can use:

    import pandas as pd

    def count(df):
        # True for each sensor column that rises between every pair of
        # consecutive rows (the NaN first row of diff() is filled with True)
        m = df.iloc[:, 2:].diff().fillna(True).gt(0).all(axis=0)
        return pd.Series({'Sensors': ', '.join(m.index[m]), 'Count': m.sum()})

    out = df.groupby('MAC', as_index=False).apply(count)

Output:

    >>> out
        MAC                                 Sensors  Count
    0  MAC1          SENS_1, SENS_2, SENS_3, SENS_N      4
    1  MAC2                  SENS_1, SENS_3, SENS_4      3
    2  MAC3  SENS_1, SENS_2, SENS_3, SENS_4, SENS_N      5
    3  MAC4  SENS_1, SENS_2, SENS_3, SENS_4, SENS_N      5
    4  MAC5          SENS_1, SENS_2, SENS_3, SENS_N      4
    5  MACN                          SENS_1, SENS_N      2

Alternative (as you have only today and yesterday rows per machine):

    def count(df):
        # Second row (today) minus first row (yesterday), sensor columns only
        m = df.iloc[1, 2:] - df.iloc[0, 2:] > 0
        return pd.Series({'Sensors': ', '.join(m.index[m]), 'Count': m.sum()})

    out = df.groupby('MAC', as_index=False).apply(count)
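
If the two-rows-per-machine assumption holds, the same comparison can also be run on the underlying NumPy arrays, without a per-group `apply`. That may matter at roughly 500 machines by 150 sensors. The following is only a sketch under that assumption: `rising_sensors_two_rows` and its `key`/`date` parameters are names made up for this illustration, and it is worth timing against the versions above on real data before relying on it.

```python
import numpy as np
import pandas as pd

def rising_sensors_two_rows(df, key="MAC", date="Date"):
    """List, per machine, the sensors whose second-row value exceeds the first."""
    sensor_cols = [c for c in df.columns if c not in (key, date)]
    # Sort so each machine's rows are adjacent, oldest date first.
    ordered = df.sort_values([key, date], kind="stable")
    machines = ordered[key].to_numpy()
    first, second = machines[0::2], machines[1::2]
    # Guard the assumption: exactly one (yesterday, today) pair per machine.
    paired = len(machines) % 2 == 0 and (first == second).all()
    if not paired or not pd.Index(first).is_unique:
        raise ValueError("expected exactly two rows per machine")
    values = ordered[sensor_cols].to_numpy(dtype=float)
    # Comparisons involving NaN are False, so missing readings never count.
    mask = values[1::2] > values[0::2]
    names = np.array(sensor_cols, dtype=object)
    return pd.DataFrame({
        key: first,
        "Sensors": [", ".join(names[row]) for row in mask],
        "Count": mask.sum(axis=1),
    })

# Two machines taken from the question's sample data.
sample = pd.DataFrame({
    "Date": ["2023-06-08", "2023-06-09", "2023-06-08", "2023-06-09"],
    "MAC": ["MAC1", "MAC1", "MAC2", "MAC2"],
    "SENS_1": [79.608346, 80.655564, 5.323341, 5.375561],
    "SENS_2": [64.2030, 64.6430, 4.7814, 4.7788],
    "SENS_3": [66.181849, 67.636943, 57.231430, 63.701169],
    "SENS_4": [149.590573, 109.034809, 51.422297, 88.638871],
    "SENS_N": [12.135441, 12.278676, 0.494853, 0.471265],
})
out = rising_sensors_two_rows(sample)
```

The explicit check raises an error if any machine has more or fewer than two rows, so a violated assumption does not silently pair up rows from different machines.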

Answer 2

Score: 1

    columns = df.columns[2:]

    diffs = (
       df.groupby('MAC', sort=False)[columns]
         .diff().dropna().gt(0)
         .where(lambda df: ~df, columns.to_series(), axis=1) # Replace `True` with column name
         .replace(False, float('nan'))
    )

    count = diffs.count(axis=1)
    str_concat = diffs.agg(lambda cols: ','.join(cols.dropna()), axis=1)

    out = df.loc[count.index, ['MAC']]

    out['SENSORS'] = str_concat
    out['N. OF SENSORS'] = count

Output:

         MAC                             SENSORS  N. OF SENSORS
    1   MAC1         SENS_1,SENS_2,SENS_3,SENS_N              4
    3   MAC2                SENS_1,SENS_3,SENS_4              3
    5   MAC3  SENS_1,SENS_2,SENS_3,SENS_4,SENS_N              5
    7   MAC4  SENS_1,SENS_2,SENS_3,SENS_4,SENS_N              5
    9   MAC5         SENS_1,SENS_2,SENS_3,SENS_N              4
    11  MACN                       SENS_1,SENS_N              2
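
One caveat with the snippet above: `.dropna()` removes every row that contains a NaN, so a machine with a sensor that has no readings (like SENS_2 for MAC41 in the question's NaN example) would drop out of the result entirely. A possible variation, sketched below, keeps only the latest row per machine instead and relies on comparisons with NaN evaluating to False. The function name `rising_sensors` and its `key`/`date` parameters are illustrative and not part of either answer. The sketch compares only the latest row with the one before it.

```python
import numpy as np
import pandas as pd

def rising_sensors(df, key="MAC", date="Date"):
    """For each machine, list sensors whose latest value exceeds the previous one."""
    sensor_cols = [c for c in df.columns if c not in (key, date)]
    # Sort so the last row of every machine holds its latest date.
    ordered = df.sort_values([key, date], kind="stable")
    # Change from the previous row within each machine (first row is NaN).
    diffs = ordered.groupby(key, sort=False)[sensor_cols].diff()
    # Boolean row selector for the latest row of each machine.
    latest = ~ordered.duplicated(subset=key, keep="last")
    # NaN > 0 is False, so sensors without readings are simply not listed.
    mask = diffs[latest].gt(0).to_numpy()
    names = np.array(sensor_cols, dtype=object)
    return pd.DataFrame({
        key: ordered.loc[latest, key].to_numpy(),
        "SENSORS": [",".join(names[row]) for row in mask],
        "N. OF SENSORS": mask.sum(axis=1),
    })

# The NaN example from the question (MAC41 has no SENS_2 readings).
sample = pd.DataFrame({
    "Date": ["2023-06-08", "2023-06-09"],
    "MAC": ["MAC41", "MAC41"],
    "SENS_1": [79.608346, 80.655564],
    "SENS_2": [float("nan"), float("nan")],
    "SENS_3": [66.181849, 67.636943],
    "SENS_4": [149.590573, 109.034809],
    "SENS_5": [12.135441, 12.278676],
})
out = rising_sensors(sample)
```

Here MAC41 stays in the result with SENS_1, SENS_3 and SENS_5 listed, while SENS_2 is skipped because its difference is NaN.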

Posted by huangapple on June 11, 2023 at 19:58:30.
Link to this article: https://go.coder-hub.com/76450371.html