Pandas label column values between 100 and 160 as Y, and between 100 and 300 as N based on timestamp column

huangapple go评论65阅读模式
英文:

Pandas label column values between 100 and 160 as Y, and between 100 and 300 as N based on timestamp column

问题

我有一个时间序列数据,每10秒记录一次数据,其中一个列的值在大约2分钟内从100到160的范围内开始,然后一段时间后,另一个范围的值在100到300+之间开始。我想将第一个范围的值标记为Y,第二个范围的值标记为N。
问题在于第二个范围(100-300+)也包括第一个范围(100-160)的值。以下是带有所需Label列的数据快照:

    时间戳                 数据      标签
    2022-09-20 14:57:28.900	13.656	
    2022-09-20 14:57:38.900	21.1306	
    2022-09-20 14:58:39.200	75.877	
    2022-09-20 14:58:49.200	85.3981	
    2022-09-20 14:58:59.200	98.7678	
    2022-09-20 14:59:09.300	107.11	    Y
    2022-09-20 14:59:19.300	125.618	    Y
    2022-09-20 14:59:29.400	126.108	    Y
    2022-09-20 14:59:39.400	124.506	    Y
    2022-09-20 14:59:49.400	124.172	    Y
    2022-09-20 14:59:59.500	124.528	    Y
    2022-09-20 15:00:09.500	121.191	    Y
    2022-09-20 15:00:19.500	113.049	    Y
    2022-09-20 15:00:29.500	91.2932	
    2022-09-20 15:00:39.600	76.8781	
    2022-09-20 15:00:49.600	55.4778	
    2022-09-20 15:00:59.600	41.0849	
    2022-09-20 15:02:09.800	8.02791	
    2022-09-20 15:03:00.000	27.2703	
    2022-09-20 15:03:10.000	36.658	
    2022-09-20 15:04:10.100	83.0846	
    2022-09-20 15:04:20.100	101.4913	N
    2022-09-20 15:05:40.400	152.869	    N
    2022-09-20 15:05:50.400	161.967	    N
    2022-09-20 15:06:00.400	166.862	    N
    2022-09-20 15:08:40.900	294.93	    N
    2022-09-20 15:08:50.900	280.092	    N
    2022-09-20 15:09:00.900	261.405	    N
    2022-09-20 15:09:11.000	237.291	    N
    2022-09-20 15:09:21.000	219.584	    N
    2022-09-20 15:09:31.000	191.888	    N
    2022-09-20 15:09:41.100	172.979	    N
    2022-09-20 15:09:51.100	144.505	    N
    2022-09-20 15:10:01.100	125.596	    N
    2022-09-20 15:10:11.100	102.883	    N
    2022-09-20 15:11:11.300	19.6846	
    2022-09-20 15:11:21.400	17.816	
    2022-09-20 15:11:31.400	27.8932	
    2022-09-20 15:11:41.400	23.1549	
    2022-09-20 15:11:51.400	14.4569	

有什么帮助吗?

英文:

I have timeseries data with 10 secondly data, in which a column's values start in the range of 100 and 160 for around 2 minutes and then after some time another range of values start between 100 and 300+, I want to label first range values as Y and second range values as N.
The problem is that second range (100-300+) also includes values of first range (100-160). Below is snap of data with the required Label column:

TimeStamp	Data	Label
2022-09-20 14:57:28.900	13.656	
2022-09-20 14:57:38.900	21.1306	
2022-09-20 14:58:39.200	75.877	
2022-09-20 14:58:49.200	85.3981	
2022-09-20 14:58:59.200	98.7678	
2022-09-20 14:59:09.300	107.11	Y
2022-09-20 14:59:19.300	125.618	Y
2022-09-20 14:59:29.400	126.108	Y
2022-09-20 14:59:39.400	124.506	Y
2022-09-20 14:59:49.400	124.172	Y
2022-09-20 14:59:59.500	124.528	Y
2022-09-20 15:00:09.500	121.191	Y
2022-09-20 15:00:19.500	113.049	Y
2022-09-20 15:00:29.500	91.2932	
2022-09-20 15:00:39.600	76.8781	
2022-09-20 15:00:49.600	55.4778	
2022-09-20 15:00:59.600	41.0849	
2022-09-20 15:02:09.800	8.02791	
2022-09-20 15:03:00.000	27.2703	
2022-09-20 15:03:10.000	36.658	
2022-09-20 15:04:10.100	83.0846	
2022-09-20 15:04:20.100	101.4913	N
2022-09-20 15:05:40.400	152.869	N
2022-09-20 15:05:50.400	161.967	N
2022-09-20 15:06:00.400	166.862	N
2022-09-20 15:08:40.900	294.93	N
2022-09-20 15:08:50.900	280.092	N
2022-09-20 15:09:00.900	261.405	N
2022-09-20 15:09:11.000	237.291	N
2022-09-20 15:09:21.000	219.584	N
2022-09-20 15:09:31.000	191.888	N
2022-09-20 15:09:41.100	172.979	N
2022-09-20 15:09:51.100	144.505	N
2022-09-20 15:10:01.100	125.596	N
2022-09-20 15:10:11.100	102.883	N
2022-09-20 15:11:11.300	19.6846	
2022-09-20 15:11:21.400	17.816	
2022-09-20 15:11:31.400	27.8932	
2022-09-20 15:11:41.400	23.1549	
2022-09-20 15:11:51.400	14.4569	

Any help please?

答案1

得分: 1

以下是代码部分的翻译:

import pandas as pd
import numpy as np
from pprint import pprint

df = pd.read_csv('2.txt', sep='\s+')
# 我们将所有Label大于等于100的行标记为'Y'。
df['mark'] = np.where(df['Label'] >= 100, 'Y', '')
# 从pandas.Series中选择这些行。
s = df[df['mark'] == 'Y']['Label']
# 我们根据索引的均匀变化将其分成组。
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# 在最大值大于160的组中替换原始数据框中的值。
for name, group in grouped:
    if group.max() > 160:
        group.loc[:] = 'N'
        df['mark'].update(group)
pprint(df)

希望这对你有所帮助。

英文:

As a variant of the solution - splitting into groups by jumps in indexing.

2.txt
TimeStamp   Data    Label
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11
2022-09-20 14:59:19.300 125.618
2022-09-20 14:59:29.400 126.108
2022-09-20 14:59:39.400 124.506
2022-09-20 14:59:49.400 124.172
2022-09-20 14:59:59.500 124.528
2022-09-20 15:00:09.500 121.191
2022-09-20 15:00:19.500 113.049
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913
2022-09-20 15:05:40.400 152.869
2022-09-20 15:05:50.400 161.967
2022-09-20 15:06:00.400 166.862
2022-09-20 15:08:40.900 294.930
2022-09-20 15:08:50.900 280.092
2022-09-20 15:09:00.900 261.405
2022-09-20 15:09:11.000 237.291
2022-09-20 15:09:21.000 219.584
2022-09-20 15:09:31.000 191.888
2022-09-20 15:09:41.100 172.979
2022-09-20 15:09:51.100 144.505
2022-09-20 15:10:01.100 125.596
2022-09-20 15:10:11.100 102.883
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.8160
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569
main.py
import pandas as pd
import numpy as np
from pprint import pprint


df = pd.read_csv('2.txt', sep='\s+')
# We mark all lines greater than 100 with the value 'Y' .
df['mark'] = np.where(df['Label'] >= 100, 'Y', '')
# Select these rows from a column in pandas.Series.
s = df[df['mark'] == 'Y']['Label']
# We divide it into groups according to a uniform change in the index.
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# We replace the values in the original dataframe in groups where the maximum value is > 160.
for name, group in grouped:
    if group.max() > 160:
        group.loc[:] = 'N'
        df['mark'].update(group)
pprint(df)

------------------------------

     TimeStamp          Data      Label mark
0   2022-09-20  14:57:28.900   13.65600     
1   2022-09-20  14:57:38.900   21.13060     
2   2022-09-20  14:58:39.200   75.87700     
3   2022-09-20  14:58:49.200   85.39810     
4   2022-09-20  14:58:59.200   98.76780     
5   2022-09-20  14:59:09.300  107.11000    Y
6   2022-09-20  14:59:19.300  125.61800    Y
7   2022-09-20  14:59:29.400  126.10800    Y
8   2022-09-20  14:59:39.400  124.50600    Y
9   2022-09-20  14:59:49.400  124.17200    Y
10  2022-09-20  14:59:59.500  124.52800    Y
11  2022-09-20  15:00:09.500  121.19100    Y
12  2022-09-20  15:00:19.500  113.04900    Y
13  2022-09-20  15:00:29.500   91.29320     
14  2022-09-20  15:00:39.600   76.87810     
15  2022-09-20  15:00:49.600   55.47780     
16  2022-09-20  15:00:59.600   41.08490     
17  2022-09-20  15:02:09.800    8.02791     
18  2022-09-20  15:03:00.000   27.27030     
19  2022-09-20  15:03:10.000   36.65800     
20  2022-09-20  15:04:10.100   83.08460     
21  2022-09-20  15:04:20.100  101.49130    N
22  2022-09-20  15:05:40.400  152.86900    N
23  2022-09-20  15:05:50.400  161.96700    N
24  2022-09-20  15:06:00.400  166.86200    N
25  2022-09-20  15:08:40.900  294.93000    N
26  2022-09-20  15:08:50.900  280.09200    N
27  2022-09-20  15:09:00.900  261.40500    N
28  2022-09-20  15:09:11.000  237.29100    N
29  2022-09-20  15:09:21.000  219.58400    N
30  2022-09-20  15:09:31.000  191.88800    N
31  2022-09-20  15:09:41.100  172.97900    N
32  2022-09-20  15:09:51.100  144.50500    N
33  2022-09-20  15:10:01.100  125.59600    N
34  2022-09-20  15:10:11.100  102.88300    N
35  2022-09-20  15:11:11.300   19.68460     
36  2022-09-20  15:11:21.400   17.81600     
37  2022-09-20  15:11:31.400   27.89320     
38  2022-09-20  15:11:41.400   23.15490     
39  2022-09-20  15:11:51.400   14.45690     

答案2

得分: 0

一个应用于系列的lambda函数在这里发挥作用

df["new_column"] = df.TimeStamp.apply(lambda x: my_condition(x))

那个 `my_condition` 函数是你可以自己定义的它可能会像这样

def my_condition(x):
    if x <= my_datetime:
        return "Y"
    return "N"

或者你可以一行完成所有操作

df["new_column"] = df.TimeStamp.apply(lambda x: "Y" if x <= my_datetime else "N")

**编辑**
-
在问题编辑之后需要一个新的方法我们需要能够识别在满足条件 Y100160 之间的一系列行的第一次出现

我会通过找到满足条件的第一个实例来解决问题

idx_start = ((df.Data >= 100) & (df.Data <= 160)).idxmax()

这个方法的工作原理是布尔条件创建一个布尔掩码每一行都得到 true 或 falseidxmax() 将布尔值转换为整数10),并找到最大值的第一个实例即 true 的第一个实例也就是第一个 1 的实例)。

接下来找到条件 *不再满足* 的第一个实例在这种情况下它是值不在范围 100-160 内的第一个实例但前一行在该范围内

idx_end = ((df.Data < 100) | (df.Data > 160)) & ((df.shift(1).Data >= 100) & (df.shift(1).Data <= 160))

布尔掩码方法是相同的但使用 shift 函数来识别前一行以对其应用条件

一旦你有了起始和结束索引将行标记为 `Y` 就很容易了
英文:

A lambda function applied to a series does the trick here.

df[&quot;new_column&quot;] = df.TimeStamp.apply(lambda x: my_condition(x))

That my_condition function is one you can define. It might do something like:

def my_condition(x):
    if x &lt;= my_datetime:
        return &quot;Y&quot;
    return &quot;N&quot;

Or you can do it all in one line

df[&quot;new_column&quot;] = df.TimeStamp.apply(lambda x: &quot;Y&quot; if x &lt;= my_datetime else &quot;N&quot;)

EDIT

After the question is edited, it's clear a new approach is needed. We need to be able to identify the first occurrence of a sequence of rows for which the condition Y is meet (between 100 and 160).

I'd approach by finding the first instance where the condition is met.

idx_start = ((df.Data &gt;= 100) &amp; (df.Data &lt;= 160)).idxmax()

The way this works is the boolean conditions create a boolean mask; every row gets either true or false. The idxmax() converts the bools to ints (1 or 0) and finds the first instance of the max value (i.e. the first instance of 1, which is the first instance of true)

Next up find the first instance where the condition is no longer met. In this case, it's the first instance where the value is not in the range 100-160, but the previous row IS in the range.

idx_end = ((df.Data &lt; 100) | (df.Data &gt; 160)) &amp; ((df.shift(1).Data &gt;= 100) &amp; (df.shift(1).Data &lt;= 160)) 

The boolean mask approach is identical, but the shift function is used to identify the previous row, to apply a condition to it.

Once you have the start and end indices, it's trivial to mark the rows as Y

huangapple
  • 本文由 发表于 2023年3月31日 18:47:43
  • 转载请务必保留本文链接:https://go.coder-hub.com/75897652.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定