英文:
Pandas label column values between 100 and 160 as Y, and between 100 and 300 as N based on timestamp column
问题
我有一个时间序列数据,每10秒记录一次数据,其中一个列的值在大约2分钟内从100到160的范围内开始,然后一段时间后,另一个范围的值在100到300+之间开始。我想将第一个范围的值标记为Y,第二个范围的值标记为N。
问题在于第二个范围(100-300+)也包括第一个范围(100-160)的值。以下是带有所需Label列的数据快照:
时间戳 数据 标签
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11 Y
2022-09-20 14:59:19.300 125.618 Y
2022-09-20 14:59:29.400 126.108 Y
2022-09-20 14:59:39.400 124.506 Y
2022-09-20 14:59:49.400 124.172 Y
2022-09-20 14:59:59.500 124.528 Y
2022-09-20 15:00:09.500 121.191 Y
2022-09-20 15:00:19.500 113.049 Y
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913 N
2022-09-20 15:05:40.400 152.869 N
2022-09-20 15:05:50.400 161.967 N
2022-09-20 15:06:00.400 166.862 N
2022-09-20 15:08:40.900 294.93 N
2022-09-20 15:08:50.900 280.092 N
2022-09-20 15:09:00.900 261.405 N
2022-09-20 15:09:11.000 237.291 N
2022-09-20 15:09:21.000 219.584 N
2022-09-20 15:09:31.000 191.888 N
2022-09-20 15:09:41.100 172.979 N
2022-09-20 15:09:51.100 144.505 N
2022-09-20 15:10:01.100 125.596 N
2022-09-20 15:10:11.100 102.883 N
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.816
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569
有什么帮助吗?
英文:
I have timeseries data with 10 secondly data, in which a column's values start in the range of 100 and 160 for around 2 minutes and then after some time another range of values start between 100 and 300+, I want to label first range values as Y and second range values as N.
The problem is that second range (100-300+) also includes values of first range (100-160). Below is snap of data with the required Label column:
TimeStamp Data Label
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11 Y
2022-09-20 14:59:19.300 125.618 Y
2022-09-20 14:59:29.400 126.108 Y
2022-09-20 14:59:39.400 124.506 Y
2022-09-20 14:59:49.400 124.172 Y
2022-09-20 14:59:59.500 124.528 Y
2022-09-20 15:00:09.500 121.191 Y
2022-09-20 15:00:19.500 113.049 Y
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913 N
2022-09-20 15:05:40.400 152.869 N
2022-09-20 15:05:50.400 161.967 N
2022-09-20 15:06:00.400 166.862 N
2022-09-20 15:08:40.900 294.93 N
2022-09-20 15:08:50.900 280.092 N
2022-09-20 15:09:00.900 261.405 N
2022-09-20 15:09:11.000 237.291 N
2022-09-20 15:09:21.000 219.584 N
2022-09-20 15:09:31.000 191.888 N
2022-09-20 15:09:41.100 172.979 N
2022-09-20 15:09:51.100 144.505 N
2022-09-20 15:10:01.100 125.596 N
2022-09-20 15:10:11.100 102.883 N
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.816
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569
Any help please?
答案1
得分: 1
以下是代码部分的翻译:
import pandas as pd
import numpy as np
from pprint import pprint
df = pd.read_csv('2.txt', sep='\s+')
# 我们将所有Label大于等于100的行标记为'Y'。
df['mark'] = np.where(df['Label'] >= 100, 'Y', '')
# 从pandas.Series中选择这些行。
s = df[df['mark'] == 'Y']['Label']
# 我们根据索引的均匀变化将其分成组。
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# 在最大值大于160的组中替换原始数据框中的值。
for name, group in grouped:
if group.max() > 160:
group.loc[:] = 'N'
df['mark'].update(group)
pprint(df)
希望这对你有所帮助。
英文:
As a variant of the solution - splitting into groups by jumps in indexing.
2.txt
TimeStamp Data Label
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11
2022-09-20 14:59:19.300 125.618
2022-09-20 14:59:29.400 126.108
2022-09-20 14:59:39.400 124.506
2022-09-20 14:59:49.400 124.172
2022-09-20 14:59:59.500 124.528
2022-09-20 15:00:09.500 121.191
2022-09-20 15:00:19.500 113.049
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913
2022-09-20 15:05:40.400 152.869
2022-09-20 15:05:50.400 161.967
2022-09-20 15:06:00.400 166.862
2022-09-20 15:08:40.900 294.930
2022-09-20 15:08:50.900 280.092
2022-09-20 15:09:00.900 261.405
2022-09-20 15:09:11.000 237.291
2022-09-20 15:09:21.000 219.584
2022-09-20 15:09:31.000 191.888
2022-09-20 15:09:41.100 172.979
2022-09-20 15:09:51.100 144.505
2022-09-20 15:10:01.100 125.596
2022-09-20 15:10:11.100 102.883
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.8160
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569
main.py
import pandas as pd
import numpy as np
from pprint import pprint
df = pd.read_csv('2.txt', sep='\s+')
# We mark all lines greater than 100 with the value 'Y' .
df['mark'] = np.where(df['Label'] >= 100, 'Y', '')
# Select these rows from a column in pandas.Series.
s = df[df['mark'] == 'Y']['Label']
# We divide it into groups according to a uniform change in the index.
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# We replace the values in the original dataframe in groups where the maximum value is > 160.
for name, group in grouped:
if group.max() > 160:
group.loc[:] = 'N'
df['mark'].update(group)
pprint(df)
------------------------------
TimeStamp Data Label mark
0 2022-09-20 14:57:28.900 13.65600
1 2022-09-20 14:57:38.900 21.13060
2 2022-09-20 14:58:39.200 75.87700
3 2022-09-20 14:58:49.200 85.39810
4 2022-09-20 14:58:59.200 98.76780
5 2022-09-20 14:59:09.300 107.11000 Y
6 2022-09-20 14:59:19.300 125.61800 Y
7 2022-09-20 14:59:29.400 126.10800 Y
8 2022-09-20 14:59:39.400 124.50600 Y
9 2022-09-20 14:59:49.400 124.17200 Y
10 2022-09-20 14:59:59.500 124.52800 Y
11 2022-09-20 15:00:09.500 121.19100 Y
12 2022-09-20 15:00:19.500 113.04900 Y
13 2022-09-20 15:00:29.500 91.29320
14 2022-09-20 15:00:39.600 76.87810
15 2022-09-20 15:00:49.600 55.47780
16 2022-09-20 15:00:59.600 41.08490
17 2022-09-20 15:02:09.800 8.02791
18 2022-09-20 15:03:00.000 27.27030
19 2022-09-20 15:03:10.000 36.65800
20 2022-09-20 15:04:10.100 83.08460
21 2022-09-20 15:04:20.100 101.49130 N
22 2022-09-20 15:05:40.400 152.86900 N
23 2022-09-20 15:05:50.400 161.96700 N
24 2022-09-20 15:06:00.400 166.86200 N
25 2022-09-20 15:08:40.900 294.93000 N
26 2022-09-20 15:08:50.900 280.09200 N
27 2022-09-20 15:09:00.900 261.40500 N
28 2022-09-20 15:09:11.000 237.29100 N
29 2022-09-20 15:09:21.000 219.58400 N
30 2022-09-20 15:09:31.000 191.88800 N
31 2022-09-20 15:09:41.100 172.97900 N
32 2022-09-20 15:09:51.100 144.50500 N
33 2022-09-20 15:10:01.100 125.59600 N
34 2022-09-20 15:10:11.100 102.88300 N
35 2022-09-20 15:11:11.300 19.68460
36 2022-09-20 15:11:21.400 17.81600
37 2022-09-20 15:11:31.400 27.89320
38 2022-09-20 15:11:41.400 23.15490
39 2022-09-20 15:11:51.400 14.45690
答案2
得分: 0
一个应用于系列的lambda函数在这里发挥作用。
df["new_column"] = df.TimeStamp.apply(lambda x: my_condition(x))
那个 `my_condition` 函数是你可以自己定义的。它可能会像这样:
def my_condition(x):
if x <= my_datetime:
return "Y"
return "N"
或者你可以一行完成所有操作
df["new_column"] = df.TimeStamp.apply(lambda x: "Y" if x <= my_datetime else "N")
**编辑**
-
在问题编辑之后,需要一个新的方法。我们需要能够识别在满足条件 Y(在 100 和 160 之间)的一系列行的第一次出现。
我会通过找到满足条件的第一个实例来解决问题。
idx_start = ((df.Data >= 100) & (df.Data <= 160)).idxmax()
这个方法的工作原理是布尔条件创建一个布尔掩码;每一行都得到 true 或 false。idxmax() 将布尔值转换为整数(1 或 0),并找到最大值的第一个实例(即 true 的第一个实例,也就是第一个 1 的实例)。
接下来,找到条件 *不再满足* 的第一个实例。在这种情况下,它是值不在范围 100-160 内的第一个实例,但前一行在该范围内。
idx_end = ((df.Data < 100) | (df.Data > 160)) & ((df.shift(1).Data >= 100) & (df.shift(1).Data <= 160))
布尔掩码方法是相同的,但使用 shift 函数来识别前一行,以对其应用条件。
一旦你有了起始和结束索引,将行标记为 `Y` 就很容易了。
英文:
A lambda function applied to a series does the trick here.
df["new_column"] = df.TimeStamp.apply(lambda x: my_condition(x))
That my_condition
function is one you can define. It might do something like:
def my_condition(x):
if x <= my_datetime:
return "Y"
return "N"
Or you can do it all in one line
df["new_column"] = df.TimeStamp.apply(lambda x: "Y" if x <= my_datetime else "N")
EDIT
After the question is edited, it's clear a new approach is needed. We need to be able to identify the first occurrence of a sequence of rows for which the condition Y is meet (between 100 and 160).
I'd approach by finding the first instance where the condition is met.
idx_start = ((df.Data >= 100) & (df.Data <= 160)).idxmax()
The way this works is the boolean conditions create a boolean mask; every row gets either true or false. The idxmax() converts the bools to ints (1 or 0) and finds the first instance of the max value (i.e. the first instance of 1, which is the first instance of true)
Next up find the first instance where the condition is no longer met. In this case, it's the first instance where the value is not in the range 100-160, but the previous row IS in the range.
idx_end = ((df.Data < 100) | (df.Data > 160)) & ((df.shift(1).Data >= 100) & (df.shift(1).Data <= 160))
The boolean mask approach is identical, but the shift function is used to identify the previous row, to apply a condition to it.
Once you have the start and end indices, it's trivial to mark the rows as Y
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论