2023年3月31日 18:47:43go评论65阅读模式

英文:

Pandas label column values between 100 and 160 as Y, and between 100 and 300 as N based on timestamp column

问题

我有一个时间序列数据，每10秒记录一次数据，其中一个列的值在大约2分钟内从100到160的范围内开始，然后一段时间后，另一个范围的值在100到300+之间开始。我想将第一个范围的值标记为Y，第二个范围的值标记为N。
问题在于第二个范围（100-300+）也包括第一个范围（100-160）的值。以下是带有所需Label列的数据快照：

    时间戳                 数据      标签
    2022-09-20 14:57:28.900	13.656	
    2022-09-20 14:57:38.900	21.1306	
    2022-09-20 14:58:39.200	75.877	
    2022-09-20 14:58:49.200	85.3981	
    2022-09-20 14:58:59.200	98.7678	
    2022-09-20 14:59:09.300	107.11	    Y
    2022-09-20 14:59:19.300	125.618	    Y
    2022-09-20 14:59:29.400	126.108	    Y
    2022-09-20 14:59:39.400	124.506	    Y
    2022-09-20 14:59:49.400	124.172	    Y
    2022-09-20 14:59:59.500	124.528	    Y
    2022-09-20 15:00:09.500	121.191	    Y
    2022-09-20 15:00:19.500	113.049	    Y
    2022-09-20 15:00:29.500	91.2932	
    2022-09-20 15:00:39.600	76.8781	
    2022-09-20 15:00:49.600	55.4778	
    2022-09-20 15:00:59.600	41.0849	
    2022-09-20 15:02:09.800	8.02791	
    2022-09-20 15:03:00.000	27.2703	
    2022-09-20 15:03:10.000	36.658	
    2022-09-20 15:04:10.100	83.0846	
    2022-09-20 15:04:20.100	101.4913	N
    2022-09-20 15:05:40.400	152.869	    N
    2022-09-20 15:05:50.400	161.967	    N
    2022-09-20 15:06:00.400	166.862	    N
    2022-09-20 15:08:40.900	294.93	    N
    2022-09-20 15:08:50.900	280.092	    N
    2022-09-20 15:09:00.900	261.405	    N
    2022-09-20 15:09:11.000	237.291	    N
    2022-09-20 15:09:21.000	219.584	    N
    2022-09-20 15:09:31.000	191.888	    N
    2022-09-20 15:09:41.100	172.979	    N
    2022-09-20 15:09:51.100	144.505	    N
    2022-09-20 15:10:01.100	125.596	    N
    2022-09-20 15:10:11.100	102.883	    N
    2022-09-20 15:11:11.300	19.6846	
    2022-09-20 15:11:21.400	17.816	
    2022-09-20 15:11:31.400	27.8932	
    2022-09-20 15:11:41.400	23.1549	
    2022-09-20 15:11:51.400	14.4569

有什么帮助吗？

英文:

I have timeseries data with 10 secondly data, in which a column's values start in the range of 100 and 160 for around 2 minutes and then after some time another range of values start between 100 and 300+, I want to label first range values as Y and second range values as N.
The problem is that second range (100-300+) also includes values of first range (100-160). Below is snap of data with the required Label column:

TimeStamp	Data	Label
2022-09-20 14:57:28.900	13.656	
2022-09-20 14:57:38.900	21.1306	
2022-09-20 14:58:39.200	75.877	
2022-09-20 14:58:49.200	85.3981	
2022-09-20 14:58:59.200	98.7678	
2022-09-20 14:59:09.300	107.11	Y
2022-09-20 14:59:19.300	125.618	Y
2022-09-20 14:59:29.400	126.108	Y
2022-09-20 14:59:39.400	124.506	Y
2022-09-20 14:59:49.400	124.172	Y
2022-09-20 14:59:59.500	124.528	Y
2022-09-20 15:00:09.500	121.191	Y
2022-09-20 15:00:19.500	113.049	Y
2022-09-20 15:00:29.500	91.2932	
2022-09-20 15:00:39.600	76.8781	
2022-09-20 15:00:49.600	55.4778	
2022-09-20 15:00:59.600	41.0849	
2022-09-20 15:02:09.800	8.02791	
2022-09-20 15:03:00.000	27.2703	
2022-09-20 15:03:10.000	36.658	
2022-09-20 15:04:10.100	83.0846	
2022-09-20 15:04:20.100	101.4913	N
2022-09-20 15:05:40.400	152.869	N
2022-09-20 15:05:50.400	161.967	N
2022-09-20 15:06:00.400	166.862	N
2022-09-20 15:08:40.900	294.93	N
2022-09-20 15:08:50.900	280.092	N
2022-09-20 15:09:00.900	261.405	N
2022-09-20 15:09:11.000	237.291	N
2022-09-20 15:09:21.000	219.584	N
2022-09-20 15:09:31.000	191.888	N
2022-09-20 15:09:41.100	172.979	N
2022-09-20 15:09:51.100	144.505	N
2022-09-20 15:10:01.100	125.596	N
2022-09-20 15:10:11.100	102.883	N
2022-09-20 15:11:11.300	19.6846	
2022-09-20 15:11:21.400	17.816	
2022-09-20 15:11:31.400	27.8932	
2022-09-20 15:11:41.400	23.1549	
2022-09-20 15:11:51.400	14.4569

Any help please?

答案1

得分: 1

以下是代码部分的翻译：

import pandas as pd
import numpy as np
from pprint import pprint

df = pd.read_csv('2.txt', sep='\s+')
# 我们将所有Label大于等于100的行标记为'Y'。
df['mark'] = np.where(df['Label'] >= 100, 'Y', '')
# 从pandas.Series中选择这些行。
s = df[df['mark'] == 'Y']['Label']
# 我们根据索引的均匀变化将其分成组。
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# 在最大值大于160的组中替换原始数据框中的值。
for name, group in grouped:
    if group.max() > 160:
        group.loc[:] = 'N'
        df['mark'].update(group)
pprint(df)

希望这对你有所帮助。

英文:

As a variant of the solution - splitting into groups by jumps in indexing.

2.txt

TimeStamp   Data    Label
2022-09-20 14:57:28.900 13.656
2022-09-20 14:57:38.900 21.1306
2022-09-20 14:58:39.200 75.877
2022-09-20 14:58:49.200 85.3981
2022-09-20 14:58:59.200 98.7678
2022-09-20 14:59:09.300 107.11
2022-09-20 14:59:19.300 125.618
2022-09-20 14:59:29.400 126.108
2022-09-20 14:59:39.400 124.506
2022-09-20 14:59:49.400 124.172
2022-09-20 14:59:59.500 124.528
2022-09-20 15:00:09.500 121.191
2022-09-20 15:00:19.500 113.049
2022-09-20 15:00:29.500 91.2932
2022-09-20 15:00:39.600 76.8781
2022-09-20 15:00:49.600 55.4778
2022-09-20 15:00:59.600 41.0849
2022-09-20 15:02:09.800 8.02791
2022-09-20 15:03:00.000 27.2703
2022-09-20 15:03:10.000 36.658
2022-09-20 15:04:10.100 83.0846
2022-09-20 15:04:20.100 101.4913
2022-09-20 15:05:40.400 152.869
2022-09-20 15:05:50.400 161.967
2022-09-20 15:06:00.400 166.862
2022-09-20 15:08:40.900 294.930
2022-09-20 15:08:50.900 280.092
2022-09-20 15:09:00.900 261.405
2022-09-20 15:09:11.000 237.291
2022-09-20 15:09:21.000 219.584
2022-09-20 15:09:31.000 191.888
2022-09-20 15:09:41.100 172.979
2022-09-20 15:09:51.100 144.505
2022-09-20 15:10:01.100 125.596
2022-09-20 15:10:11.100 102.883
2022-09-20 15:11:11.300 19.6846
2022-09-20 15:11:21.400 17.8160
2022-09-20 15:11:31.400 27.8932
2022-09-20 15:11:41.400 23.1549
2022-09-20 15:11:51.400 14.4569

main.py

import pandas as pd
import numpy as np
from pprint import pprint


df = pd.read_csv(&#39;2.txt&#39;, sep=&#39;\s+&#39;)
# We mark all lines greater than 100 with the value &#39;Y&#39; .
df[&#39;mark&#39;] = np.where(df[&#39;Label&#39;] &gt;= 100, &#39;Y&#39;, &#39;&#39;)
# Select these rows from a column in pandas.Series.
s = df[df[&#39;mark&#39;] == &#39;Y&#39;][&#39;Label&#39;]
# We divide it into groups according to a uniform change in the index.
grouped = s.groupby(s.index.to_series().diff().ne(1).cumsum())
# We replace the values in the original dataframe in groups where the maximum value is &gt; 160.
for name, group in grouped:
    if group.max() &gt; 160:
        group.loc[:] = &#39;N&#39;
        df[&#39;mark&#39;].update(group)
pprint(df)

------------------------------

     TimeStamp          Data      Label mark
0   2022-09-20  14:57:28.900   13.65600     
1   2022-09-20  14:57:38.900   21.13060     
2   2022-09-20  14:58:39.200   75.87700     
3   2022-09-20  14:58:49.200   85.39810     
4   2022-09-20  14:58:59.200   98.76780     
5   2022-09-20  14:59:09.300  107.11000    Y
6   2022-09-20  14:59:19.300  125.61800    Y
7   2022-09-20  14:59:29.400  126.10800    Y
8   2022-09-20  14:59:39.400  124.50600    Y
9   2022-09-20  14:59:49.400  124.17200    Y
10  2022-09-20  14:59:59.500  124.52800    Y
11  2022-09-20  15:00:09.500  121.19100    Y
12  2022-09-20  15:00:19.500  113.04900    Y
13  2022-09-20  15:00:29.500   91.29320     
14  2022-09-20  15:00:39.600   76.87810     
15  2022-09-20  15:00:49.600   55.47780     
16  2022-09-20  15:00:59.600   41.08490     
17  2022-09-20  15:02:09.800    8.02791     
18  2022-09-20  15:03:00.000   27.27030     
19  2022-09-20  15:03:10.000   36.65800     
20  2022-09-20  15:04:10.100   83.08460     
21  2022-09-20  15:04:20.100  101.49130    N
22  2022-09-20  15:05:40.400  152.86900    N
23  2022-09-20  15:05:50.400  161.96700    N
24  2022-09-20  15:06:00.400  166.86200    N
25  2022-09-20  15:08:40.900  294.93000    N
26  2022-09-20  15:08:50.900  280.09200    N
27  2022-09-20  15:09:00.900  261.40500    N
28  2022-09-20  15:09:11.000  237.29100    N
29  2022-09-20  15:09:21.000  219.58400    N
30  2022-09-20  15:09:31.000  191.88800    N
31  2022-09-20  15:09:41.100  172.97900    N
32  2022-09-20  15:09:51.100  144.50500    N
33  2022-09-20  15:10:01.100  125.59600    N
34  2022-09-20  15:10:11.100  102.88300    N
35  2022-09-20  15:11:11.300   19.68460     
36  2022-09-20  15:11:21.400   17.81600     
37  2022-09-20  15:11:31.400   27.89320     
38  2022-09-20  15:11:41.400   23.15490     
39  2022-09-20  15:11:51.400   14.45690

答案2

得分: 0

一个应用于系列的lambda函数在这里发挥作用。

df["new_column"] = df.TimeStamp.apply(lambda x: my_condition(x))

那个 `my_condition` 函数是你可以自己定义的。它可能会像这样：

def my_condition(x):
    if x <= my_datetime:
        return "Y"
    return "N"

或者你可以一行完成所有操作

df["new_column"] = df.TimeStamp.apply(lambda x: "Y" if x <= my_datetime else "N")

**编辑**
-
在问题编辑之后，需要一个新的方法。我们需要能够识别在满足条件 Y（在 100 和 160 之间）的一系列行的第一次出现。

我会通过找到满足条件的第一个实例来解决问题。

idx_start = ((df.Data >= 100) & (df.Data <= 160)).idxmax()

这个方法的工作原理是布尔条件创建一个布尔掩码；每一行都得到 true 或 false。idxmax() 将布尔值转换为整数（1 或 0），并找到最大值的第一个实例（即 true 的第一个实例，也就是第一个 1 的实例）。

接下来，找到条件 *不再满足* 的第一个实例。在这种情况下，它是值不在范围 100-160 内的第一个实例，但前一行在该范围内。

idx_end = ((df.Data < 100) | (df.Data > 160)) & ((df.shift(1).Data >= 100) & (df.shift(1).Data <= 160))

布尔掩码方法是相同的，但使用 shift 函数来识别前一行，以对其应用条件。

一旦你有了起始和结束索引，将行标记为 `Y` 就很容易了。

英文:

A lambda function applied to a series does the trick here.

df[&quot;new_column&quot;] = df.TimeStamp.apply(lambda x: my_condition(x))

That my_condition function is one you can define. It might do something like:

def my_condition(x):
    if x &lt;= my_datetime:
        return &quot;Y&quot;
    return &quot;N&quot;

Or you can do it all in one line

df[&quot;new_column&quot;] = df.TimeStamp.apply(lambda x: &quot;Y&quot; if x &lt;= my_datetime else &quot;N&quot;)

EDIT

After the question is edited, it's clear a new approach is needed. We need to be able to identify the first occurrence of a sequence of rows for which the condition Y is meet (between 100 and 160).

I'd approach by finding the first instance where the condition is met.

idx_start = ((df.Data &gt;= 100) &amp; (df.Data &lt;= 160)).idxmax()

The way this works is the boolean conditions create a boolean mask; every row gets either true or false. The idxmax() converts the bools to ints (1 or 0) and finds the first instance of the max value (i.e. the first instance of 1, which is the first instance of true)

Next up find the first instance where the condition is no longer met. In this case, it's the first instance where the value is not in the range 100-160, but the previous row IS in the range.

idx_end = ((df.Data &lt; 100) | (df.Data &gt; 160)) &amp; ((df.shift(1).Data &gt;= 100) &amp; (df.shift(1).Data &lt;= 160))

The boolean mask approach is identical, but the shift function is used to identify the previous row, to apply a condition to it.

Once you have the start and end indices, it's trivial to mark the rows as Y

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Pandas label column values between 100 and 160 as Y, and between 100 and 300 as N based on timestamp column

问题

答案1

答案2

EDIT

从.env文件中导出变量并从os.environ中获取它们，不使用python-dotenv。

可以在最后一个整数后拆分pandas列吗？

限制几个轴中的一个轴的范围。

Python：这段Python代码（requests）有什么问题？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论