2023-05-13 18:47:13

Determine transitions and durations across rows in a data frame

Question

Consider the following data frame:

import pandas as pd
from numpy import nan
df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan]
})

Each row represents an id and each column a state for a given month. For the sake of simplicity, you can assume that each id starts in state 3 and can change its state each month, either to 1, 2 or nan (which means the id is deleted).

For each id, I want to determine which ids change their states directly from 3 to 2 and how long they remain in 2.

Expected result:

out = pd.Series([2, 3, 2, 1, 0, 0])

I want to achieve this result in pure pandas and beat my solution in terms of code complexity and time.

My solution so far:

import numba


@numba.njit
def _get_duration(l):
    counter = 0
    for i in range(1, len(l)):
        cond = l[i-1] == 3
        # If state remains in 3, just continue
        if cond and l[i] == 3:
            continue
        # If state changes from 3 to 2 set counter
        elif cond and l[i] == 2:
            counter = 1
        # If state remains in 2, increase counter
        elif l[i-1] == 2 and l[i] == 2:
            counter += 1
        else:
            break
    return counter


@numba.njit
def get_stage2_duration(stg):
    N = stg.shape[0]
    return [_get_duration(stg[i]) for i in range(N)]

Yields the following result:

get_stage2_duration(df.values) # [2, 3, 2, 1, 0, 0]
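For checking any vectorized answer against the rule described above, a plain-Python reference (no numba) can be handy. This is only a sketch: the function name `stage2_duration_reference` is made up for illustration, and it assumes, as the question does, that every row starts in state 3.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan]
})


def stage2_duration_reference(row):
    """Hypothetical helper: length of the run of 2s that directly follows the leading 3s."""
    values = list(row)
    i = 0
    # Skip the leading streak of 3s.
    while i < len(values) and values[i] == 3:
        i += 1
    # Assumption: a row that does not start in 3 has no 3 -> 2 transition to count.
    if i == 0:
        return 0
    # Count the 2s that come immediately after the 3s; anything else (1, 3, nan) stops the run.
    count = 0
    while i < len(values) and values[i] == 2:
        count += 1
        i += 1
    return count


reference = pd.Series([stage2_duration_reference(r) for r in df.to_numpy()])
print(reference.tolist())  # [2, 3, 2, 1, 0, 0]
```

Because comparisons with nan are always False, a deleted id simply ends the scan without special handling.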

Answer 1

Score: 2

Another approach using masks.

Starting with the case in which we would count all 2s that follow a 3:

is2 = df.eq(2)
was3 = df.mask(is2).ffill(axis=1).eq(3)
out = (is2 & was3).sum(axis=1)
# [2, 3, 2, 5, 3, 0]

Intermediate out before summing (True is ■, False is □):

  x1 x2 x3 x4 x5 x6 x7
0  □  □  □  □  □  ■  ■
1  □  □  ■  ■  ■  □  □
2  □  ■  ■  □  □  □  □
3  □  ■  □  ■  ■  ■  ■
4  □  □  □  □  ■  ■  ■
5  □  □  □  □  □  □  □
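To see why `was3` answers "was the previous non-2 state a 3?", it may help to look at the intermediate frame. The sketch below rebuilds the question's `df`; the name `last_non2` is introduced here only for illustration.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan]
})

is2 = df.eq(2)
# Blank out every 2, then carry the last remaining state forward along each row.
# Row 4 (3, 1, 2, 3, 2, 2, 2) becomes 3, 1, 1, 3, 3, 3, 3.
last_non2 = df.mask(is2).ffill(axis=1)
was3 = last_non2.eq(3)

print(last_non2.loc[4].tolist())
# A 2 is counted only when the last non-2 state before it was a 3.
print((is2 & was3).sum(axis=1).tolist())
```

The forward fill also writes a 3 into months where the id was deleted (row 2), but `is2` is False there, so those cells never contribute to the sum.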

Now, if we only want the first stretch of 2s, let's mask the others:

was2 = df.shift(axis=1).eq(2)
(is2 & was3).mask((was2 & ~is2).cummax(axis=1)).sum(axis=1)
# [2, 3, 2, 1, 0, 0]

Intermediate out before summing (a blank cell is NaN):

  x1 x2 x3 x4 x5 x6 x7
0  □  □  □  □  □  ■  ■
1  □  □  ■  ■  ■      
2  □  ■  ■            
3  □  ■               
4  □  □  □            
5  □  □  □  □  □  □  □

Alternative:

# is the value a 2?
is2 = df.eq(2)
# is the immediate preceding value a 2?
was2 = is2.shift(axis=1, fill_value=False)
# was the previous non-2 a 3?
was3 = df.mask(is2).ffill(axis=1).eq(3)
# combine conditions
out = ((was3 & is2) & ~(was2 & ~is2).cummax(axis=1)).sum(axis=1)
# [2, 3, 2, 1, 0, 0]

Intermediate out before summing:

  x1 x2 x3 x4 x5 x6 x7
0  □  □  □  □  □  ■  ■
1  □  □  ■  ■  ■  □  □
2  □  ■  ■  □  □  □  □
3  □  ■  □  □  □  □  □
4  □  □  □  □  □  □  □
5  □  □  □  □  □  □  □
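The alternative above can be wrapped into a self-contained function and checked against the expected result from the question. This is a sketch: the names `stage2_duration` and `left_first_run` are not from the answer and are used here only for illustration.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan]
})


def stage2_duration(states: pd.DataFrame) -> pd.Series:
    """Hypothetical wrapper around the mask-based approach."""
    # Is the value a 2?
    is2 = states.eq(2)
    # Is the immediately preceding value a 2?
    was2 = is2.shift(axis=1, fill_value=False)
    # Was the previous non-2 value a 3?
    was3 = states.mask(is2).ffill(axis=1).eq(3)
    # True from the first month in which a run of 2s has ended, and for every later month.
    left_first_run = (was2 & ~is2).cummax(axis=1)
    return (is2 & was3 & ~left_first_run).sum(axis=1)


out = stage2_duration(df)
print(out.tolist())  # [2, 3, 2, 1, 0, 0]
```

Keeping every intermediate as a boolean frame means the final `sum` works on booleans directly and no NaN handling is needed.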

Answer 2

Score: 1

Example

df

   x1  x2   x3   x4   x5   x6   x7
0   3   3  3.0  3.0  3.0  2.0  2.0
1   3   3  2.0  2.0  2.0  1.0  2.0
2   3   2  2.0  1.0  3.0  NaN  NaN
3   3   2  3.0  2.0  2.0  2.0  2.0
4   3   1  2.0  3.0  2.0  2.0  2.0
5   3   3  NaN  NaN  NaN  NaN  NaN

Step 1

Convert the leading streak of 3s in each row to NaN:

df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]

df1

    x1   x2   x3   x4   x5   x6   x7
0  NaN  NaN  NaN  NaN  NaN  2.0  2.0
1  NaN  NaN  2.0  2.0  2.0  1.0  2.0
2  NaN  2.0  2.0  1.0  3.0  NaN  NaN
3  NaN  2.0  3.0  2.0  2.0  2.0  2.0
4  NaN  1.0  2.0  3.0  2.0  2.0  2.0
5  NaN  NaN  NaN  NaN  NaN  NaN  NaN
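The `cumsum(axis=1).ge(1)` part of Step 1 acts as a "sticky" flag: once a non-3 directly follows a 3, every later column stays selected. A small sketch on row 4 alone (the names `left_3_here` and `sticky` are introduced here for illustration):

```python
import pandas as pd

# Row 4 of the example frame, as a one-row DataFrame.
row = pd.DataFrame([[3, 1, 2, 3, 2, 2, 2]],
                   columns=[f"x{i}" for i in range(1, 8)])

# True where a non-3 value directly follows a 3 (here at x2 and x5).
left_3_here = row.ne(3) & row.shift(1, axis=1).eq(3)
# The running count is >= 1 from the first such position onwards.
sticky = left_3_here.cumsum(axis=1).ge(1)

print(left_3_here.iloc[0].tolist())
print(sticky.iloc[0].tolist())
# Boolean indexing keeps the selected cells and turns the rest into NaN.
print(row[sticky].iloc[0].tolist())
```

Only x1 ends up as NaN for this row, which matches row 4 of `df1` above.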

Step 2

Find the cells of df1 that are neither NaN nor 2.

Set the first such cell in each row, and everything after it, to NaN:

df2 = df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)]

df2

    x1   x2   x3   x4   x5   x6   x7
0  NaN  NaN  NaN  NaN  NaN  2.0  2.0
1  NaN  NaN  2.0  2.0  2.0  NaN  NaN
2  NaN  2.0  2.0  NaN  NaN  NaN  NaN
3  NaN  2.0  NaN  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Final

Count the non-null values in each row:

df2.count(axis=1)

Output:

0    2
1    3
2    2
3    1
4    0
5    0

Full Code

df2 was introduced only to make the steps easier to follow; the full code is:

df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]
df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)].count(axis=1)
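As a quick check, the two steps can be put into one function and run on the question's data. This is a sketch: the function name `stage2_duration_by_filtering` and the intermediate names are made up for illustration.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan]
})


def stage2_duration_by_filtering(states: pd.DataFrame) -> pd.Series:
    """Hypothetical wrapper around the two filtering steps."""
    # Step 1: keep only the cells from the first departure from 3 onwards.
    left_3 = (states.ne(3) & states.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)
    after_3 = states[left_3]
    # Step 2: drop everything from the first cell that is neither 2 nor NaN.
    still_in_2 = (~(after_3.eq(2) | after_3.isna())).cumsum(axis=1).lt(1)
    # Whatever survives is the first run of 2s; count it per row.
    return after_3[still_in_2].count(axis=1)


result = stage2_duration_by_filtering(df)
print(result.tolist())  # [2, 3, 2, 1, 0, 0]
```

Row 4 shows why Step 2 is needed: it leaves 3 for state 1 first, so the 1 at x2 cuts the row off before any 2 is counted.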
