Determine transitions and durations across rows in a data frame

Question

Consider the following data frame:

import pandas as pd
from numpy import nan

df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan]
})

Each row represents an id and each column a state for a given month. For the sake of simplicity, you can assume that each id starts in state 3 and can change its state each month, either to 1, 2 or nan (which means the id is deleted).

I want to determine which ids change their state directly from 3 to 2, and how many months they then remain in 2. Ids that never make that direct transition should get 0.

Expected result:

out = pd.Series([2, 3, 2, 1, 0, 0])

I want to achieve this result in pure pandas and beat my solution in terms of code complexity and time.

My solution so far:

import numba

@numba.njit
def _get_duration(l):
    counter = 0
    for i in range(1, len(l)):
        cond = l[i-1] == 3
        # If state remains in 3, just continue
        if cond and l[i] == 3:
            continue
        # If state changes from 3 to 2 set counter
        elif cond and l[i] == 2:
            counter = 1
        # If state remains in 2, increase counter
        elif l[i-1] == 2 and l[i] == 2:
            counter +=1
        else:
            break
    return counter


@numba.njit
def get_stage2_duration(stg):
    N = stg.shape[0]
    return [_get_duration(stg[i]) for i in range(N)]

This yields the following result:

get_stage2_duration(df.values) # [2, 3, 2, 1, 0, 0]
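
For readers who want to check the rule without installing numba, here is a minimal plain-Python sketch of the same loop. The function name `stage2_duration_loop` is an assumption made for illustration; the branches mirror the ones in `_get_duration` above, and the frame is cast to float only so that every row has a single dtype.

```python
import pandas as pd
from numpy import nan


def stage2_duration_loop(values):
    """Plain-Python version of the numba loop (same branches, same order)."""
    result = []
    for row in values:
        counter = 0
        # Walk over consecutive (previous, current) pairs of monthly states.
        for prev, cur in zip(row[:-1], row[1:]):
            if prev == 3 and cur == 3:
                continue            # still in state 3
            elif prev == 3 and cur == 2:
                counter = 1         # direct transition 3 -> 2
            elif prev == 2 and cur == 2:
                counter += 1        # still in state 2
            else:
                break               # anything else (1, NaN, 2 -> 3) ends the scan
        result.append(counter)
    return result


df = pd.DataFrame({
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
}).astype(float)

durations = stage2_duration_loop(df.to_numpy())
print(durations)  # [2, 3, 2, 1, 0, 0]
```

This is much slower than the compiled version, but it gives a dependency-free baseline to compare any vectorised answer against.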

Answer 1

Score: 2

Another approach using masks.

Starting with the case in which we would count all 2s that follow a 3:

is2 = df.eq(2)
was3 = df.mask(is2).ffill(axis=1).eq(3)
out = (is2 & was3).sum(axis=1)
# [2, 3, 2, 5, 3, 0]

Intermediate out before summing (True is ■, False is □):

  x1 x2 x3 x4 x5 x6 x7
0  □  □  □  □  □  ■  ■
1  □  □  ■  ■  ■  □  □
2  □  ■  ■  □  □  □  □
3  □  ■  □  ■  ■  ■  ■
4  □  □  □  □  ■  ■  ■
5  □  □  □  □  □  □  □
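
To see why `was3` answers "was the previous non-2 value a 3?", it can help to split the chained expression into its steps. The sketch below does that on the question's data; the intermediate names `without_2s` and `last_non2` are assumptions introduced for illustration, and the cast to float is only a precaution so that all columns share one dtype.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
}).astype(float)

is2 = df.eq(2)

# Step 1: blank out every 2, so only the non-2 states remain.
without_2s = df.mask(is2)

# Step 2: forward-fill along each row; every cell now holds the most
# recent non-2 state seen so far in that row.
last_non2 = without_2s.ffill(axis=1)

# Step 3: a 2 "follows a 3" when the most recent non-2 state is 3.
was3 = last_non2.eq(3)
counts = (is2 & was3).sum(axis=1)

print(last_non2.loc[4].tolist())  # [3.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
print(counts.tolist())            # [2, 3, 2, 5, 3, 0]
```

In row 4 the 2 in x3 is not counted because the last non-2 state before it is 1, while the 2s in x5 to x7 are counted because a 3 precedes them.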

Now, if we only want the first stretch of 2s, let's mask the others:

was2 = df.shift(axis=1).eq(2)
(is2 & was3).mask((was2 & ~is2).cummax(axis=1)).sum(axis=1)
# [2, 3, 2, 1, 0, 0]

Intermediate out before summing (no character is NaN):

  x1 x2 x3 x4 x5 x6 x7
0  □  □  □  □  □  ■  ■
1  □  □  ■  ■  ■      
2  □  ■  ■            
3  □  ■               
4  □  □  □            
5  □  □  □  □  □  □  □
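
The masking step relies on `cummax` turning a single "the run of 2s just ended" flag into a flag for that column and every column after it. Here is a small sketch of that part in isolation, assuming the question's data; the names `run_ended` and `stop` are hypothetical.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
}).astype(float)

is2 = df.eq(2)
was2 = df.shift(axis=1).eq(2)   # was the previous month a 2?

# True exactly where a run of 2s ends: previous month was 2, this one is not.
run_ended = was2 & ~is2

# cummax along the row keeps the flag switched on from the first end onward.
stop = run_ended.cummax(axis=1)

print(stop.loc[3].tolist())       # row 3 is 3,2,3,2,2,2,2: flagged from x3 on
print(stop.sum(axis=1).tolist())  # number of masked columns per row
```

Everything flagged in `stop` is what the `.mask(...)` call above turns into NaN, so only the first stretch of 2s contributes to the sum.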

Alternative:

# is the value a 2?
is2 = df.eq(2)
# is the immediate preceding value a 2?
was2 = is2.shift(axis=1, fill_value=False)
# was the previous non-2 a 3?
was3 = df.mask(is2).ffill(axis=1).eq(3)
# combine conditions
out = ((was3 & is2) & ~(was2 & ~is2).cummax(axis=1)).sum(axis=1)
# [2, 3, 2, 1, 0, 0]

Intermediate out before summing:

  x1 x2 x3 x4 x5 x6 x7
0  □  □  □  □  □  ■  ■
1  □  □  ■  ■  ■  □  □
2  □  ■  ■  □  □  □  □
3  □  ■  □  □  □  □  □
4  □  □  □  □  □  □  □
5  □  □  □  □  □  □  □
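
One case that the sample data does not cover is a row that leaves state 3 for something other than 2 and only later goes 3 to 2, for example 3, 1, 3, 2, 2. Traced by hand, the question's loop stops at the 1 and returns 0, whereas the masks above count the later 2s. The sketch below wraps the alternative in a function and adds a variant that also stops at the first value that is neither 2 nor 3. Both function names and the `other` mask are assumptions of mine, not part of the original answer.

```python
import pandas as pd
from numpy import nan


def first_run_of_2s(df):
    """The 'alternative' masks from the answer, wrapped in a function."""
    is2 = df.eq(2)
    was2 = is2.shift(axis=1, fill_value=False)
    was3 = df.mask(is2).ffill(axis=1).eq(3)
    stop = (was2 & ~is2).cummax(axis=1)
    return (is2 & was3 & ~stop).sum(axis=1)


def first_run_of_2s_strict(df):
    """Hypothetical variant: also stop at the first value that is not 2 or 3."""
    is2 = df.eq(2)
    was2 = is2.shift(axis=1, fill_value=False)
    was3 = df.mask(is2).ffill(axis=1).eq(3)
    other = ~is2 & df.ne(3)                      # 1, NaN, or any other state
    stop = ((was2 & ~is2) | other).cummax(axis=1)
    return (is2 & was3 & ~stop).sum(axis=1)


df = pd.DataFrame({
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
}).astype(float)

edge = pd.DataFrame([[3, 1, 3, 2, 2]], columns=["x1", "x2", "x3", "x4", "x5"], dtype=float)

print(first_run_of_2s(df).tolist())           # [2, 3, 2, 1, 0, 0]
print(first_run_of_2s(edge).tolist())         # [2]
print(first_run_of_2s_strict(edge).tolist())  # [0]
```

Whether the strict behaviour is wanted depends on how "directly from 3 to 2" should be read for ids that pass through state 1 first; on the question's own data both versions agree.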

Answer 2

Score: 1

Example

df

	x1	x2	x3	x4	x5	x6	x7
0	3	3	3.0	3.0	3.0	2.0	2.0
1	3	3	2.0	2.0	2.0	1.0	2.0
2	3	2	2.0	1.0	3.0	NaN	NaN
3	3	2	3.0	2.0	2.0	2.0	2.0
4	3	1	2.0	3.0	2.0	2.0	2.0
5	3	3	NaN	NaN	NaN	NaN	NaN

Step 1

Convert the leading streak of 3s in each row to NaN:

df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]

df1

	x1	x2	x3	x4	x5	x6	x7
0	NaN	NaN	NaN	NaN	NaN	2.0	2.0
1	NaN	NaN	2.0	2.0	2.0	1.0	2.0
2	NaN	2.0	2.0	1.0	3.0	NaN	NaN
3	NaN	2.0	3.0	2.0	2.0	2.0	2.0
4	NaN	1.0	2.0	3.0	2.0	2.0	2.0
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Step 2

Find the first value in df1 that is neither NaN nor 2, and set it and everything after it to NaN:

df2 = df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)]

df2

	x1	x2	x3	x4	x5	x6	x7
0	NaN	NaN	NaN	NaN	NaN	2.0	2.0
1	NaN	NaN	2.0	2.0	2.0	NaN	NaN
2	NaN	2.0	2.0	NaN	NaN	NaN	NaN
3	NaN	2.0	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN
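
Step 2 uses the same running-count idea in the opposite direction: it keeps cells only while no "run breaker" has been seen yet. The sketch below spells this out; `breaks_run` and `still_in_run` are names assumed for illustration. Note that NaN is deliberately not treated as a breaker, which is harmless as long as NaN only appears at the end of a row, as the question states; a NaN between two 2s would not end the run here.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
}).astype(float)

# Step 1 from the answer: drop the leading streak of 3s.
df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]

# A cell breaks the run when it is an actual state other than 2.
breaks_run = ~(df1.eq(2) | df1.isna())

# Keep cells only while the running count of breakers is still zero.
still_in_run = breaks_run.cumsum(axis=1).lt(1)
df2 = df1[still_in_run]

print(breaks_run.loc[1].tolist())   # row 1: only the 1 in x6 is a breaker
print(df2.count(axis=1).tolist())   # [2, 3, 2, 1, 0, 0]
```

Because the breaker itself already raises the running count to 1, it is removed together with everything after it.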

Final

Count the non-null values in each row:

df2.count(axis=1)

Output:

0    2
1    3
2    2
3    1
4    0
5    0

Full code

df2 was only introduced to make the steps easier to follow; the full code is:

df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]
df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)].count(axis=1)
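
Since the question also asks about running time, here is a rough sketch of how the two lines could be wrapped in a function and timed on a larger frame. The function name `stage2_duration_pandas`, the repetition factor of 1000 and the number of timing runs are all arbitrary assumptions; no timings are claimed here, and results will vary by machine and pandas version.

```python
import timeit

import pandas as pd
from numpy import nan


def stage2_duration_pandas(df):
    """Answer 2's two steps, wrapped in a function."""
    started = (df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)
    df1 = df[started]
    still_in_run = (~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)
    return df1[still_in_run].count(axis=1)


df = pd.DataFrame({
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
}).astype(float)

# Stack the sample frame 1000 times to get something worth timing.
big = pd.concat([df] * 1000, ignore_index=True)
result = stage2_duration_pandas(big)

runs = 5
seconds = timeit.timeit(lambda: stage2_duration_pandas(big), number=runs)
print(f"{seconds / runs:.4f} s per call on {len(big)} rows")
```

The same harness can be pointed at `get_stage2_duration(big.values)` to compare against the numba version; remember to call the numba function once beforehand so that compilation time is not included.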

huangapple
  • Posted on May 13, 2023 at 18:47:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76242318.html