Determine transitions and durations across rows in a data frame
Question
Consider the following data frame:
import pandas as pd
from numpy import nan
df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan]
})
Each row represents an id and each column a state for a given month. For the sake of simplicity, you can assume that each id starts in state 3 and can change its state each month, either to 1, 2 or nan (which means the id is deleted).
For each id, I want to determine which ids change their states directly from 3 to 2 and how long they remain in 2.
Expected result:
out = pd.Series([2, 3, 2, 1, 0, 0])
I want to achieve this result in pure pandas and beat my solution in terms of code complexity and time.
My solution so far:
import numba

@numba.njit
def _get_duration(l):
    counter = 0
    for i in range(1, len(l)):
        cond = l[i-1] == 3
        # If state remains in 3, just continue
        if cond and l[i] == 3:
            continue
        # If state changes from 3 to 2 set counter
        elif cond and l[i] == 2:
            counter = 1
        # If state remains in 2, increase counter
        elif l[i-1] == 2 and l[i] == 2:
            counter += 1
        else:
            break
    return counter

@numba.njit
def get_stage2_duration(stg):
    N = stg.shape[0]
    return [_get_duration(stg[i]) for i in range(N)]
This yields the following result:
get_stage2_duration(df.values) # [2, 3, 2, 1, 0, 0]
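For checking candidate solutions without a numba dependency, the same loop can be written in plain Python. This is only a sketch for validation, not a faster alternative; `duration_in_2` and `reference_durations` are hypothetical names that do not appear in the original post.

```python
import pandas as pd
from numpy import nan


def duration_in_2(states):
    """Length of the run of 2s that directly follows the leading run of 3s."""
    counter = 0
    for prev, cur in zip(states[:-1], states[1:]):
        if prev == 3 and cur == 3:
            continue          # still in state 3
        if prev == 3 and cur == 2:
            counter = 1       # direct transition 3 -> 2
        elif prev == 2 and cur == 2:
            counter += 1      # still in state 2
        else:
            break             # any other transition (NaN included) ends the scan
    return counter


def reference_durations(df):
    # Hypothetical helper: apply the scan row by row and keep the frame's index
    return pd.Series([duration_in_2(row) for row in df.to_numpy()], index=df.index)


df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
})

print(reference_durations(df).tolist())  # [2, 3, 2, 1, 0, 0]
```

Comparisons with NaN are always False, so a deleted id falls through to the `else` branch and stops the scan.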
Answer 1
Score: 2
Another approach using masks.
Starting with the case in which we would count all 2s that follow a 3:
is2 = df.eq(2)
was3 = df.mask(is2).ffill(axis=1).eq(3)
out = (is2 & was3).sum(axis=1)
# [2, 3, 2, 5, 3, 0]
Intermediate out before summing (True is ■, False is □):
x1 x2 x3 x4 x5 x6 x7
0 □ □ □ □ □ ■ ■
1 □ □ ■ ■ ■ □ □
2 □ ■ ■ □ □ □ □
3 □ ■ □ ■ ■ ■ ■
4 □ □ □ □ ■ ■ ■
5 □ □ □ □ □ □ □
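To see why rows 3 and 4 are over-counted at this stage, it can help to print the forward-filled frame for just those two rows. A small sketch, assuming the rows are held as float data; the name `last_non2` is introduced here for illustration only.

```python
import pandas as pd

cols = [f"x{i}" for i in range(1, 8)]
rows = pd.DataFrame(
    [[3, 2, 3, 2, 2, 2, 2],   # row 3 of the example
     [3, 1, 2, 3, 2, 2, 2]],  # row 4 of the example
    index=[3, 4], columns=cols, dtype=float,
)

is2 = rows.eq(2)
# Hide the 2s, then carry the last visible state forward along each row
last_non2 = rows.mask(is2).ffill(axis=1)
was3 = last_non2.eq(3)

print(last_non2)
print((is2 & was3).sum(axis=1).tolist())  # [5, 3]
```

Every 2 whose most recent non-2 state is a 3 is counted, including 2s that appear after the id has already left its first stretch of 2s.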
Now, if we only want the first stretch of 2s, we mask everything from the point where that stretch ends:
was2 = df.shift(axis=1).eq(2)
(is2 & was3).mask((was2 & ~is2).cummax(axis=1)).sum(axis=1)
# [2, 3, 2, 1, 0, 0]
Intermediate out before summing (blank cells are NaN):
x1 x2 x3 x4 x5 x6 x7
0 □ □ □ □ □ ■ ■
1 □ □ ■ ■ ■
2 □ ■ ■
3 □ ■
4 □ □ □
5 □ □ □ □ □ □ □
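The masking condition can be inspected on a single row. The sketch below uses row 3 of the example as float data; `left_2` and `after_first_run` are illustrative names, not part of the answer.

```python
import pandas as pd

cols = [f"x{i}" for i in range(1, 8)]
row = pd.DataFrame([[3, 2, 3, 2, 2, 2, 2]], index=[3], columns=cols, dtype=float)

is2 = row.eq(2)
was2 = row.shift(axis=1).eq(2)

# True exactly where the id has just left state 2
left_2 = was2 & ~is2
# cummax latches the flag: once True, it stays True for the rest of the row
after_first_run = left_2.cummax(axis=1)

print(left_2.iloc[0].tolist())
# [False, False, True, False, False, False, False]
print(after_first_run.iloc[0].tolist())
# [False, False, True, True, True, True, True]
```

Everything under the latched flag is discarded, so the later 2s in x4 to x7 no longer contribute to the count.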
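Alternative, combining the conditions with boolean operators instead of mask, so the intermediate contains no NaN: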
# is the value a 2?
is2 = df.eq(2)
# is the immediate preceding value a 2?
was2 = is2.shift(axis=1, fill_value=False)
# was the previous non-2 a 3?
was3 = df.mask(is2).ffill(axis=1).eq(3)
# combine conditions
out = ((was3 & is2) & ~(was2 & ~is2).cummax(axis=1)).sum(axis=1)
# [2, 3, 2, 1, 0, 0]
Intermediate out before summing:
x1 x2 x3 x4 x5 x6 x7
0 □ □ □ □ □ ■ ■
1 □ □ ■ ■ ■ □ □
2 □ ■ ■ □ □ □ □
3 □ ■ □ □ □ □ □
4 □ □ □ □ □ □ □
5 □ □ □ □ □ □ □
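For reuse, the boolean version can be wrapped in a function and checked against the expected result from the question. This is a sketch; the function name `first_2_run_length` and the intermediate name `ended` are assumptions made for this example.

```python
import pandas as pd
from numpy import nan


def first_2_run_length(df):
    """Count the 2s in the first stretch of 2s that directly follows a 3."""
    is2 = df.eq(2)                                # is the value a 2?
    was2 = is2.shift(axis=1, fill_value=False)    # was the previous value a 2?
    was3 = df.mask(is2).ffill(axis=1).eq(3)       # was the previous non-2 a 3?
    ended = (was2 & ~is2).cummax(axis=1)          # has a stretch of 2s already ended?
    return (was3 & is2 & ~ended).sum(axis=1)


df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
})

out = first_2_run_length(df)
print(out.tolist())  # [2, 3, 2, 1, 0, 0]
```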
Answer 2
Score: 1
Example
df
x1 x2 x3 x4 x5 x6 x7
0 3 3 3.0 3.0 3.0 2.0 2.0
1 3 3 2.0 2.0 2.0 1.0 2.0
2 3 2 2.0 1.0 3.0 NaN NaN
3 3 2 3.0 2.0 2.0 2.0 2.0
4 3 1 2.0 3.0 2.0 2.0 2.0
5 3 3 NaN NaN NaN NaN NaN
Step 1
Convert the leading streak of 3s in each row to NaN:
df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]
df1
x1 x2 x3 x4 x5 x6 x7
0 NaN NaN NaN NaN NaN 2.0 2.0
1 NaN NaN 2.0 2.0 2.0 1.0 2.0
2 NaN 2.0 2.0 1.0 3.0 NaN NaN
3 NaN 2.0 3.0 2.0 2.0 2.0 2.0
4 NaN 1.0 2.0 3.0 2.0 2.0 2.0
5 NaN NaN NaN NaN NaN NaN NaN
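The `cumsum(axis=1).ge(1)` idiom in Step 1 acts as a latch: it turns True at the first column where the row leaves state 3 and stays True afterwards. A one-row sketch using row 1 of the example as float data; `left_3` and `latch` are illustrative names.

```python
import pandas as pd

cols = [f"x{i}" for i in range(1, 8)]
row = pd.DataFrame([[3, 3, 2, 2, 2, 1, 2]], index=[1], columns=cols, dtype=float)

# True where the value is not 3 but the value to its left is 3
left_3 = row.ne(3) & row.shift(1, axis=1).eq(3)
# A running count >= 1 means the first departure from 3 has already happened
latch = left_3.cumsum(axis=1).ge(1)

print(left_3.iloc[0].tolist())
# [False, False, True, False, False, False, False]
print(latch.iloc[0].tolist())
# [False, False, True, True, True, True, True]

# Indexing with the boolean frame keeps values where latch is True, NaN elsewhere
print(row[latch])
```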
Step 2
Find the first value in df1 that is neither NaN nor 2, and set it and everything after it to NaN:
df2 = df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)]
df2
x1 x2 x3 x4 x5 x6 x7
0 NaN NaN NaN NaN NaN 2.0 2.0
1 NaN NaN 2.0 2.0 2.0 NaN NaN
2 NaN 2.0 2.0 NaN NaN NaN NaN
3 NaN 2.0 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN
Final
Count the non-null values in each row:
df2.count(axis=1)
Output:
0 2
1 3
2 2
3 1
4 0
5 0
Full code
df2 was only created to make the steps easier to follow; the full code is:
df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]
df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)].count(axis=1)
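The two steps can also be packaged as a function and checked against the expected result from the question. A sketch under the same assumptions as the answer; the names `first_2_run_length_cumsum`, `left_3` and `blocker` are made up for this example.

```python
import pandas as pd
from numpy import nan


def first_2_run_length_cumsum(df):
    """Two-step cumsum approach: drop the leading 3s, then cut at the first non-2."""
    # Step 1: keep only values from the first departure from state 3 onwards
    left_3 = df.ne(3) & df.shift(1, axis=1).eq(3)
    df1 = df[left_3.cumsum(axis=1).ge(1)]
    # Step 2: blank out everything from the first value that is neither 2 nor NaN
    blocker = ~(df1.eq(2) | df1.isna())
    df2 = df1[blocker.cumsum(axis=1).lt(1)]
    # What remains are the 2s of the first stretch; count them per row
    return df2.count(axis=1)


df = pd.DataFrame(data={
    "x1": [3, 3, 3, 3, 3, 3],
    "x2": [3, 3, 2, 2, 1, 3],
    "x3": [3, 2, 2, 3, 2, nan],
    "x4": [3, 2, 1, 2, 3, nan],
    "x5": [3, 2, 3, 2, 2, nan],
    "x6": [2, 1, nan, 2, 2, nan],
    "x7": [2, 2, nan, 2, 2, nan],
})

result = first_2_run_length_cumsum(df)
print(result.tolist())  # [2, 3, 2, 1, 0, 0]
```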