Determine transitions and durations across rows in a data frame

Question

Consider the following data frame:

    import pandas as pd
    from numpy import nan

    df = pd.DataFrame(data={
        "x1": [3, 3, 3, 3, 3, 3],
        "x2": [3, 3, 2, 2, 1, 3],
        "x3": [3, 2, 2, 3, 2, nan],
        "x4": [3, 2, 1, 2, 3, nan],
        "x5": [3, 2, 3, 2, 2, nan],
        "x6": [2, 1, nan, 2, 2, nan],
        "x7": [2, 2, nan, 2, 2, nan]
    })

Each row represents an id and each column a state for a given month. For the sake of simplicity, you can assume that each id starts in state 3 and can change its state each month, either to 1, 2 or nan (which means the id is deleted).

For each id, I want to determine which ids change their states directly from 3 to 2 and how long they remain in 2.

Expected result:

    out = pd.Series([2, 3, 2, 1, 0, 0])

I want to achieve this result in pure pandas and beat my solution in terms of code complexity and time.

My solution so far:

    import numba

    @numba.njit
    def _get_duration(l):
        counter = 0
        for i in range(1, len(l)):
            cond = l[i-1] == 3
            # If state remains in 3, just continue
            if cond and l[i] == 3:
                continue
            # If state changes from 3 to 2 set counter
            elif cond and l[i] == 2:
                counter = 1
            # If state remains in 2, increase counter
            elif l[i-1] == 2 and l[i] == 2:
                counter += 1
            else:
                break
        return counter

    @numba.njit
    def get_stage2_duration(stg):
        N = stg.shape[0]
        return [_get_duration(stg[i]) for i in range(N)]

Yields the following result:

    get_stage2_duration(df.values)  # [2, 3, 2, 1, 0, 0]
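For checking candidate answers without installing numba, the rule in the question can be restated as "skip the leading 3s, then count the 2s that follow immediately". The sketch below is only a reference check in plain Python, not a faster solution. The names `stage2_duration` and `rows` are made up for illustration, and it assumes every row starts in state 3, as stated in the question.

```python
from itertools import dropwhile, takewhile
from math import nan


def stage2_duration(row):
    """Count the 2s that directly follow the leading run of 3s."""
    # Drop the initial streak of 3s (assumption: every row starts in state 3).
    after_threes = dropwhile(lambda v: v == 3, row)
    # Count consecutive 2s; anything else (1, 3 or NaN) ends the count.
    return sum(1 for _ in takewhile(lambda v: v == 2, after_threes))


# The same data as df in the question, written row by row.
rows = [
    [3, 3, 3, 3, 3, 2, 2],
    [3, 3, 2, 2, 2, 1, 2],
    [3, 2, 2, 1, 3, nan, nan],
    [3, 2, 3, 2, 2, 2, 2],
    [3, 1, 2, 3, 2, 2, 2],
    [3, 3, nan, nan, nan, nan, nan],
]

durations = [stage2_duration(r) for r in rows]
print(durations)  # [2, 3, 2, 1, 0, 0]
```

NaN compares unequal to everything, so a deleted id simply ends the count.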

Answer 1

Score: 2

Another approach using masks.

Starting with the case in which we would count all 2s that follow a 3:

    is2 = df.eq(2)
    was3 = df.mask(is2).ffill(axis=1).eq(3)
    out = (is2 & was3).sum(axis=1)
    # [2, 3, 2, 5, 3, 0]

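Intermediate result before summing:

          x1     x2     x3     x4     x5     x6     x7
    0  False  False  False  False  False   True   True
    1  False  False   True   True   True  False  False
    2  False   True   True  False  False  False  False
    3  False   True  False   True   True   True   True
    4  False  False  False  False   True   True   True
    5  False  False  False  False  False  False  False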
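To see what `df.mask(is2).ffill(axis=1)` contributes, it can help to run it on a single row. This is a small sketch using row 4 of the question's data; the variable names `row` and `last_non2` are mine, not the answer's.

```python
import pandas as pd

# Row 4 of the question's data: 3 -> 1 -> 2 -> 3 -> 2 -> 2 -> 2
row = pd.DataFrame([[3.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0]],
                   columns=[f"x{i}" for i in range(1, 8)])

is2 = row.eq(2)
# Hide the 2s, then carry the last visible (non-2) value forward.
last_non2 = row.mask(is2).ffill(axis=1)
was3 = last_non2.eq(3)

print(last_non2.iloc[0].tolist())     # [3.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
print((is2 & was3).iloc[0].tolist())  # [False, False, False, False, True, True, True]
```

The 2 in `x3` is not counted because the last non-2 value before it is 1. The later 2s are still counted at this stage, which is why this first version returns 3 for the row instead of 0.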

Now, if we only want the first stretch of 2s, we can mask the others:

    was2 = df.shift(axis=1).eq(2)
    (is2 & was3).mask((was2 & ~is2).cummax(axis=1)).sum(axis=1)
    # [2, 3, 2, 1, 0, 0]

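Intermediate result before summing (masked positions are NaN):

          x1     x2     x3     x4     x5     x6     x7
    0  False  False  False  False  False   True   True
    1  False  False   True   True   True    NaN    NaN
    2  False   True   True    NaN    NaN    NaN    NaN
    3  False   True    NaN    NaN    NaN    NaN    NaN
    4  False  False  False    NaN    NaN    NaN    NaN
    5  False  False  False  False  False  False  False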
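The cut-off comes from `(was2 & ~is2).cummax(axis=1)`: the expression is True at the first column where a run of 2s has just ended, and `cummax` keeps it True for the rest of the row. A one-row sketch, using row 3 of the question's data with illustrative names `run_ended` and `cutoff`:

```python
import pandas as pd

# Row 3 of the question's data: 3 -> 2 -> 3 -> 2 -> 2 -> 2 -> 2
row = pd.DataFrame([[3.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0]])

is2 = row.eq(2)
was2 = row.shift(axis=1).eq(2)   # was the previous month a 2?
run_ended = was2 & ~is2          # previous month was 2, this month is not
cutoff = run_ended.cummax(axis=1)  # once True, stays True to the end of the row

print(run_ended.iloc[0].tolist())  # [False, False, True, False, False, False, False]
print(cutoff.iloc[0].tolist())     # [False, False, True, True, True, True, True]
```

Everything from `x3` onward is therefore masked, leaving only the single 2 in `x2`.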

Alternative:

    # is the value a 2?
    is2 = df.eq(2)
    # is the immediate preceding value a 2?
    was2 = is2.shift(axis=1, fill_value=False)
    # was the previous non-2 a 3?
    was3 = df.mask(is2).ffill(axis=1).eq(3)
    # combine conditions
    out = ((was3 & is2) & ~(was2 & ~is2).cummax(axis=1)).sum(axis=1)
    # [2, 3, 2, 1, 0, 0]

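Intermediate result before summing:

          x1     x2     x3     x4     x5     x6     x7
    0  False  False  False  False  False   True   True
    1  False  False   True   True   True  False  False
    2  False   True   True  False  False  False  False
    3  False   True  False  False  False  False  False
    4  False  False  False  False  False  False  False
    5  False  False  False  False  False  False  False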
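One edge case that is not in the sample data may be worth checking before adopting the mask approach: a row that leaves state 3 for state 1, returns to 3, and only then moves to 2. The sketch below wraps the alternative in a function (`stage2_duration_masks` is a name I made up) and compares it with a plain-Python port of the loop from the question (`loop_duration`, also hypothetical).

```python
import pandas as pd
from numpy import nan


def stage2_duration_masks(df):
    """The mask-based alternative from this answer, wrapped in a function."""
    is2 = df.eq(2)
    was2 = is2.shift(axis=1, fill_value=False)
    was3 = df.mask(is2).ffill(axis=1).eq(3)
    first_run_over = (was2 & ~is2).cummax(axis=1)
    return (was3 & is2 & ~first_run_over).sum(axis=1)


def loop_duration(values):
    """Plain-Python port of the loop in the question (no numba)."""
    counter = 0
    for prev, cur in zip(values, values[1:]):
        if prev == 3 and cur == 3:
            continue
        elif prev == 3 and cur == 2:
            counter = 1
        elif prev == 2 and cur == 2:
            counter += 1
        else:
            break
    return counter


edge = pd.DataFrame([
    [3.0, 1.0, 3.0, 2.0, 2.0],   # leaves 3 for 1 first, reaches 2 later
    [3.0, 2.0, 2.0, 1.0, nan],   # ordinary case
])

mask_result = stage2_duration_masks(edge).tolist()
loop_result = [loop_duration(r) for r in edge.values.tolist()]
print(mask_result)  # [2, 2]
print(loop_result)  # [0, 2]
```

The two versions disagree on the first row: the loop stops at the 3 to 1 change, while the masks still count the later stretch of 2s because it is preceded by a 3. Which behaviour is wanted depends on how "directly from 3 to 2" is meant. On the six rows in the question both give the same result.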

Answer 2

Score: 1

Example

df

       x1  x2   x3   x4   x5   x6   x7
    0   3   3  3.0  3.0  3.0  2.0  2.0
    1   3   3  2.0  2.0  2.0  1.0  2.0
    2   3   2  2.0  1.0  3.0  NaN  NaN
    3   3   2  3.0  2.0  2.0  2.0  2.0
    4   3   1  2.0  3.0  2.0  2.0  2.0
    5   3   3  NaN  NaN  NaN  NaN  NaN

Step 1

Convert the initial streak of 3s in each row to NaN, keeping everything from the first change away from 3 onward:

    df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]

df1

       x1   x2   x3   x4   x5   x6   x7
    0 NaN  NaN  NaN  NaN  NaN  2.0  2.0
    1 NaN  NaN  2.0  2.0  2.0  1.0  2.0
    2 NaN  2.0  2.0  1.0  3.0  NaN  NaN
    3 NaN  2.0  3.0  2.0  2.0  2.0  2.0
    4 NaN  1.0  2.0  3.0  2.0  2.0  2.0
    5 NaN  NaN  NaN  NaN  NaN  NaN  NaN
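Step 1 relies on a common idiom: mark the event of interest with a boolean, then use `cumsum(axis=1).ge(1)` to turn it into "from the first event onward". A one-row sketch with row 0 of the data (the names `left_3` and `keep` are only for illustration):

```python
import pandas as pd

# Row 0 of the data: five months in state 3, then two months in state 2
row = pd.DataFrame([[3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0]])

# True where the value is not 3 but the previous month was 3
left_3 = row.ne(3) & row.shift(1, axis=1).eq(3)
# Running count of such events; >= 1 means "at or after the first one"
keep = left_3.cumsum(axis=1).ge(1)

print(left_3.iloc[0].tolist())  # [False, False, False, False, False, True, False]
print(keep.iloc[0].tolist())    # [False, False, False, False, False, True, True]
```

Indexing with `row[keep]` then replaces the leading 3s with NaN and leaves the rest untouched. Step 2 uses the mirrored form, `cumsum(axis=1).lt(1)`, to keep only what comes before the first event.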

Step 2

Find the values in df1 that are neither NaN nor 2, then set everything from the first such value onward to NaN:

    df2 = df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)]

df2

       x1   x2   x3   x4   x5   x6   x7
    0 NaN  NaN  NaN  NaN  NaN  2.0  2.0
    1 NaN  NaN  2.0  2.0  2.0  NaN  NaN
    2 NaN  2.0  2.0  NaN  NaN  NaN  NaN
    3 NaN  2.0  NaN  NaN  NaN  NaN  NaN
    4 NaN  NaN  NaN  NaN  NaN  NaN  NaN
    5 NaN  NaN  NaN  NaN  NaN  NaN  NaN

Final

Count the non-null values in each row:

    df2.count(axis=1)

Output:

    0    2
    1    3
    2    2
    3    1
    4    0
    5    0

Full code

df2 was created only to make the steps easier to follow. The full code is:

    df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]
    df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)].count(axis=1)
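Since the sample has only six rows, a randomized comparison against a simple reference can give a bit more confidence. The sketch below is my own test harness, not part of the answer: `two_step`, `reference` and the data generator are hypothetical names, and the generated data follows the question's assumptions (each row starts in 3, states are 1, 2 or 3, and once an id is deleted it stays NaN).

```python
from itertools import dropwhile, takewhile

import numpy as np
import pandas as pd


def two_step(df):
    """The full code from this answer, wrapped in a function."""
    df1 = df[(df.ne(3) & df.shift(1, axis=1).eq(3)).cumsum(axis=1).ge(1)]
    return df1[(~(df1.eq(2) | df1.isna())).cumsum(axis=1).lt(1)].count(axis=1)


def reference(row):
    """Skip the leading 3s, then count the 2s that follow immediately."""
    rest = dropwhile(lambda v: v == 3, row)
    return sum(1 for _ in takewhile(lambda v: v == 2, rest))


rng = np.random.default_rng(0)
n_rows, n_cols = 200, 8

# Random states, with every id starting in state 3.
data = rng.choice([1.0, 2.0, 3.0], size=(n_rows, n_cols))
data[:, 0] = 3.0

# Delete each id from a random month onward (n_cols means "never deleted").
cut = rng.integers(1, n_cols + 1, size=n_rows)
data[np.arange(n_cols) >= cut[:, None]] = np.nan

df = pd.DataFrame(data)
expected = [reference(r) for r in data]
agree = two_step(df).tolist() == expected
print(agree)  # True
```

If rows can violate those assumptions, for example an id that reappears after a NaN, the two functions are not guaranteed to agree, so the generator would need to be adapted to the real data.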

huangapple
  • Published on May 13, 2023, 18:47:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76242318.html