Length of first sequence of zeros of given size after certain column in pandas dataframe

Question

Suppose I have a dataframe like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  day_90
-------------------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156   47
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231   35
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78   16
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5   3

where the last column [day_90] holds, for each row, the column ([0] - [90]) at which the running sum reaches 90% of [total]. To clarify, take the first row as an example: by column 47, ID A has accumulated 90% of the 156 events it will reach in 90 days.

What I need is: for each row, the length of the first sequence of 0s that is longer than 7 (or any predefined number). So, for example: for the first row, I want to know how long the first sequence of zeros after column 47 is, but only if the sequence exceeds 7 zeros in a row. If there are 6 zeros and then a non-zero, I don't want to count it.

Finally, I want to store this result in a new column after [day_90]. So if ID A has a sequence of 10 zeros right after column 47, I want to add a new column [0_sequence] that holds the value of 10 for that ID.

I really have no idea where to start. Any help is appreciated =)

Answer 1

Score: 5

Your problem is basically a variant of the island-and-gap problem: a non-zero creates a new "island" while a 0 extends the current island. You want to find the first island that has at least a certain size. Before I answer your question, let me walk you through a minified version of the problem.

Let's say you have a Series:

>>> a = pd.Series([0,0,0,13,0,0,4,12,0,0])
0     0
1     0
2     0
3    13
4     0
5     0
6     4
7    12
8     0
9     0

And you want to find the length of the first sequence of 0s that is at least 3 elements long. Let's first assign the elements to "islands":

# Every time the number is non-zero, a new "island" is created
>>> b = (a != 0).cumsum()
0    0  <-- island 0
1    0
2    0
3    1  <-- island 1
4    1
5    1
6    2  <-- island 2
7    3  <-- island 3
8    3
9    3

For each island, we are only interested in elements that are equal to 0:

>>> c = b[a == 0]
0    0
1    0
2    0
4    1
5    1
8    3
9    3

Now let's determine the size of each island:

>>> d = c.groupby(c).count()
0    3  <-- island 0 is of size 3
1    2  <-- island 1 is of size 2
3    2  <-- island 3 is of size 2
dtype: int64

And filter for islands whose size >= 3:

>>> e = d[d >= 3]
0    3

The answer is the first element of e (island 0, size 3) if e is not empty. Otherwise, there's no island meeting our criteria.
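
The steps a through e above can be bundled into one helper. This is only a sketch of the walkthrough, not part of the original answer; the name `first_zero_run` and the choice to return 0 when no island qualifies are assumptions made for illustration.

```python
import pandas as pd


def first_zero_run(values: pd.Series, n: int) -> int:
    """Length of the first run of zeros with at least n elements, or 0."""
    is_zero = values == 0
    # Every non-zero value starts a new island; zeros inherit its label.
    island = (~is_zero).cumsum()
    # Keep only the zeros, then count how many fall into each island.
    zeros_only = island[is_zero]
    sizes = zeros_only.groupby(zeros_only).count()
    # Island labels increase left to right, so the first match is the earliest.
    long_enough = sizes[sizes >= n]
    return int(long_enough.iloc[0]) if len(long_enough) else 0


a = pd.Series([0, 0, 0, 13, 0, 0, 4, 12, 0, 0])
print(first_zero_run(a, 3))  # → 3
print(first_zero_run(a, 4))  # → 0
```

Returning 0 for "no match" keeps the result an integer; returning `None` or `NaN` instead would be an equally reasonable choice.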


First Attempt

And applying it to your problem:

import numpy as np

def count_sequence_length(row, n):
    """Return the length of the first sequence of 0s
    after the column in `day_90` whose length is >= n
    """
    # Fewer than n columns remain after `day_90`: no sequence can qualify
    if row['day_90'] + n > 90:
        return 0
    
    # The labels of the columns after `day_90`
    idx = np.arange(row['day_90']+1, 91)

    # Select by label explicitly, since the row index mixes strings and ints
    a = row.loc[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    e = d[d >= n]
    
    return 0 if len(e) == 0 else e.iloc[0]

df['0_sequence'] = df.apply(count_sequence_length, n=7, axis=1)

Second Attempt

The above version is nice, but slow because it calculates the size of all islands. Since you only care about the size of the first island meeting the criteria, a simple for loop works much faster:

def count_sequence_length_2(row, n):
    if row['day_90'] + n > 90:
        return 0
    
    size = 0
    for i in range(row['day_90']+1, 91):
        if row[i] == 0:
            # increase the size of the current island
            size += 1
        elif size >= n:
            # found the island we want. Search no more
            break
        else:
            # create a new island
            size = 0
    return size if size >= n else 0

df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)

In my benchmarks, this is between 10 and 20 times faster.
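
For comparison, the run lengths of one row can also be computed with plain NumPy by looking at where the zero mask switches on and off. This is a different technique from the two attempts above (run-length detection via `np.diff`), offered as a sketch only; the name `first_zero_run_np` is hypothetical, and no claim is made here about how its speed compares.

```python
import numpy as np


def first_zero_run_np(values, n):
    """Length of the first run of zeros with at least n elements, or 0."""
    is_zero = np.asarray(values) == 0
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], is_zero, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # position where each run begins
    ends = np.flatnonzero(edges == -1)    # one past where each run ends
    lengths = ends - starts
    long_enough = lengths[lengths >= n]
    return int(long_enough[0]) if long_enough.size else 0


print(first_zero_run_np([0, 0, 0, 13, 0, 0, 4, 12, 0, 0], 3))  # → 3
```

To use it on the dataframe, one would pass it the values of the day columns that follow `day_90` in each row, in the same way the two functions above select them.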

Answer 2

Score: 1

Here is my solution, see the comments in the code:

import io

import numpy as np
import pandas as pd

# Test data:
text = """  ID  0   1   2   3  4  5   6   7  8  day_90
0  A  2  21   0  18  3  0   0   0  2       4
1  B  0  20  12   2  0  8  14  23  0       5
2  C  0  38  19   3  1  3   3   7  1       1
3  D  3   0   0   1  0  0   0   0  0       0"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python")

# Convert the day column names into integers:
cols = list(range(9))
df.columns = ["ID"] + cols + ["day_90"]

istart, istop = df.columns.get_loc(0), df.columns.get_loc(8) + 1
# The first zero sequence must be longer than this:
lseq = 2

# The function for aggregating: this is the main calculation, 'r' is a row of 'df':
def zz(r):
    # Get the day columns, starting at the column given in 'day_90'
    s = r.iloc[r.day_90 + istart:istop]
    # --- Manipulate 's' so that 'groupby' can separate the zero sequences:
    crit = s.eq(0)
    s = pd.Series(np.where(crit, np.nan, np.arange(len(s))), index=s.index)
    if np.isnan(s.iloc[0]):
        # Label a leading zero run with -1, which no position label can equal
        s.iloc[0] = -1
    s = s.ffill()
    s[~crit] = np.nan
    # --- Get the sequences and their sizes:
    ssiz = s.groupby(s).size()
    return ssiz.iloc[0] if len(ssiz) and ssiz.iloc[0] > lseq else np.nan

df["zseq"] = df.agg(zz, axis=1)

  ID  0   1   2   3  4  5   6   7  8  day_90  zseq
0  A  2  21   0  18  3  0   0   0  2       4   3.0
1  B  0  20  12   2  0  8  14  23  0       5   NaN
2  C  0  38  19   3  1  3   3   7  1       1   NaN
3  D  3   0   0   1  0  0   0   0  0       0   NaN

huangapple
  • Published on January 3, 2020, 23:39:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59581340.html