"Cumcount with Reset" and "Keep Last with Reset" in Python

Question

I have a follow-up question related to this prior StackOverflow question.

Suppose I have the following NumPy array:

import numpy as np
v = np.array([
  0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
  3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
  1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
  0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
  2, 0, 0, 0, 0, 1])

I am trying to obtain a list of all the repeating-element sequences and their starting indices. My hunch is that using Pandas is the most straightforward way to achieve this.

Using the previously referenced StackOverflow answer, I write the following:

import pandas as pd
df = pd.DataFrame(v, columns=['digit'])
df["seq_len"] = df.groupby(
    (df["digit"] != df["digit"].shift()).cumsum()
    )["digit"].cumcount()+1

Producing the result:

    digit  seq_len
0       0        1
1       0        2
2       0        3
3       1        1
4       3        1
..    ...      ...
89      0        1
90      0        2
91      0        3
92      0        4
93      1        1
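
To make the reset explicit, here is a minimal sketch of the same grouping key on a made-up six-element series. The toy values and the names `run_id` and `seq_len` are only for illustration; the key increases by one every time the value changes, so `cumcount` restarts at each new run.

```python
import pandas as pd

# Toy data, not the array from the question.
s = pd.Series([0, 0, 1, 1, 1, 0], name="digit")

# True wherever the value differs from the previous one; the running sum
# of those flags gives every consecutive run its own id.
run_id = (s != s.shift()).cumsum()

# Counting within each run id restarts the counter at every change.
seq_len = s.groupby(run_id).cumcount() + 1

print(run_id.tolist())   # [1, 1, 2, 2, 2, 3]
print(seq_len.tolist())  # [1, 2, 1, 2, 3, 1]
```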

The last thing I need is to remove the duplicates along the "digit" column in such a way that only the last "seq_len" value of each consecutive run is kept. Normally, you could use Pandas duplicated or drop_duplicates; however, these functions look at the whole column and don't reset whenever the value changes.

What I don't want is:

>>> df.drop_duplicates(subset='digit', keep='last')
    digit  seq_len
22      3        3
88      2        3
92      0        4
93      1        1

What I do want is something like:

>>> magic_function(df)
    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
..    ...      ...
88      2        3
92      0        4
93      1        1

Of course, if I do "index - seq_len + 1", I can obtain the true starting indices, e.g.,

index    digit  seq_len
0            0        3
3            1        1
4            3        2
6            1        5
..         ...      ...
86           2        3
89           0        4
92           1        1
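
As a small sketch of that arithmetic, assuming a frame that already holds the last row of each run (the frame `last_rows` below is hand-built from the first four rows of the desired output and the variable names are hypothetical):

```python
import pandas as pd

# Hand-built "last row of each run" frame; the index is each run's last position.
last_rows = pd.DataFrame(
    {"digit": [0, 1, 3, 1], "seq_len": [3, 1, 2, 5]},
    index=[2, 3, 5, 10],
)

# start = last position - run length + 1
start = last_rows.index - last_rows["seq_len"].to_numpy() + 1
by_start = last_rows.set_axis(start, axis=0)

print(by_start.index.tolist())  # [0, 3, 4, 6]
```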

So I'm looking for any advice on an efficient magic_function() that does the above. I appreciate all the help!

Answer 1

Score: 2

Not pretty, but this seems to work:


v = np.array([
  0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
  3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
  1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
  0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
  2, 0, 0, 0, 0, 1])

df = pd.DataFrame(v, columns=['digit'])
df["seq_len"] = df.groupby((df["digit"] != df["digit"].shift()).cumsum())["digit"].cumcount()+1

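# The new stuff: keep the row just before each change, plus the final row.
# This relies on the default 0..n-1 integer index.
s = df["digit"].diff()[1:] != 0
df.loc[list(s[s].index - 1) + [len(df) - 1]]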
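
The same boundary idea can also be sketched with a boolean mask instead of index arithmetic. This is only an illustration, not the answer's own code: the helper name `keep_run_ends` and the toy frame are hypothetical. A row is kept when the next row holds a different value, or when there is no next row.

```python
import pandas as pd

def keep_run_ends(df, col="digit"):
    """Keep only the last row of each run of equal consecutive values in `col`."""
    # shift(-1) looks one row ahead; the final row compares against NaN,
    # which is never equal, so it is always kept.
    is_run_end = df[col].ne(df[col].shift(-1))
    return df[is_run_end]

# Toy frame with seq_len already computed.
toy = pd.DataFrame({"digit": [0, 0, 1, 3, 3], "seq_len": [1, 2, 1, 1, 2]})
result = keep_run_ends(toy)

print(result.index.tolist())       # [1, 2, 4]
print(result["seq_len"].tolist())  # [2, 1, 2]
```

Because the mask is aligned on the frame's own index, this variant does not depend on the index being the default 0..n-1 range.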

Answer 2

Score: 2

You can use the built-in groupby functions `first`/`last` and `size`. For the last index, I'm using a simple lambda, but I have a feeling there's an even more straightforward way I'm forgetting.

s = df['digit']
s.groupby(
    s.ne(s.shift()).cumsum().rename('group'),
).agg(
    last_index=lambda x: x.index[-1],
    digit='first',
    seq_len='size',
)


```none
       last_index  digit  seq_len
group                            
1               2      0        3
2               3      1        1
3               5      3        2
...           ...    ...      ...
41             88      2        3
42             92      0        4
43             93      1        1

[43 rows x 3 columns]
```

Answer 3

Score: 2

This should work as well:

(df.assign(seq_len=df.groupby(df['digit'].diff().ne(0).cumsum())['digit'].transform('size'))
   .loc[df['digit'].iloc[::-1].diff().ne(0)])

or

m = df['digit'].diff().ne(0).cumsum()
df.assign(seq_len=df.groupby(m)['digit'].cumcount() + 1).groupby(m).tail(1)

or

m = df['digit'].diff().ne(0).cumsum()
df.assign(seq_len=df.groupby(m)['digit'].cumcount() + 1).loc[lambda x: x.groupby(m)['seq_len'].idxmax()]

or

# Note: this variant returns a fresh 0-based index, not the last index of each run.
(df.groupby(
    ['digit', df['digit'].diff().ne(0).cumsum()], sort=False).size()
    .droplevel(1)
    .rename_axis('digit')
    .reset_index(name='seq_len'))

Output:

    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
11      0        1
12      2        1
13      3        1
14      2        1
16      1        2
18      0        2
19      1        1
22      3        3
23      2        1
27      0        4
30      1        3
31      2        1
32      1        1
37      0        5
38      1        1
40      2        2
41      1        1
43      0        2
46      1        3
49      0        3
...
85      1        2
88      2        3
92      0        4
93      1        1

Answer 4

Score: 1

You can use GroupBy.apply with boolean indexing:

m = df["digit"].ne(df["digit"].shift())
grp = df.groupby(m.cumsum(), group_keys=False)

df["seq_len"] = grp["digit"].cumcount().add(1)

out = grp.apply(lambda g: g.loc[~g["digit"].duplicated(keep="last")])

Output:

print(out)

    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
11      0        1
..    ...      ...
83      0        7
85      1        2
88      2        3
92      0        4
93      1        1

[43 rows x 2 columns]
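
For comparison with the pandas answers, the start indices and run lengths can also be computed directly on the NumPy array. The following is a minimal sketch rather than code from any of the answers; the function name `run_lengths` is hypothetical and the input is just the first twelve values of `v`.

```python
import numpy as np

def run_lengths(values):
    """Return (start_indices, digits, lengths) for each run of equal values."""
    values = np.asarray(values)
    if values.size == 0:
        empty = np.array([], dtype=int)
        return empty, values, empty
    # A run starts at position 0 and wherever a value differs from its predecessor.
    starts = np.flatnonzero(np.r_[True, values[1:] != values[:-1]])
    # Each run's length is the distance to the next start (or to the array end).
    lengths = np.diff(np.r_[starts, values.size])
    return starts, values[starts], lengths

starts, digits, lengths = run_lengths([0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0])
print(starts.tolist())   # [0, 3, 4, 6, 11]
print(digits.tolist())   # [0, 1, 3, 1, 0]
print(lengths.tolist())  # [3, 1, 2, 5, 1]
```

This yields the true starting indices without the "index - seq_len + 1" step, and avoids building a DataFrame when only the run summary is needed.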

huangapple
  • Posted on May 18, 2023, 06:38:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76276625.html