2023-05-18 06:38:58

"Cumcount with Reset" and "Keep Last with Reset" in Python

Question

I have a follow-up question related to this prior StackOverflow question.

Suppose I have the following NumPy array:

import numpy as np
v = np.array([
  0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
  3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
  1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
  0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
  2, 0, 0, 0, 0, 1])

I am trying to obtain a list of all the repeating-element sequences and their starting indices. My hunch is that using Pandas is the most straightforward way to achieve this.

Using the previously referenced StackOverflow answer, I write the following:

import pandas as pd
df = pd.DataFrame(v, columns=['digit'])
df["seq_len"] = df.groupby(
    (df["digit"] != df["digit"].shift()).cumsum()
    )["digit"].cumcount()+1

Producing the result:

    digit  seq_len
0       0        1
1       0        2
2       0        3
3       1        1
4       3        1
..    ...      ...
89      0        1
90      0        2
91      0        3
92      0        4
93      1        1
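
To see why that grouping key makes the count restart at every change of value, it can help to trace the intermediate steps on a short input. This is only a sketch: the six-element series and the names `is_run_start` and `run_id` are hypothetical and chosen for readability.

```python
import pandas as pd

# Hypothetical short input, used only to keep the trace readable.
digits = pd.Series([0, 0, 1, 1, 1, 0], name="digit")

# True wherever a value differs from its predecessor (the first row is always True).
is_run_start = digits != digits.shift()

# The cumulative sum of those flags gives every run its own label.
run_id = is_run_start.cumsum()

# cumcount() numbers the rows inside each run from 0, so add 1.
seq_len = digits.groupby(run_id).cumcount() + 1

print(run_id.tolist())   # [1, 1, 2, 2, 2, 3]
print(seq_len.tolist())  # [1, 2, 1, 2, 3, 1]
```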

The last thing I need is to remove the duplicates along the "digit" column in such a way that the last "seq_len" value of each run is kept. Normally, you could use Pandas duplicated or drop_duplicates; however, these functions don't do any resetting along the column.

What I don't want is:

>>> df.drop_duplicates(subset='digit', keep='last')
    digit  seq_len
22      3        3
88      2        3
92      0        4
93      1        1

What I do want is something like:

>>> magic_function(df)
    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
..    ...      ...
88      2        3
92      0        4
93      1        1

Of course, if I do "index - seq_len + 1", I can obtain the true starting indices, e.g.,

index    digit  seq_len
0            0        3
3            1        1
4            3        2
6            1        5
..         ...      ...
86           2        3
89           0        4
92           1        1
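
The "index - seq_len + 1" arithmetic can be checked on the first four rows of the desired output. This is a minimal sketch: the frame `ends` is typed in by hand from the table above, and `starts` is a hypothetical name.

```python
import pandas as pd

# Run-ending rows copied from the desired output above (first four rows only).
ends = pd.DataFrame(
    {"digit": [0, 1, 3, 1], "seq_len": [3, 1, 2, 5]},
    index=[2, 3, 5, 10],
)

# A run that ends at position `index` and has length `seq_len`
# starts at index - seq_len + 1.
starts = ends.copy()
starts.index = ends.index.to_numpy() - ends["seq_len"].to_numpy() + 1

print(starts.index.tolist())  # [0, 3, 4, 6]
```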

So anyways, looking for any advice on an efficient magic_function() to do the above. Appreciate all the help!

Answer 1

Score: 2

Not pretty, but this seems to work:

import numpy as np
import pandas as pd

v = np.array([
  0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
  3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
  1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
  0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
  2, 0, 0, 0, 0, 1])

df = pd.DataFrame(v, columns=['digit'])
df["seq_len"] = df.groupby((df["digit"] != df["digit"].shift()).cumsum())["digit"].cumcount()+1

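# The new stuff: a row whose value differs from the previous row starts a new run,
# so the row just before it (plus the final row) is the last row of a run.
s = df["digit"].diff()[1:] != 0
df.loc[list(s[s].index - 1) + [len(df) - 1]]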
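
A short trace of that selection may make it easier to follow. This sketch assumes the mask is used to pick out its own True entries (`s[s]`) and that the frame has the default 0..n-1 index, since `index - 1` relies on consecutive integer labels. The six-row frame and the name `last_rows` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"digit": [0, 0, 1, 1, 1, 0]})  # hypothetical short input
run_id = (df["digit"] != df["digit"].shift()).cumsum()
df["seq_len"] = df.groupby(run_id)["digit"].cumcount() + 1

# True at every row (from row 1 on) whose value differs from the row before it.
s = df["digit"].diff()[1:] != 0

# Each such row starts a new run, so the row just before it ends the previous run.
# The final row always ends a run and is appended by hand.
last_rows = list(s[s].index - 1) + [len(df) - 1]
out = df.loc[last_rows]

print(last_rows)                 # [1, 4, 5]
print(out["seq_len"].tolist())   # [2, 3, 1]
```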

Answer 2

Score: 2

You can use the built-in groupby functions `first`/`last` and `size`. For the last index, I'm using a simple lambda, but I have a feeling there's an even more straightforward way I'm forgetting.

s = df['digit']
s.groupby(
    s.ne(s.shift()).cumsum().rename('group'),
).agg(
    last_index=lambda x: x.index[-1],
    digit='first',
    seq_len='size',
)

```none
       last_index  digit  seq_len
group                            
1               2      0        3
2               3      1        1
3               5      3        2
...           ...    ...      ...
41             88      2        3
42             92      0        4
43             93      1        1

[43 rows x 3 columns]
```
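
One possible lambda-free variant is sketched below. It is not from the answer: it assumes the frame has the default 0..n-1 index, moves that index into a regular column with `reset_index()` (which names it "index"), and then uses named aggregation. The six-row frame and the names `run_id`, `runs` and `start_index` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"digit": [0, 0, 1, 1, 1, 0]})  # hypothetical short input

s = df["digit"]
run_id = s.ne(s.shift()).cumsum().rename("group")

runs = (
    df.reset_index()              # exposes the row labels as a column named "index"
      .groupby(run_id)
      .agg(
          last_index=("index", "last"),
          digit=("digit", "first"),
          seq_len=("digit", "size"),
      )
)

# The start of each run follows from its last index and its length.
runs["start_index"] = runs["last_index"] - runs["seq_len"] + 1
print(runs)
```

Here the start index comes out of the same table, so no separate pass over the rows is needed.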


Answer 3

Score: 2

Any of the following should work as well:

# Option 1: give every row its run's full size, then keep rows whose next value differs
(df.assign(seq_len = df.groupby(df['digit'].diff().ne(0).cumsum())['digit'].transform('size'))
.loc[df['digit'].iloc[::-1].diff().ne(0)])

# Option 2: running count, then the last row of each run
m = df['digit'].diff().ne(0).cumsum()
df.assign(seq_len = df.groupby(m)['digit'].cumcount()+1).groupby(m).tail(1)

# Option 3: running count, then the row where the count peaks within each run
m = df['digit'].diff().ne(0).cumsum()
df.assign(seq_len = df.groupby(m)['digit'].cumcount()+1).loc[lambda x: x.groupby(m)['seq_len'].idxmax()]

# Option 4: size per (digit, run) pair; the result is renumbered from 0,
# so the original row labels are not kept
(df.groupby(
    ['digit', df['digit'].diff().ne(0).cumsum()], sort=False).size()
    .droplevel(1)
    .rename_axis('digit')
    .reset_index(name = 'seq_len'))

Output:

    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
11      0        1
12      2        1
13      3        1
14      2        1
16      1        2
18      0        2
19      1        1
22      3        3
23      2        1
27      0        4
30      1        3
31      2        1
32      1        1
37      0        5
38      1        1
40      2        2
41      1        1
43      0        2
46      1        3
49      0        3
...
85      1        2
88      2        3
92      0        4
93      1        1
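
A quick way to check that the first three variants select the same rows is to run them side by side on a small input. This is a sketch on a hypothetical six-row frame; `a`, `b` and `c` are just labels for the three results.

```python
import pandas as pd

df = pd.DataFrame({"digit": [0, 0, 1, 1, 1, 0]})  # hypothetical short input
m = df["digit"].diff().ne(0).cumsum()

# Full run size on every row, then keep rows whose next value differs.
a = (df.assign(seq_len=df.groupby(m)["digit"].transform("size"))
       .loc[df["digit"].iloc[::-1].diff().ne(0)])

# Running count, then the last row of every run.
b = df.assign(seq_len=df.groupby(m)["digit"].cumcount() + 1).groupby(m).tail(1)

# Running count, then the row where the count peaks in each run.
c = (df.assign(seq_len=df.groupby(m)["digit"].cumcount() + 1)
       .loc[lambda x: x.groupby(m)["seq_len"].idxmax()])

# All three should list the same row labels and the same run lengths.
for result in (a, b, c):
    print(result.index.tolist(), result["seq_len"].tolist())
```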

Answer 4

Score: 1

You can use GroupBy.apply with boolean indexing:

m = df["digit"].ne(df["digit"].shift())
grp = df.groupby(m.cumsum(), group_keys=False)

df["seq_len"] = grp["digit"].cumcount().add(1)

out = grp.apply(lambda g: g.loc[~g["digit"].duplicated(keep="last")])

Output:

print(out)

    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
11      0        1
..    ...      ...
83      0        7
85      1        2
88      2        3
92      0        4
93      1        1

[43 rows x 2 columns]
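
Because GroupBy.apply calls a Python function once per run, and the question starts from a NumPy array and asks for efficiency, the same run boundaries can also be located without any grouping. The following is a rough sketch rather than part of this answer: `run_lengths` is a hypothetical helper name, and the input is the first twelve elements of `v` from the question.

```python
import numpy as np

def run_lengths(values):
    """Hypothetical helper: return (starts, lasts, digits, lengths), one entry per run."""
    values = np.asarray(values)
    if values.size == 0:
        empty = np.array([], dtype=int)
        return empty, empty, values, empty
    # Positions where a value differs from its predecessor, i.e. where a new run starts.
    change = np.flatnonzero(values[1:] != values[:-1]) + 1
    starts = np.concatenate(([0], change))
    lasts = np.concatenate((change - 1, [values.size - 1]))
    return starts, lasts, values[starts], lasts - starts + 1

starts, lasts, digits, lengths = run_lengths([0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0])
print(starts.tolist())   # [0, 3, 4, 6, 11]
print(lasts.tolist())    # [2, 3, 5, 10, 11]
print(lengths.tolist())  # [3, 1, 2, 5, 1]
```

The first four entries match the starting indices 0, 3, 4, 6 and the lengths 3, 1, 2, 5 listed in the question.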
