"Cumcount with Reset" and "Keep Last with Reset" in Python
Question
I have a follow-up question related to this prior StackOverflow question.
Suppose I have the following NumPy array:
import numpy as np
v = np.array([
    0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
    3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
    1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
    0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
    2, 0, 0, 0, 0, 1])
I am trying to obtain a list of all the repeating-element sequences and their starting indices. My hunch is that using Pandas is the most straightforward way to achieve this.
Using the previously referenced StackOverflow answer, I write the following:
import pandas as pd
df = pd.DataFrame(v, columns=['digit'])
df["seq_len"] = df.groupby(
    (df["digit"] != df["digit"].shift()).cumsum()
)["digit"].cumcount() + 1
Producing the result:
    digit  seq_len
0       0        1
1       0        2
2       0        3
3       1        1
4       3        1
..    ...      ...
89      0        1
90      0        2
91      0        3
92      0        4
93      1        1
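To make the grouping key easier to follow, here is a minimal sketch of the intermediate steps on a tiny made-up series (not the array above). The names `starts` and `run_id` are just illustrative.

```python
import pandas as pd

# Small made-up series, not the array from the question.
s = pd.Series([0, 0, 1, 3, 3, 3, 1], name="digit")

# True wherever a value differs from the one before it. The first row is
# always True, because shift() puts NaN there.
starts = s != s.shift()

# The running total of those flags gives every run its own label.
run_id = starts.cumsum()

# cumcount() numbers the rows inside each run from 0, so add 1.
seq_len = s.groupby(run_id).cumcount() + 1

print(pd.DataFrame({"digit": s, "run_id": run_id, "seq_len": seq_len}))
```

The counter restarts at 1 every time `run_id` changes, which is the "reset" behaviour in the `seq_len` column.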
The last thing I need is to remove the duplicates along the "digit" column in such a way that the last "seq_len" value of each run is kept. Normally, you could use Pandas duplicated or drop_duplicates; however, these functions look at the whole column and don't reset when a new run of values starts.
What I don't want is:
>>> df.drop_duplicates(subset='digit', keep='last')
    digit  seq_len
22      3        3
88      2        3
92      0        4
93      1        1
What I do want is something like:
>>> magic_function(df)
    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
..    ...      ...
88      2        3
92      0        4
93      1        1
Of course, if I do "index - seq_len + 1", I can obtain the true starting indices, e.g.,
index  digit  seq_len
0          0        3
3          1        1
4          3        2
6          1        5
..       ...      ...
86         2        3
89         0        4
93         1        1
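As a rough sketch of that arithmetic, assuming a frame that already holds the last row of each run (the `last_rows` frame below is made up by hand from the first four rows of the desired output):

```python
import pandas as pd

# Hand-made frame: last row of each run, indexed by its position in the array.
last_rows = pd.DataFrame(
    {"digit": [0, 1, 3, 1], "seq_len": [3, 1, 2, 5]},
    index=[2, 3, 5, 10],
)

# Start index = last index of the run - run length + 1.
start = last_rows.index.to_numpy() - last_rows["seq_len"].to_numpy() + 1

# Swap the "last row" labels for the start positions.
starts = last_rows.set_index(pd.Index(start, name="index"))
print(starts)
```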
So anyways, looking for any advice on an efficient magic_function() to do the above. Appreciate all the help!
Answer 1
Score: 2
Not pretty, but this seems to work:
v = np.array([
0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
2, 0, 0, 0, 0, 1])
df = pd.DataFrame(v, columns=['digit'])
df["seq_len"] = df.groupby((df["digit"] != df["digit"].shift()).cumsum())["digit"].cumcount()+1
# The new stuff:
s = df["digit"].diff()[1:] != 0
df.loc[list(s[s].index - 1) + [len(df) - 1]]
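Here is a small sketch of why this selects the last row of each run, using made-up data and assuming the default 0..n-1 integer index (the `- 1` arithmetic relies on it). The names `run_starts` and `last_rows` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"digit": [0, 0, 1, 3, 3, 3, 1]})  # made-up data

# diff() is non-zero on the first row of every run. The very first row of
# the frame is skipped by [1:], since its diff is NaN.
s = df["digit"].diff()[1:] != 0

# Filtering s by itself keeps only the True entries, i.e. the labels
# where a new run starts.
run_starts = s[s].index

# The row just before each start is the last row of the previous run,
# and the final run ends at the last row of the frame.
last_rows = list(run_starts - 1) + [len(df) - 1]
print(df.loc[last_rows])
```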
Answer 2
Score: 2
You can use the built-in groupby functions `first`/`last` and `size`. For the last index, I'm using a simple lambda, but I have a feeling there's an even more straightforward way I'm forgetting.
s = df['digit']
s.groupby(
    s.ne(s.shift()).cumsum().rename('group'),
).agg(
    last_index=lambda x: x.index[-1],
    digit='first',
    seq_len='size',
)
```none
       last_index  digit  seq_len
group
1               2      0        3
2               3      1        1
3               5      3        2
...           ...    ...      ...
41             88      2        3
42             92      0        4
43             93      1        1

[43 rows x 3 columns]
```
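One possible way to avoid the lambda is to expose the row labels as a regular column with `reset_index()` and take `last` on it. This is only a sketch on made-up data. It assumes an unnamed default integer index, so the new column is called "index" and the grouping key still lines up with the rows.

```python
import pandas as pd

df = pd.DataFrame({"digit": [0, 0, 1, 3, 3, 3, 1]})  # made-up data

# One label per run of equal consecutive values.
run_id = df["digit"].ne(df["digit"].shift()).cumsum().rename("group")

out = (
    df.reset_index()          # row labels become an "index" column
      .groupby(run_id)
      .agg(
          last_index=("index", "last"),
          digit=("digit", "first"),
          seq_len=("digit", "size"),
      )
)
print(out)
```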
Answer 3
Score: 2
This should work as well:
(df.assign(seq_len = df.groupby(df['digit'].diff().ne(0).cumsum())['digit'].transform('size'))
.loc[df['digit'].iloc[::-1].diff().ne(0)])
or
m = df['digit'].diff().ne(0).cumsum()
df.assign(seq_len = df.groupby(m)['digit'].cumcount()+1).groupby(m).tail(1)
or
m = df['digit'].diff().ne(0).cumsum()
df.assign(seq_len = df.groupby(m)['digit'].cumcount()+1).loc[lambda x: x.groupby(m)['seq_len'].idxmax()]
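or, if only the digit and length of each run are needed (this variant resets the index to 0..42 instead of keeping each run's last index, and is written here against the 'digit' column of the question's df):
(df.groupby(
    ['digit', df['digit'].diff().ne(0).cumsum()], sort=False).size()
    .droplevel(1)
    .rename_axis('digit')
    .reset_index(name='seq_len'))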
Output:
    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
11      0        1
12      2        1
13      3        1
14      2        1
16      1        2
18      0        2
19      1        1
22      3        3
23      2        1
27      0        4
30      1        3
31      2        1
32      1        1
37      0        5
38      1        1
40      2        2
41      1        1
43      0        2
46      1        3
49      0        3
...
85      1        2
88      2        3
92      0        4
93      1        1
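For reference, the `groupby(m).tail(1)` variant can be wrapped into the `magic_function` the question asks for. This is a sketch, not a benchmarked implementation. The `col` parameter is an addition for illustration, and `shift()`/`ne()` is used in place of `diff()` so that non-numeric columns would also work.

```python
import pandas as pd

def magic_function(df, col="digit"):
    """Keep the last row of each run of equal consecutive values in `col`."""
    # One label per run: increments whenever the value changes.
    run_id = df[col].ne(df[col].shift()).cumsum()
    # Position inside the run, counted from 1.
    seq_len = df.groupby(run_id)[col].cumcount() + 1
    # tail(1) keeps the last row of every run, with its original index.
    return df.assign(seq_len=seq_len).groupby(run_id).tail(1)

demo = pd.DataFrame({"digit": [0, 0, 1, 3, 3, 3, 1]})  # made-up data
result = magic_function(demo)
print(result)
```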
Answer 4
Score: 1
You can use GroupBy.apply with boolean indexing:
m = df["digit"].ne(df["digit"].shift())
grp = df.groupby(m.cumsum(), group_keys=False)
df["seq_len"] = grp["digit"].cumcount().add(1)
out = grp.apply(lambda g: g.loc[~g["digit"].duplicated(keep="last")])
Output:
print(out)
    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
11      0        1
..    ...      ...
83      0        7
85      1        2
88      2        3
92      0        4
93      1        1

[43 rows x 2 columns]
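Since the data starts out as a NumPy array, the same runs can also be read off without pandas. The following is a different technique from the groupby approaches above, sketched on made-up data. It returns the start index, value and length of each run directly.

```python
import numpy as np

v = np.array([0, 0, 1, 3, 3, 3, 1])  # made-up data

# A run starts at position 0 and wherever a value differs from its predecessor.
is_start = np.concatenate(([True], v[1:] != v[:-1]))
starts = np.flatnonzero(is_start)

# Run length = distance from each start to the next start (or to the end).
lengths = np.diff(np.append(starts, len(v)))

# The value of each run, and the index of its last element.
digits = v[starts]
last_index = starts + lengths - 1

print(starts, digits, lengths, last_index)
```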