2023年4月4日 15:13:08go评论71阅读模式

英文:

Duplicate last row of a group/id in a dataframe, based on certain condition

问题

以下是要翻译的代码部分：

#Load the required libraries
import pandas as pd

#Create dataset
data = {'id': [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1,
               2, 2,
               3, 3, 3, 3, 3, 3,
               4, 
               5, 5, 5, 5, 5,5, 5, 5,5],
        'cycle': [1,2, 3, 4, 5,6,7,8,9,10,11,
                  1,2, 
                  1,2, 3, 4, 5,6,
                  1,
                  1,2, 3, 4, 5,6,7,8,9,],
        'Salary': [7, 7, 7,8,9,10,11,12,13,14,15,
                   4, 5,
                   8,9,10,11,12,13,
                   8,
                   7, 7,9,10,11,12,13,14,15,],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No',
                     'Yes', 'No', 
                     'No','Yes', 'Yes', 'No','No', 'Yes',
                     'Yes',
                      'No',  'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No',],
        'Days': [123, 128, 66, 66, 120, 141, 52,96, 120, 141, 52,
                 96, 120,
                 15,123, 128, 66, 120, 141,
                 141,
                 123, 128, 66, 123, 128, 66, 120, 141, 52,],
        }

#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)

请注意，此代码是用HTML编码嵌套在字符串中的，我已经将其解码为Python代码。

英文:

I have the below dataframe:

#Load the required libraries
import pandas as pd
#Create dataset
data = {&#39;id&#39;: [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1,
2, 2,
3, 3, 3, 3, 3, 3,
4, 
5, 5, 5, 5, 5,5, 5, 5,5],
&#39;cycle&#39;: [1,2, 3, 4, 5,6,7,8,9,10,11,
1,2, 
1,2, 3, 4, 5,6,
1,
1,2, 3, 4, 5,6,7,8,9,],
&#39;Salary&#39;: [7, 7, 7,8,9,10,11,12,13,14,15,
4, 5,
8,9,10,11,12,13,
8,
7, 7,9,10,11,12,13,14,15,],
&#39;Children&#39;: [&#39;No&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;No&#39;,&#39;No&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;No&#39;,
&#39;Yes&#39;, &#39;No&#39;, 
&#39;No&#39;,&#39;Yes&#39;, &#39;Yes&#39;, &#39;No&#39;,&#39;No&#39;, &#39;Yes&#39;,
&#39;Yes&#39;,
&#39;No&#39;,  &#39;Yes&#39;, &#39;No&#39;, &#39;No&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;Yes&#39;, &#39;No&#39;,],
&#39;Days&#39;: [123, 128, 66, 66, 120, 141, 52,96, 120, 141, 52,
96, 120,
15,123, 128, 66, 120, 141,
141,
123, 128, 66, 123, 128, 66, 120, 141, 52,],
}
#Convert to dataframe
df = pd.DataFrame(data)
print(&quot;df = \n&quot;, df)

The above dataframe looks as such:

Here, every id has different cycles as per the 'cycle' column. For example,

id-1 has maximum 11 cycles.

id-2 has maximum 2 cycle.

id-3 has maximum 6 cycles.

id-4 has maximum 1 cycle.

id-5 has maximum 9 cycles.

I have a certain threshold limit on 'cycles'. Let's say cycle_threshold = 3

If maximum cycles for and id <= cycle_threshold, then duplicate the last row of that group/id till the cycle reaches 4.

For example,

For id-2, since maximum cycles are 2 (which is less than 3), repeat the last row of id-2, till the cycle becomes 4.

For id-4, since maximum cycles are 1 (which is less than 3), repeat the last row of id-4, till the cycle becomes 4.

The rest of the group's/id's remain as it is.

The result looks as such:

Can somebody please let me know how do I achieve this task in Python?

答案1

得分: 0

你可以通过比较每个分组中的cycle的最大值来确定id，然后对于筛选后的分组，使用MultiIndex.from_product创建MultiIndex，并使用DataFrame.reindex和参数method='ffill'重复最后一个值：

cycle_threshold = 3
M = 4

# 获取每个id的最大cycle
s = df.groupby('id')['cycle'].max()

# 获取小于或等于阈值的id
ids = s.index展开收缩

# 检查id是否匹配id列表
mask = df['id'].isin(ids)

# 为筛选后的分组创建MultiIndex，其中cycle范围是1到M + 1
df1 = df[mask].set_index(['id', 'cycle'])
mux = pd.MultiIndex.from_product([ids, range(1, M + 1)], names=['id', 'cycle'])

# 按id和cycle排序，追加新的分组
df2 = (pd.concat([df[~mask],
                 df1.reindex(mux, method='ffill').reset_index()])
          .sort_values(['id', 'cycle'], ignore_index=True)
)

另一个想法是创建最终的MultiIndex并重新索引所有行：

cycle_threshold = 3
M = 4

s = df.groupby('id')['cycle'].max()
ids = s.index展开收缩

mask = df['id'].isin(ids)

df2 = df.set_index(['id', 'cycle'])
mux = (pd.MultiIndex.from_product([ids, range(1, 5)], 
                                 names=['id', 'cycle']).append(df2[~mask.to_numpy()].index)
        .sort_values())
df2 = df2.reindex(mux, method='ffill').reset_index()

这些代码将生成一个名为df2的DataFrame，其中包含重新排列和填充的数据。

英文:

You can idetify id by compare maximal values of cycle per groups and for filtered groups create MultiIndex.from_product and reepat last value by DataFrame.reindex with parameter method='ffill':

cycle_threshold = 3
M = 4
#get maximal cycle per id
s = df.groupby(&#39;id&#39;)[&#39;cycle&#39;].max()
#get id less or equal threshold
ids = s.index展开收缩
#test groups by id match list ids
mask = df[&#39;id&#39;].isin(ids)
#create MultiIndex with range(1,M + 1) for filtered groups
df1 = df[mask].set_index([&#39;id&#39;,&#39;cycle&#39;])
mux = pd.MultiIndex.from_product([ids, range(1, M + 1)], names=[&#39;id&#39;,&#39;cycle&#39;])
#append new groups with sorting by id and cycle
df2 = (pd.concat([df[~mask],
df1.reindex(mux, method=&#39;ffill&#39;).reset_index()])
.sort_values([&#39;id&#39;,&#39;cycle&#39;], ignore_index=True))

Another idea is create final MultiIndex and reindex all rows:

cycle_threshold = 3
M = 4
s = df.groupby(&#39;id&#39;)[&#39;cycle&#39;].max()
ids = s.index展开收缩
mask = df[&#39;id&#39;].isin(ids)
df2 = df.set_index([&#39;id&#39;,&#39;cycle&#39;])
mux = (pd.MultiIndex.from_product([ids, range(1, 5)], 
names=[&#39;id&#39;,&#39;cycle&#39;]).append(df2[~mask.to_numpy()].index)
.sort_values())
df2 = df2.reindex(mux, method=&#39;ffill&#39;).reset_index()

print (df2)
id  cycle  Salary Children  Days
0    1      1       7       No   123
1    1      2       7      Yes   128
2    1      3       7      Yes    66
3    1      4       8      Yes    66
4    1      5       9      Yes   120
5    1      6      10       No   141
6    1      7      11       No    52
7    1      8      12      Yes    96
8    1      9      13      Yes   120
9    1     10      14      Yes   141
10   1     11      15       No    52
11   2      1       4      Yes    96
12   2      2       5       No   120
13   2      3       5       No   120
14   2      4       5       No   120
15   3      1       8       No    15
16   3      2       9      Yes   123
17   3      3      10      Yes   128
18   3      4      11       No    66
19   3      5      12       No   120
20   3      6      13      Yes   141
21   4      1       8      Yes   141
22   4      2       8      Yes   141
23   4      3       8      Yes   141
24   4      4       8      Yes   141
25   5      1       7       No   123
26   5      2       7      Yes   128
27   5      3       9       No    66
28   5      4      10       No   123
29   5      5      11      Yes   128
30   5      6      12      Yes    66
31   5      7      13      Yes   120
32   5      8      14      Yes   141
33   5      9      15       No    52

答案2

得分: 0

以下是已翻译的代码部分：

import numpy as np

cycle_threshold = 3

def fix_cycle(df):
    lgt = len(df)
    if lgt < cycle_threshold:
        df1 = df.reindex(df.index[-1].repeat(cycle_threshold - lgt + 1))
        df1['cycle'] += np.arange(len(df1)) + 1
        return df1

df1 = df.groupby('id').apply(fix_cycle).droplevel(0)
out = pd.concat([df, df1]).sort_index(kind='stable', ignore_index=True)

输出:

>>> out
    id  cycle  Salary Children  Days
0    1      1       7       No   123
1    1      2       7      Yes   128
2    1      3       7      Yes    66
3    1      4       8      Yes    66
4    1      5       9      Yes   120
5    1      6      10       No   141
6    1      7      11       No    52
7    1      8      12      Yes    96
8    1      9      13      Yes   120
9    1     10      14      Yes   141
10   1     11      15       No    52
11   2      1       4      Yes    96
12   2      2       5       No   120
13   2      3       5       No   120  # &lt;- 这里
14   2      4       5       No   120  # &lt;- 这里
15   3      1       8       No    15
16   3      2       9      Yes   123
17   3      3      10      Yes   128
18   3      4      11       No    66
19   3      5      12       No   120
20   3      6      13      Yes   141
21   4      1       8      Yes   141
22   4      2       8      Yes   141  # &lt;- 这里
23   4      3       8      Yes   141  # &lt;- 这里
24   4      4       8      Yes   141  # &lt;- 这里
25   5      1       7       No   123
26   5      2       7      Yes   128
27   5      3       9       No    66
28   5      4      10       No   123
29   5      5      11      Yes   128
30   5      6      12      Yes    66
31   5      7      13      Yes   120
32   5      8      14      Yes   141
33   5      9      15       No    52

详细信息:

>>> df1
    id  cycle  Salary Children  Days
12   2      3       5       No   120
12   2      4       5       No   120
19   4      2       8      Yes   141
19   4      3       8      Yes   141
19   4      4       8      Yes   141

英文:

Create a dataframe with repeated rows then concatenate to the original one and finally sort it:

import numpy as np
cycle_threshold = 3
def fix_cycle(df):
lgt = len(df)
if lgt &lt; cycle_threshold:
df1 = df.reindex(df.index[-1].repeat(cycle_threshold - lgt + 1))
df1[&#39;cycle&#39;] += np.arange(len(df1)) + 1
return df1
df1 = df.groupby(&#39;id&#39;).apply(fix_cycle).droplevel(0)
out = pd.concat([df, df1]).sort_index(kind=&#39;stable&#39;, ignore_index=True)

Output:

&gt;&gt;&gt; out
id  cycle  Salary Children  Days
0    1      1       7       No   123
1    1      2       7      Yes   128
2    1      3       7      Yes    66
3    1      4       8      Yes    66
4    1      5       9      Yes   120
5    1      6      10       No   141
6    1      7      11       No    52
7    1      8      12      Yes    96
8    1      9      13      Yes   120
9    1     10      14      Yes   141
10   1     11      15       No    52
11   2      1       4      Yes    96
12   2      2       5       No   120
13   2      3       5       No   120  # &lt;- HERE
14   2      4       5       No   120  # &lt;- HERE
15   3      1       8       No    15
16   3      2       9      Yes   123
17   3      3      10      Yes   128
18   3      4      11       No    66
19   3      5      12       No   120
20   3      6      13      Yes   141
21   4      1       8      Yes   141
22   4      2       8      Yes   141  # &lt;- HERE
23   4      3       8      Yes   141  # &lt;- HERE
24   4      4       8      Yes   141  # &lt;- HERE
25   5      1       7       No   123
26   5      2       7      Yes   128
27   5      3       9       No    66
28   5      4      10       No   123
29   5      5      11      Yes   128
30   5      6      12      Yes    66
31   5      7      13      Yes   120
32   5      8      14      Yes   141
33   5      9      15       No    52

Details:

&gt;&gt;&gt; df1
id  cycle  Salary Children  Days
12   2      3       5       No   120
12   2      4       5       No   120
19   4      2       8      Yes   141
19   4      3       8      Yes   141
19   4      4       8      Yes   141

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在数据框中，根据特定条件复制每个组/ID的最后一行。

问题

答案1

答案2

将sklearn的警告保存到数据框中

Asyncssh在使用多个进程时无法创建通道。

使用`each`还是`times`来处理R中的向量？

需要Python重新索引Elasticsearch索引。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论