在数据框中,根据特定条件复制每个组/ID的最后一行。

huangapple go评论65阅读模式
英文:

Duplicate last row of a group/id in a dataframe, based on certain condition

问题

以下是要翻译的代码部分:

#Load the required libraries
import pandas as pd

#Create dataset
data = {'id': [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1,
               2, 2,
               3, 3, 3, 3, 3, 3,
               4, 
               5, 5, 5, 5, 5,5, 5, 5,5],
        'cycle': [1,2, 3, 4, 5,6,7,8,9,10,11,
                  1,2, 
                  1,2, 3, 4, 5,6,
                  1,
                  1,2, 3, 4, 5,6,7,8,9,],
        'Salary': [7, 7, 7,8,9,10,11,12,13,14,15,
                   4, 5,
                   8,9,10,11,12,13,
                   8,
                   7, 7,9,10,11,12,13,14,15,],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No',
                     'Yes', 'No', 
                     'No','Yes', 'Yes', 'No','No', 'Yes',
                     'Yes',
                      'No',  'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No',],
        'Days': [123, 128, 66, 66, 120, 141, 52,96, 120, 141, 52,
                 96, 120,
                 15,123, 128, 66, 120, 141,
                 141,
                 123, 128, 66, 123, 128, 66, 120, 141, 52,],
        }

#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)

请注意,此代码是用HTML编码嵌套在字符串中的,我已经将其解码为Python代码。

英文:

I have the below dataframe:

#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1,
2, 2,
3, 3, 3, 3, 3, 3,
4, 
5, 5, 5, 5, 5,5, 5, 5,5],
'cycle': [1,2, 3, 4, 5,6,7,8,9,10,11,
1,2, 
1,2, 3, 4, 5,6,
1,
1,2, 3, 4, 5,6,7,8,9,],
'Salary': [7, 7, 7,8,9,10,11,12,13,14,15,
4, 5,
8,9,10,11,12,13,
8,
7, 7,9,10,11,12,13,14,15,],
'Children': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No',
'Yes', 'No', 
'No','Yes', 'Yes', 'No','No', 'Yes',
'Yes',
'No',  'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No',],
'Days': [123, 128, 66, 66, 120, 141, 52,96, 120, 141, 52,
96, 120,
15,123, 128, 66, 120, 141,
141,
123, 128, 66, 123, 128, 66, 120, 141, 52,],
}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)

The above dataframe looks as such:

在数据框中,根据特定条件复制每个组/ID的最后一行。

Here, every id has different cycles as per the 'cycle' column. For example,

id-1 has maximum 11 cycles.

id-2 has maximum 2 cycle.

id-3 has maximum 6 cycles.

id-4 has maximum 1 cycle.

id-5 has maximum 9 cycles.

I have a certain threshold limit on 'cycles'. Let's say cycle_threshold = 3

If maximum cycles for and id <= cycle_threshold, then duplicate the last row of that group/id till the cycle reaches 4.

For example,

For id-2, since maximum cycles are 2 (which is less than 3), repeat the last row of id-2, till the cycle becomes 4.

For id-4, since maximum cycles are 1 (which is less than 3), repeat the last row of id-4, till the cycle becomes 4.

The rest of the group's/id's remain as it is.

The result looks as such:

在数据框中,根据特定条件复制每个组/ID的最后一行。

Can somebody please let me know how do I achieve this task in Python?

答案1

得分: 0

你可以通过比较每个分组中的cycle的最大值来确定id,然后对于筛选后的分组,使用MultiIndex.from_product创建MultiIndex,并使用DataFrame.reindex和参数method='ffill'重复最后一个值:

cycle_threshold = 3
M = 4

# 获取每个id的最大cycle
s = df.groupby('id')['cycle'].max()

# 获取小于或等于阈值的id
ids = s.index
展开收缩
# 检查id是否匹配id列表 mask = df['id'].isin(ids) # 为筛选后的分组创建MultiIndex,其中cycle范围是1到M + 1 df1 = df[mask].set_index(['id', 'cycle']) mux = pd.MultiIndex.from_product([ids, range(1, M + 1)], names=['id', 'cycle']) # 按id和cycle排序,追加新的分组 df2 = (pd.concat([df[~mask], df1.reindex(mux, method='ffill').reset_index()]) .sort_values(['id', 'cycle'], ignore_index=True) )

另一个想法是创建最终的MultiIndex并重新索引所有行:

cycle_threshold = 3
M = 4

s = df.groupby('id')['cycle'].max()
ids = s.index
展开收缩
mask = df['id'].isin(ids) df2 = df.set_index(['id', 'cycle']) mux = (pd.MultiIndex.from_product([ids, range(1, 5)], names=['id', 'cycle']).append(df2[~mask.to_numpy()].index) .sort_values()) df2 = df2.reindex(mux, method='ffill').reset_index()

这些代码将生成一个名为df2的DataFrame,其中包含重新排列和填充的数据。

英文:

You can idetify id by compare maximal values of cycle per groups and for filtered groups create MultiIndex.from_product and reepat last value by DataFrame.reindex with parameter method='ffill':

cycle_threshold = 3
M = 4
#get maximal cycle per id
s = df.groupby(&#39;id&#39;)[&#39;cycle&#39;].max()
#get id less or equal threshold
ids = s.index
展开收缩
#test groups by id match list ids mask = df[&#39;id&#39;].isin(ids) #create MultiIndex with range(1,M + 1) for filtered groups df1 = df[mask].set_index([&#39;id&#39;,&#39;cycle&#39;]) mux = pd.MultiIndex.from_product([ids, range(1, M + 1)], names=[&#39;id&#39;,&#39;cycle&#39;]) #append new groups with sorting by id and cycle df2 = (pd.concat([df[~mask], df1.reindex(mux, method=&#39;ffill&#39;).reset_index()]) .sort_values([&#39;id&#39;,&#39;cycle&#39;], ignore_index=True))

Another idea is create final MultiIndex and reindex all rows:

cycle_threshold = 3
M = 4
s = df.groupby(&#39;id&#39;)[&#39;cycle&#39;].max()
ids = s.index
展开收缩
mask = df[&#39;id&#39;].isin(ids) df2 = df.set_index([&#39;id&#39;,&#39;cycle&#39;]) mux = (pd.MultiIndex.from_product([ids, range(1, 5)], names=[&#39;id&#39;,&#39;cycle&#39;]).append(df2[~mask.to_numpy()].index) .sort_values()) df2 = df2.reindex(mux, method=&#39;ffill&#39;).reset_index()

print (df2)
id  cycle  Salary Children  Days
0    1      1       7       No   123
1    1      2       7      Yes   128
2    1      3       7      Yes    66
3    1      4       8      Yes    66
4    1      5       9      Yes   120
5    1      6      10       No   141
6    1      7      11       No    52
7    1      8      12      Yes    96
8    1      9      13      Yes   120
9    1     10      14      Yes   141
10   1     11      15       No    52
11   2      1       4      Yes    96
12   2      2       5       No   120
13   2      3       5       No   120
14   2      4       5       No   120
15   3      1       8       No    15
16   3      2       9      Yes   123
17   3      3      10      Yes   128
18   3      4      11       No    66
19   3      5      12       No   120
20   3      6      13      Yes   141
21   4      1       8      Yes   141
22   4      2       8      Yes   141
23   4      3       8      Yes   141
24   4      4       8      Yes   141
25   5      1       7       No   123
26   5      2       7      Yes   128
27   5      3       9       No    66
28   5      4      10       No   123
29   5      5      11      Yes   128
30   5      6      12      Yes    66
31   5      7      13      Yes   120
32   5      8      14      Yes   141
33   5      9      15       No    52

答案2

得分: 0

以下是已翻译的代码部分:

import numpy as np

cycle_threshold = 3

def fix_cycle(df):
    lgt = len(df)
    if lgt < cycle_threshold:
        df1 = df.reindex(df.index[-1].repeat(cycle_threshold - lgt + 1))
        df1['cycle'] += np.arange(len(df1)) + 1
        return df1

df1 = df.groupby('id').apply(fix_cycle).droplevel(0)
out = pd.concat([df, df1]).sort_index(kind='stable', ignore_index=True)

输出:

>>> out
    id  cycle  Salary Children  Days
0    1      1       7       No   123
1    1      2       7      Yes   128
2    1      3       7      Yes    66
3    1      4       8      Yes    66
4    1      5       9      Yes   120
5    1      6      10       No   141
6    1      7      11       No    52
7    1      8      12      Yes    96
8    1      9      13      Yes   120
9    1     10      14      Yes   141
10   1     11      15       No    52
11   2      1       4      Yes    96
12   2      2       5       No   120
13   2      3       5       No   120  # &lt;- 这里
14   2      4       5       No   120  # &lt;- 这里
15   3      1       8       No    15
16   3      2       9      Yes   123
17   3      3      10      Yes   128
18   3      4      11       No    66
19   3      5      12       No   120
20   3      6      13      Yes   141
21   4      1       8      Yes   141
22   4      2       8      Yes   141  # &lt;- 这里
23   4      3       8      Yes   141  # &lt;- 这里
24   4      4       8      Yes   141  # &lt;- 这里
25   5      1       7       No   123
26   5      2       7      Yes   128
27   5      3       9       No    66
28   5      4      10       No   123
29   5      5      11      Yes   128
30   5      6      12      Yes    66
31   5      7      13      Yes   120
32   5      8      14      Yes   141
33   5      9      15       No    52

详细信息:

>>> df1
    id  cycle  Salary Children  Days
12   2      3       5       No   120
12   2      4       5       No   120
19   4      2       8      Yes   141
19   4      3       8      Yes   141
19   4      4       8      Yes   141
英文:

Create a dataframe with repeated rows then concatenate to the original one and finally sort it:

import numpy as np
cycle_threshold = 3
def fix_cycle(df):
lgt = len(df)
if lgt &lt; cycle_threshold:
df1 = df.reindex(df.index[-1].repeat(cycle_threshold - lgt + 1))
df1[&#39;cycle&#39;] += np.arange(len(df1)) + 1
return df1
df1 = df.groupby(&#39;id&#39;).apply(fix_cycle).droplevel(0)
out = pd.concat([df, df1]).sort_index(kind=&#39;stable&#39;, ignore_index=True)

Output:

&gt;&gt;&gt; out
id  cycle  Salary Children  Days
0    1      1       7       No   123
1    1      2       7      Yes   128
2    1      3       7      Yes    66
3    1      4       8      Yes    66
4    1      5       9      Yes   120
5    1      6      10       No   141
6    1      7      11       No    52
7    1      8      12      Yes    96
8    1      9      13      Yes   120
9    1     10      14      Yes   141
10   1     11      15       No    52
11   2      1       4      Yes    96
12   2      2       5       No   120
13   2      3       5       No   120  # &lt;- HERE
14   2      4       5       No   120  # &lt;- HERE
15   3      1       8       No    15
16   3      2       9      Yes   123
17   3      3      10      Yes   128
18   3      4      11       No    66
19   3      5      12       No   120
20   3      6      13      Yes   141
21   4      1       8      Yes   141
22   4      2       8      Yes   141  # &lt;- HERE
23   4      3       8      Yes   141  # &lt;- HERE
24   4      4       8      Yes   141  # &lt;- HERE
25   5      1       7       No   123
26   5      2       7      Yes   128
27   5      3       9       No    66
28   5      4      10       No   123
29   5      5      11      Yes   128
30   5      6      12      Yes    66
31   5      7      13      Yes   120
32   5      8      14      Yes   141
33   5      9      15       No    52

Details:

&gt;&gt;&gt; df1
id  cycle  Salary Children  Days
12   2      3       5       No   120
12   2      4       5       No   120
19   4      2       8      Yes   141
19   4      3       8      Yes   141
19   4      4       8      Yes   141

huangapple
  • 本文由 发表于 2023年4月4日 15:13:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/75926466.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定