英文:
Duplicate last row of a group/id in a dataframe, based on certain condition
问题
以下是要翻译的代码部分:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1,
2, 2,
3, 3, 3, 3, 3, 3,
4,
5, 5, 5, 5, 5,5, 5, 5,5],
'cycle': [1,2, 3, 4, 5,6,7,8,9,10,11,
1,2,
1,2, 3, 4, 5,6,
1,
1,2, 3, 4, 5,6,7,8,9,],
'Salary': [7, 7, 7,8,9,10,11,12,13,14,15,
4, 5,
8,9,10,11,12,13,
8,
7, 7,9,10,11,12,13,14,15,],
'Children': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No',
'Yes', 'No',
'No','Yes', 'Yes', 'No','No', 'Yes',
'Yes',
'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No',],
'Days': [123, 128, 66, 66, 120, 141, 52,96, 120, 141, 52,
96, 120,
15,123, 128, 66, 120, 141,
141,
123, 128, 66, 123, 128, 66, 120, 141, 52,],
}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
请注意,此代码是用HTML编码嵌套在字符串中的,我已经将其解码为Python代码。
英文:
I have the below dataframe:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1,
2, 2,
3, 3, 3, 3, 3, 3,
4,
5, 5, 5, 5, 5,5, 5, 5,5],
'cycle': [1,2, 3, 4, 5,6,7,8,9,10,11,
1,2,
1,2, 3, 4, 5,6,
1,
1,2, 3, 4, 5,6,7,8,9,],
'Salary': [7, 7, 7,8,9,10,11,12,13,14,15,
4, 5,
8,9,10,11,12,13,
8,
7, 7,9,10,11,12,13,14,15,],
'Children': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No',
'Yes', 'No',
'No','Yes', 'Yes', 'No','No', 'Yes',
'Yes',
'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No',],
'Days': [123, 128, 66, 66, 120, 141, 52,96, 120, 141, 52,
96, 120,
15,123, 128, 66, 120, 141,
141,
123, 128, 66, 123, 128, 66, 120, 141, 52,],
}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The above dataframe looks as such:
Here, every id has different cycles as per the 'cycle' column. For example,
id-1 has maximum 11 cycles.
id-2 has maximum 2 cycle.
id-3 has maximum 6 cycles.
id-4 has maximum 1 cycle.
id-5 has maximum 9 cycles.
I have a certain threshold limit on 'cycles'. Let's say cycle_threshold = 3
If maximum cycles for and id <= cycle_threshold, then duplicate the last row of that group/id till the cycle reaches 4.
For example,
For id-2, since maximum cycles are 2 (which is less than 3), repeat the last row of id-2, till the cycle becomes 4.
For id-4, since maximum cycles are 1 (which is less than 3), repeat the last row of id-4, till the cycle becomes 4.
The rest of the group's/id's remain as it is.
The result looks as such:
Can somebody please let me know how do I achieve this task in Python?
答案1
得分: 0
你可以通过比较每个分组中的cycle
的最大值来确定id
,然后对于筛选后的分组,使用MultiIndex.from_product
创建MultiIndex,并使用DataFrame.reindex
和参数method='ffill'
重复最后一个值:
cycle_threshold = 3
M = 4
# 获取每个id的最大cycle
s = df.groupby('id')['cycle'].max()
# 获取小于或等于阈值的id
ids = s.index展开收缩
# 检查id是否匹配id列表
mask = df['id'].isin(ids)
# 为筛选后的分组创建MultiIndex,其中cycle范围是1到M + 1
df1 = df[mask].set_index(['id', 'cycle'])
mux = pd.MultiIndex.from_product([ids, range(1, M + 1)], names=['id', 'cycle'])
# 按id和cycle排序,追加新的分组
df2 = (pd.concat([df[~mask],
df1.reindex(mux, method='ffill').reset_index()])
.sort_values(['id', 'cycle'], ignore_index=True)
)
另一个想法是创建最终的MultiIndex并重新索引所有行:
cycle_threshold = 3
M = 4
s = df.groupby('id')['cycle'].max()
ids = s.index展开收缩
mask = df['id'].isin(ids)
df2 = df.set_index(['id', 'cycle'])
mux = (pd.MultiIndex.from_product([ids, range(1, 5)],
names=['id', 'cycle']).append(df2[~mask.to_numpy()].index)
.sort_values())
df2 = df2.reindex(mux, method='ffill').reset_index()
这些代码将生成一个名为df2
的DataFrame,其中包含重新排列和填充的数据。
英文:
You can idetify id
by compare maximal values of cycle
per groups and for filtered groups create MultiIndex.from_product
and reepat last value by DataFrame.reindex
with parameter method='ffill':
cycle_threshold = 3
M = 4
#get maximal cycle per id
s = df.groupby('id')['cycle'].max()
#get id less or equal threshold
ids = s.index展开收缩
#test groups by id match list ids
mask = df['id'].isin(ids)
#create MultiIndex with range(1,M + 1) for filtered groups
df1 = df[mask].set_index(['id','cycle'])
mux = pd.MultiIndex.from_product([ids, range(1, M + 1)], names=['id','cycle'])
#append new groups with sorting by id and cycle
df2 = (pd.concat([df[~mask],
df1.reindex(mux, method='ffill').reset_index()])
.sort_values(['id','cycle'], ignore_index=True))
Another idea is create final MultiIndex and reindex all rows:
cycle_threshold = 3
M = 4
s = df.groupby('id')['cycle'].max()
ids = s.index展开收缩
mask = df['id'].isin(ids)
df2 = df.set_index(['id','cycle'])
mux = (pd.MultiIndex.from_product([ids, range(1, 5)],
names=['id','cycle']).append(df2[~mask.to_numpy()].index)
.sort_values())
df2 = df2.reindex(mux, method='ffill').reset_index()
print (df2)
id cycle Salary Children Days
0 1 1 7 No 123
1 1 2 7 Yes 128
2 1 3 7 Yes 66
3 1 4 8 Yes 66
4 1 5 9 Yes 120
5 1 6 10 No 141
6 1 7 11 No 52
7 1 8 12 Yes 96
8 1 9 13 Yes 120
9 1 10 14 Yes 141
10 1 11 15 No 52
11 2 1 4 Yes 96
12 2 2 5 No 120
13 2 3 5 No 120
14 2 4 5 No 120
15 3 1 8 No 15
16 3 2 9 Yes 123
17 3 3 10 Yes 128
18 3 4 11 No 66
19 3 5 12 No 120
20 3 6 13 Yes 141
21 4 1 8 Yes 141
22 4 2 8 Yes 141
23 4 3 8 Yes 141
24 4 4 8 Yes 141
25 5 1 7 No 123
26 5 2 7 Yes 128
27 5 3 9 No 66
28 5 4 10 No 123
29 5 5 11 Yes 128
30 5 6 12 Yes 66
31 5 7 13 Yes 120
32 5 8 14 Yes 141
33 5 9 15 No 52
答案2
得分: 0
以下是已翻译的代码部分:
import numpy as np
cycle_threshold = 3
def fix_cycle(df):
lgt = len(df)
if lgt < cycle_threshold:
df1 = df.reindex(df.index[-1].repeat(cycle_threshold - lgt + 1))
df1['cycle'] += np.arange(len(df1)) + 1
return df1
df1 = df.groupby('id').apply(fix_cycle).droplevel(0)
out = pd.concat([df, df1]).sort_index(kind='stable', ignore_index=True)
输出:
>>> out
id cycle Salary Children Days
0 1 1 7 No 123
1 1 2 7 Yes 128
2 1 3 7 Yes 66
3 1 4 8 Yes 66
4 1 5 9 Yes 120
5 1 6 10 No 141
6 1 7 11 No 52
7 1 8 12 Yes 96
8 1 9 13 Yes 120
9 1 10 14 Yes 141
10 1 11 15 No 52
11 2 1 4 Yes 96
12 2 2 5 No 120
13 2 3 5 No 120 # <- 这里
14 2 4 5 No 120 # <- 这里
15 3 1 8 No 15
16 3 2 9 Yes 123
17 3 3 10 Yes 128
18 3 4 11 No 66
19 3 5 12 No 120
20 3 6 13 Yes 141
21 4 1 8 Yes 141
22 4 2 8 Yes 141 # <- 这里
23 4 3 8 Yes 141 # <- 这里
24 4 4 8 Yes 141 # <- 这里
25 5 1 7 No 123
26 5 2 7 Yes 128
27 5 3 9 No 66
28 5 4 10 No 123
29 5 5 11 Yes 128
30 5 6 12 Yes 66
31 5 7 13 Yes 120
32 5 8 14 Yes 141
33 5 9 15 No 52
详细信息:
>>> df1
id cycle Salary Children Days
12 2 3 5 No 120
12 2 4 5 No 120
19 4 2 8 Yes 141
19 4 3 8 Yes 141
19 4 4 8 Yes 141
英文:
Create a dataframe with repeated rows then concatenate to the original one and finally sort it:
import numpy as np
cycle_threshold = 3
def fix_cycle(df):
lgt = len(df)
if lgt < cycle_threshold:
df1 = df.reindex(df.index[-1].repeat(cycle_threshold - lgt + 1))
df1['cycle'] += np.arange(len(df1)) + 1
return df1
df1 = df.groupby('id').apply(fix_cycle).droplevel(0)
out = pd.concat([df, df1]).sort_index(kind='stable', ignore_index=True)
Output:
>>> out
id cycle Salary Children Days
0 1 1 7 No 123
1 1 2 7 Yes 128
2 1 3 7 Yes 66
3 1 4 8 Yes 66
4 1 5 9 Yes 120
5 1 6 10 No 141
6 1 7 11 No 52
7 1 8 12 Yes 96
8 1 9 13 Yes 120
9 1 10 14 Yes 141
10 1 11 15 No 52
11 2 1 4 Yes 96
12 2 2 5 No 120
13 2 3 5 No 120 # <- HERE
14 2 4 5 No 120 # <- HERE
15 3 1 8 No 15
16 3 2 9 Yes 123
17 3 3 10 Yes 128
18 3 4 11 No 66
19 3 5 12 No 120
20 3 6 13 Yes 141
21 4 1 8 Yes 141
22 4 2 8 Yes 141 # <- HERE
23 4 3 8 Yes 141 # <- HERE
24 4 4 8 Yes 141 # <- HERE
25 5 1 7 No 123
26 5 2 7 Yes 128
27 5 3 9 No 66
28 5 4 10 No 123
29 5 5 11 Yes 128
30 5 6 12 Yes 66
31 5 7 13 Yes 120
32 5 8 14 Yes 141
33 5 9 15 No 52
Details:
>>> df1
id cycle Salary Children Days
12 2 3 5 No 120
12 2 4 5 No 120
19 4 2 8 Yes 141
19 4 3 8 Yes 141
19 4 4 8 Yes 141
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论