英文:
Python group by columns and make sure values in group doesn't skip values in order of another dataframe
问题
你可以使用以下代码来实现你的需求:
import pandas as pd
# 创建数据框
df1 = pd.DataFrame({
'Group1': ['G1','G1','G1','G1','G1','G2','G2','G2','G2','G2','G2'],
'Group2': ['A1','A1','A1','A2','A2','A1','A1','A2','A2','A2','A2'],
'Label': ['AA','BB','CC','AA','CC','BB','DD','AA','CC','DD','BB']
})
df2 = pd.DataFrame({
'ID': [1, 2, 4, 5, 7],
'Label_ref': ['AA','BB','CC','DD','EE']
})
# 将df2的Label_ref转换为列表
label_ref_list = df2['Label_ref'].tolist()
# 定义一个函数来检查Label是否符合要求
def check_label_order(labels):
index = 0
for label in labels:
if label == label_ref_list[index]:
index += 1
if index == len(label_ref_list):
return 1
return 0
# 使用groupby和apply来应用函数并创建Flag列
df1['Flag'] = df1.groupby(['Group1', 'Group2'])['Label'].transform(check_label_order)
# 打印结果
print(df1)
这将给你期望的输出,包括Flag列,用于指示是否满足条件。
英文:
I have 2 dataframes with the following structures:
df1
Group1 Group2 Label
G1 A1 AA
G1 A1 BB
G1 A1 CC
G1 A2 AA
G1 A2 CC
G2 A1 BB
G2 A1 DD
G2 A2 AA
G2 A2 CC
G2 A2 DD
G2 A2 BB
df2
ID Label_ref
1 AA
2 BB
4 CC
5 DD
7 EE
I want to group the df1
based on the Group1
and Group2
columns and check if the 'Label' column contains values from df2
Label_ref
in order of ID
.
Label
on df1
doesn't need to have all values from Label_ref
on df2
, but the values Label
on df1
can't skip any Label_ref
values in the order of ID
Expected output:
The group Group1=G1
, Group2=A1
doesn't skip any values from AA
- CC
. Therefore the rows corresponding to the group are flagged.
The group Group1=G1
, Group2=A2
skips values from BB
but has the value CC
. Therefore the rows corresponding to the group are not flagged.
The group Group1=G2
, Group2=A2
doesn't skip any values from AA
- DD
although they are not in order. Therefore the rows corresponding to the group are flagged.
Group1 Group2 Label Flag
G1 A1 AA 1
G1 A1 BB 1
G1 A1 CC 1
G1 A2 AA 0
G1 A2 CC 0
G2 A1 BB 0
G2 A1 DD 0
G2 A2 AA 1
G2 A2 CC 1
G2 A2 DD 1
G2 A2 BB 1
I haven't been able to make much progress:
import pandas as pd
df1 = pd.DataFrame({
'Group1': [ 'G1','G1', 'G1','G1','G1',
'G2','G2', 'G2','G2','G2','G2'],
'Group2': ['A1','A1','A1','A2','A2',
'A1','A1','A2','A2','A2','A2'],
'Label': ['AA','BB','CC','AA','CC','BB',
'DD','AA','CC','DD','BB']})
df2 = pd.DataFrame({
'ID': [ 1, 2, 4, 5, 7],
'Label_ref': ['AA','BB','CC','DD','EE']})
A link to a solution or a function/method I can use to achieve this is appreciated
答案1
得分: 2
你可以使用一个映射系列:
# 创建映射系列
smap = pd.Series(*pd.factorize(df2.sort_values('ID')['Label_ref']))
df1['Flag'] = df1['Label'].map(smap)
# 检查每个组是否连续
is_consecutive = lambda x: (~x.diff().gt(1).any()).astype(int)
df1['Flag'] = df1.sort_values('Flag').groupby(['Group1', 'Group2'])['Flag'].transform(is_consecutive)
输出:
>>> df1
Group1 Group2 Label Flag
0 G1 A1 AA 1
1 G1 A1 BB 1
2 G1 A1 CC 1
3 G1 A2 AA 0
4 G1 A2 CC 0
5 G2 A1 BB 0
6 G2 A1 DD 0
7 G2 A2 AA 1
8 G2 A2 CC 1
9 G2 A2 DD 1
10 G2 A2 BB 1
英文:
You can use a mapping series:
# Create mapping series
smap = pd.Series(*pd.factorize(df2.sort_values('ID')['Label_ref']))
df1['Flag'] = df1['Label'].map(smap)
# Check if flags are consecutive for each group
is_consecutive = lambda x: (~x.diff().gt(1).any()).astype(int)
df1['Flag'] = df1.sort_values('Flag').groupby(['Group1', 'Group2'])['Flag'].transform(is_consecutive)
Output:
>>> df1
Group1 Group2 Label Flag
0 G1 A1 AA 1
1 G1 A1 BB 1
2 G1 A1 CC 1
3 G1 A2 AA 0
4 G1 A2 CC 0
5 G2 A1 BB 0
6 G2 A1 DD 0
7 G2 A2 AA 1
8 G2 A2 CC 1
9 G2 A2 DD 1
10 G2 A2 BB 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论