2023年5月25日 12:21:15go评论73阅读模式

英文:

Python group by columns and make sure values in group doesn't skip values in order of another dataframe

问题

你可以使用以下代码来实现你的需求：

import pandas as pd

# 创建数据框
df1 = pd.DataFrame({
    'Group1': ['G1','G1','G1','G1','G1','G2','G2','G2','G2','G2','G2'],
    'Group2': ['A1','A1','A1','A2','A2','A1','A1','A2','A2','A2','A2'],
    'Label': ['AA','BB','CC','AA','CC','BB','DD','AA','CC','DD','BB']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4, 5, 7],
    'Label_ref': ['AA','BB','CC','DD','EE']
})

# 将df2的Label_ref转换为列表
label_ref_list = df2['Label_ref'].tolist()

# 定义一个函数来检查Label是否符合要求
def check_label_order(labels):
    index = 0
    for label in labels:
        if label == label_ref_list[index]:
            index += 1
            if index == len(label_ref_list):
                return 1
    return 0

# 使用groupby和apply来应用函数并创建Flag列
df1['Flag'] = df1.groupby(['Group1', 'Group2'])['Label'].transform(check_label_order)

# 打印结果
print(df1)

这将给你期望的输出，包括Flag列，用于指示是否满足条件。

英文:

I have 2 dataframes with the following structures:

df1
Group1 Group2 Label
G1     A1    AA
G1     A1    BB
G1     A1    CC
G1     A2    AA
G1     A2    CC
G2     A1    BB
G2     A1    DD
G2     A2    AA
G2     A2    CC
G2     A2    DD
G2     A2    BB

df2
ID  Label_ref
1     AA
2     BB
4     CC
5     DD
7     EE

I want to group the df1 based on the Group1 and Group2 columns and check if the 'Label' column contains values from df2 Label_ref in order of ID.

Label on df1 doesn't need to have all values from Label_ref on df2, but the values Label on df1 can't skip any Label_ref values in the order of ID

Expected output:

The group Group1=G1, Group2=A1 doesn't skip any values from AA - CC. Therefore the rows corresponding to the group are flagged.

The group Group1=G1, Group2=A2 skips values from BB but has the value CC. Therefore the rows corresponding to the group are not flagged.

The group Group1=G2, Group2=A2 doesn't skip any values from AA - DD although they are not in order. Therefore the rows corresponding to the group are flagged.

Group1 Group2 Label  Flag
G1     A1    AA     1
G1     A1    BB     1
G1     A1    CC     1
G1     A2    AA     0
G1     A2    CC     0
G2     A1    BB     0
G2     A1    DD     0
G2     A2    AA     1
G2     A2    CC     1
G2     A2    DD     1
G2     A2    BB     1

I haven't been able to make much progress:

import pandas as pd
df1 = pd.DataFrame({
                &#39;Group1&#39;: [ &#39;G1&#39;,&#39;G1&#39;, &#39;G1&#39;,&#39;G1&#39;,&#39;G1&#39;,
                            &#39;G2&#39;,&#39;G2&#39;, &#39;G2&#39;,&#39;G2&#39;,&#39;G2&#39;,&#39;G2&#39;],
                &#39;Group2&#39;: [&#39;A1&#39;,&#39;A1&#39;,&#39;A1&#39;,&#39;A2&#39;,&#39;A2&#39;,
                            &#39;A1&#39;,&#39;A1&#39;,&#39;A2&#39;,&#39;A2&#39;,&#39;A2&#39;,&#39;A2&#39;],
                &#39;Label&#39;: [&#39;AA&#39;,&#39;BB&#39;,&#39;CC&#39;,&#39;AA&#39;,&#39;CC&#39;,&#39;BB&#39;,
                            &#39;DD&#39;,&#39;AA&#39;,&#39;CC&#39;,&#39;DD&#39;,&#39;BB&#39;]})
df2 = pd.DataFrame({
                &#39;ID&#39;: [ 1, 2, 4, 5, 7],
                &#39;Label_ref&#39;: [&#39;AA&#39;,&#39;BB&#39;,&#39;CC&#39;,&#39;DD&#39;,&#39;EE&#39;]})

A link to a solution or a function/method I can use to achieve this is appreciated

答案1

得分: 2

你可以使用一个映射系列：

# 创建映射系列
smap = pd.Series(*pd.factorize(df2.sort_values('ID')['Label_ref']))
df1['Flag'] = df1['Label'].map(smap)

# 检查每个组是否连续
is_consecutive = lambda x: (~x.diff().gt(1).any()).astype(int)
df1['Flag'] = df1.sort_values('Flag').groupby(['Group1', 'Group2'])['Flag'].transform(is_consecutive)

输出：

>>> df1
   Group1 Group2 Label  Flag
0      G1     A1    AA     1
1      G1     A1    BB     1
2      G1     A1    CC     1
3      G1     A2    AA     0
4      G1     A2    CC     0
5      G2     A1    BB     0
6      G2     A1    DD     0
7      G2     A2    AA     1
8      G2     A2    CC     1
9      G2     A2    DD     1
10     G2     A2    BB     1

英文:

You can use a mapping series:

# Create mapping series
smap = pd.Series(*pd.factorize(df2.sort_values(&#39;ID&#39;)[&#39;Label_ref&#39;]))
df1[&#39;Flag&#39;] = df1[&#39;Label&#39;].map(smap)

# Check if flags are consecutive for each group
is_consecutive = lambda x: (~x.diff().gt(1).any()).astype(int)
df1[&#39;Flag&#39;] = df1.sort_values(&#39;Flag&#39;).groupby([&#39;Group1&#39;, &#39;Group2&#39;])[&#39;Flag&#39;].transform(is_consecutive)

Output:

&gt;&gt;&gt; df1
   Group1 Group2 Label  Flag
0      G1     A1    AA     1
1      G1     A1    BB     1
2      G1     A1    CC     1
3      G1     A2    AA     0
4      G1     A2    CC     0
5      G2     A1    BB     0
6      G2     A1    DD     0
7      G2     A2    AA     1
8      G2     A2    CC     1
9      G2     A2    DD     1
10     G2     A2    BB     1

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Python按列分组，并确保分组中的值不会跳过另一个数据帧的顺序。

问题

答案1

从一个包含 n 个数字的列表中选择样本，不重复。

Interactive Brokers TWS API：通过app.reqMktData获取期货价格的问题

如何用一行代码创建一个循环列表？

如何在AWS QuickSight中安排或自动化数据集刷新。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论