June 12, 2023 14:50:06

Find rows whose value in one column is duplicated across designated groups of another column

Question

For the DataFrame df, I want to group by the two categories foo and bar in column B and flag the rows whose value in column A appears in both groups. How can I achieve this?

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'foo']})
df = df.sort_values('B')
df
Out[15]: 
   A    B
1  2  bar
3  3  bar
0  1  foo
2  2  foo
4  3  foo
5  1  foo

The expected result:

   A    B  Indicator
1  2  bar  True  # value 2 also present in foo, so returns True
3  3  bar  True  # value 3 also present in foo, so returns True
0  1  foo  False  # value 1 only present in foo, so returns False
2  2  foo  True  # value 2 also present in bar, so returns True
4  3  foo  True  # value 3 also present in bar, so returns True
5  1  foo  False  # value 1 only present in foo, so returns False

Updates:

Assuming column B has more than 2 categories, the sample data df is as follows:

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})
df = df.sort_values('B')
df
Out[30]: 
   A    B
1  2  bar
3  3  bar
5  2  baz
6  1  baz
0  1  foo
2  2  foo
4  3  foo

In this case, the expected result would be as follows:

   A    B  Indicator
1  2  bar  True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
3  3  bar  False  # The value 3 only occurs in categories bar and foo, so returns False.
5  2  baz  True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
6  1  baz  False  # The value 1 only occurs in categories baz and foo, so returns False.
0  1  foo  False  # The value 1 only occurs in categories baz and foo, so returns False.
2  2  foo  True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
4  3  foo  False  # The value 3 only occurs in categories bar and foo, so returns False.

Answer 1

Score: 4

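Since column B can contain more than two groups, count the distinct B categories for each A value and compare that count with the total number of distinct categories: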

data = {'A': [2, 3, 2, 1, 1, 2, 3],
        'B': ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo']}
df = pd.DataFrame(data).sort_values('B')

df['Indicator'] = df.groupby('A')['B'].transform('nunique') == df['B'].nunique()

Output:

>>> df
   A    B  Indicator
0  2  bar       True
1  3  bar      False
2  2  baz       True
3  1  baz      False
4  1  foo      False
5  2  foo       True
6  3  foo      False
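
To make the one-liner above easier to follow, here is a minimal sketch that splits it into its two parts, using the sample data from the updated question. The intermediate names `groups_per_value` and `total_groups` are illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data from the updated question (three categories in column B).
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})
df = df.sort_values('B')

# For every row: the number of distinct B categories in which its A value occurs.
groups_per_value = df.groupby('A')['B'].transform('nunique')

# The number of distinct B categories in the whole frame.
total_groups = df['B'].nunique()

# A row is flagged only when its A value occurs in every category.
df['Indicator'] = groups_per_value == total_groups
print(df)
```

Because `transform` returns a result aligned with the original index, the comparison can be assigned straight back to `df` without a merge.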

Answer 2

Score: 1

If you need the A values that occur in every B group, use one of the following approaches.

The first idea is to use pd.crosstab to find the A values that exist in every group, then flag the matching rows with Series.isin:

df1 = pd.crosstab(df.A, df.B).astype(bool)

df['Indicator'] = df['A'].isin(df1.index[df1.all(axis=1)])
print(df)
   A    B  Indicator
1  2  bar       True
3  3  bar       True
0  1  foo      False
2  2  foo       True
4  3  foo       True
5  1  foo      False
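
The question title mentions designated groups, so the crosstab idea can be narrowed to a chosen subset of categories. The sketch below is one possible way to do that; the helper `flag_values_in_all_groups` and its `groups` parameter are hypothetical names introduced here, not part of the original answer or of pandas.

```python
import pandas as pd

def flag_values_in_all_groups(df, value_col, group_col, groups=None):
    """Hypothetical helper: True where the row's value occurs in every requested group."""
    # Presence table: one row per value, one column per group.
    presence = pd.crosstab(df[value_col], df[group_col]).astype(bool)
    if groups is not None:
        # Keep only the requested groups; a requested group that is absent
        # from the data becomes an all-False column, so nothing is flagged.
        presence = presence.reindex(columns=list(groups), fill_value=False)
    common = presence.index[presence.all(axis=1)]
    return df[value_col].isin(common)

# Sample data from the updated question.
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})

in_all = flag_values_in_all_groups(df, 'A', 'B')
in_foo_and_bar = flag_values_in_all_groups(df, 'A', 'B', groups=['foo', 'bar'])
print(df.assign(in_all=in_all, in_foo_and_bar=in_foo_and_bar))
```

Note that with `groups=['foo', 'bar']` a row in baz is still flagged when its A value occurs in both foo and bar; filter on column B afterwards if only rows from the designated groups should be marked.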

Alternatively, for the last DataFrame in the question, build the set of A values for each B group and take the intersection of those sets:

setlist = df.groupby('B')['A'].agg(set)
print(setlist)
B
bar       {2, 3}
baz       {1, 2}
foo    {1, 2, 3}
Name: A, dtype: object

u = set.intersection(*setlist)
print(u)
{2}

df['Indicator'] = df['A'].isin(u)
print(df)
   A    B  Indicator
1  2  bar       True
3  3  bar      False
5  2  baz       True
6  1  baz      False
0  1  foo      False
2  2  foo       True
4  3  foo      False
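
As a quick sanity check, the sketch below runs the three approaches from both answers on the last DataFrame in the question and confirms that they flag the same rows. The variable names (`by_nunique`, `by_crosstab`, `by_sets`) are illustrative only.

```python
import pandas as pd

# Last DataFrame from the question.
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})

# Approach 1: distinct categories per value vs. total distinct categories.
by_nunique = df.groupby('A')['B'].transform('nunique') == df['B'].nunique()

# Approach 2: boolean presence table, keep values present in every column.
presence = pd.crosstab(df['A'], df['B']).astype(bool)
by_crosstab = df['A'].isin(presence.index[presence.all(axis=1)])

# Approach 3: intersect the per-group sets of A values.
common = set.intersection(*df.groupby('B')['A'].agg(set))
by_sets = df['A'].isin(common)

# All three should mark exactly the same rows.
assert by_nunique.tolist() == by_crosstab.tolist() == by_sets.tolist()
print(common)
```

All three rely on the same definition: a value counts only if it occurs in every category of B. The `nunique` version avoids building an intermediate table, while the set version makes the shared values themselves available for inspection.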
