April 13, 2023, 21:53:17

How to drop duplicate rows using value_counts and also using a condition that uses the actual value in a column using pandas?

Question

I have the following dataframe. I want to group by a first. Within each group, I need to do a value count based on c and only pick the one with most counts if the value in c is not EMP. If the value in c is EMP, then I want to pick the one with the second most counts. If there is no other value than EMP, then it should be EMP as in the case where a = 4.

a        c
1        EMP
1        y
1        y
1        z
2        z
2        z
2        EMP
2        z
2        a
2        a
3        EMP
3        EMP
3        k
4        EMP
4        EMP
4        EMP

The expected result would be

a        c
1        y
2        z
3        k
4        EMP
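
For anyone who wants to try the answers below, the sample data and the expected result shown above can be rebuilt roughly as follows. This is only a convenience sketch; the variable names `df` and `expected` are assumptions chosen to match the snippets in this thread.

```python
import pandas as pd

# Sample frame from the question: group key 'a' and value column 'c'
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "c": ["EMP", "y", "y", "z", "z", "z", "EMP", "z", "a", "a",
          "EMP", "EMP", "k", "EMP", "EMP", "EMP"],
})

# Expected result from the question: one row per value of 'a'
expected = pd.DataFrame({"a": [1, 2, 3, 4], "c": ["y", "z", "k", "EMP"]})

# Per-group counts, handy for checking an answer by eye
print(df.groupby("a")["c"].value_counts())
```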

This is what I have tried so far: I have managed to sort it in the order I need so I could take the first row. However, I cannot figure out how to implement the condition for EMP using a lambda function with the drop_duplicates function as there is only the keep=first or keep=last option.

df = df.iloc[df.groupby(['a', 'c']).c.transform('size').mul(-1).argsort(kind='mergesort')]

Edit:

The mode solution worked, I have an additional question. My dataframe contains about 50 more columns, and I would like to have all of these columns in the end result as well, with the values corresponding to the rows picked using the mode operation and the first value that occurred for the EMP rows. How would I do that? Is there an easier solution than what is mentioned here where you create a dict of functions and pass it to agg? Using SQL for this is also fine.
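
One possible way to keep every other column, sketched here under assumptions rather than taken from the answers: rank the rows so that non-EMP values come first and higher counts come earlier, then keep the first row of each group with drop_duplicates. The column `extra` is a hypothetical stand-in for the roughly 50 additional columns, and `_is_emp`, `_count` and `picked` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "c": ["EMP", "y", "y", "z", "z", "z", "EMP", "z", "a", "a",
          "EMP", "EMP", "k", "EMP", "EMP", "EMP"],
})
df["extra"] = range(len(df))  # hypothetical stand-in for the other columns

# Helper columns: is the row EMP, and how often does its (a, c) pair occur?
ranked = df.assign(
    _is_emp=df["c"].eq("EMP"),
    _count=df.groupby(["a", "c"])["c"].transform("size"),
)

picked = (
    ranked
    # within each 'a': non-EMP rows first, then the most frequent value first
    .sort_values(["a", "_is_emp", "_count"],
                 ascending=[True, True, False], kind="stable")
    .drop_duplicates("a")                  # keep the first row per group
    .drop(columns=["_is_emp", "_count"])   # remove the helper columns
    .reset_index(drop=True)
)
print(picked)
```

Because whole rows are kept, all remaining columns come from the first occurrence of the winning value in each group. If two non-EMP values are tied on count, this sketch keeps the one that appears first in the frame, which may or may not be the tie-break you want.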

Answer 1

Score: 1

One option could be to remove the EMP, get the mode per group, then reindex on unique a filling with 'EMP':

out = (df[df['c'].ne('EMP')]
       .groupby('a', sort=False)['c']
       .apply(lambda g: g.mode()[0])
       .reindex(df['a'].unique(), fill_value='EMP')
       .reset_index()
      )
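
A note on ties with this approach: Series.mode returns every tied value in sorted order, so `mode()[0]` picks the smallest of them rather than the first one seen. A minimal illustration with made-up data (the names `tied`, `modes` and `first_mode` are only for this sketch):

```python
import pandas as pd

tied = pd.Series(["z", "y", "y", "z"])  # made-up data: 'y' and 'z' both appear twice

modes = tied.mode()    # all tied modes, returned in sorted order
first_mode = modes[0]  # what the lambda in the snippet above would return

print(modes.tolist(), first_mode)
```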

Similar approach using a custom function:

def cust_mode(s):
    counts = s.value_counts(sort=False)
    if 'EMP' in counts.index:  # make the EMP count -1
        counts['EMP'] = -1
    return counts.idxmax()

out = df.groupby('a', as_index=False)['c'].agg(cust_mode)

Another option is to sort the rows by their value_counts while moving the EMP rows to the top (using numpy.lexsort), then take the last value per group (groupby.last):

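import numpy as np

# Per-row count of each (a, c) pair, in the same row order as df
count = df.merge(df.value_counts().reset_index(name='count'))['count']
# lexsort treats the last key as the primary one: EMP rows first, then ascending count
out = (df.iloc[np.lexsort([count, df['c'].ne('EMP')])]
         .groupby('a', as_index=False).last()
       )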
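
The key order passed to numpy.lexsort is easy to get backwards, since the last key in the list is the primary sort key. A small standalone sketch with made-up arrays (`secondary`, `primary` and `order` are illustrative names, not part of the answer):

```python
import numpy as np

secondary = np.array([3, 1, 2, 1])                # plays the role of the counts
primary = np.array([True, False, True, False])   # plays the role of c != 'EMP'

# Sort by `primary` first (False before True), then by `secondary` within ties
order = np.lexsort([secondary, primary])
print(order)
```

Rows where the primary key is False come first, which is why the EMP rows end up at the top and groupby.last then sees the most frequent non-EMP value last.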

Answer 2

Score: 1

This should work as well:

(df.value_counts()                      # counts per (a, c), sorted descending
 .reset_index(name='count')
 # stable sort so the descending count order survives moving EMP to the end
 .sort_values('c', key=lambda x: x.eq('EMP'), kind='stable')
 .groupby('a')[['a', 'c']].head(1)
 .sort_values('a'))

Answer 3

Score: 1

Another possible solution:

import pandas as pd

# name='count' keeps the column name the same across pandas versions
out = df.value_counts(['a', 'c']).reset_index(name='count')
pd.concat([
    # groups whose only value is EMP
    out.loc[~out.duplicated(['a'], keep=False) & out['c'].eq('EMP')],
    # most frequent non-EMP value of every other group
    out.loc[out['c'].ne('EMP')].sort_values('count', ascending=False)
    .drop_duplicates('a')])[['a', 'c']]
