March 8, 2023, 16:15:07

For each group in Pandas dataframe, return the most common value if it shows up more than `x%` of the time

Question

Given a pandas DataFrame, I would like to return, for each group, the most common value of a column (string datatype) if that value shows up in more than n% of the group's rows, and 'NA' otherwise.

Answer 1

Score: 3

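If you need to test the most common value against an absolute count (here g is the grouping column and col is the string column):

import numpy as np

N = 5
def f(x):
    # value_counts() sorts in descending order, so the first entry is the most common value
    y = x.value_counts()
    return y.index[0] if y.iat[0] > N else np.nan
df = df.groupby('g')['col'].agg(f).reset_index(name='new')

Or against a percentage of the rows in each group:

n = 50
def f(x):
    # normalize=True returns fractions of the group; scale them to percentages
    y = x.value_counts(normalize=True) * 100
    return y.index[0] if y.iat[0] > n else np.nan
df = df.groupby('g')['col'].agg(f).reset_index(name='new')

Replace np.nan with 'NA' if you want the literal string from the question instead of a missing value.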
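To make the percentage variant above concrete, here is a minimal runnable sketch. The sample data, the function name `most_common_if_frequent`, and its `min_pct` and `fill` parameters are assumptions made for illustration; they do not come from the original answer. The sketch also returns the fill value for a group that holds only missing values, a case the original function would fail on with an IndexError.

```python
import pandas as pd

# Hypothetical sample data: 'g' is the grouping column, 'col' the string column.
df = pd.DataFrame({
    "g": list("AAAAABBBB"),
    "col": ["x", "x", "x", "y", "z", "p", "p", "q", "q"],
})

def most_common_if_frequent(values, min_pct=50, fill="NA"):
    """Return the most common value if its share of the group exceeds min_pct percent."""
    shares = values.value_counts(normalize=True) * 100  # sorted in descending order
    if shares.empty:  # the group contained only missing values
        return fill
    return shares.index[0] if shares.iat[0] > min_pct else fill

result = df.groupby("g")["col"].agg(most_common_if_frequent).reset_index(name="new")
print(result)
```

In group A, 'x' covers 3 of 5 rows (60%), so it is returned. In group B, the top value covers exactly 50%, which is not strictly greater than the threshold, so the group gets 'NA'.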

Answer 2

Score: 1

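One option is to count each (group, value) pair with value_counts, filter by the threshold, and then pick the most frequent remaining value per group with groupby:

import pandas as pd

df = pd.DataFrame({'group': list('AAAAAABBBBBB'), 'value': list('aabbcdeeeeff')})
thresh = 3
out = (df[['group', 'value']].value_counts()
       .loc[lambda x: x > thresh]   # keep pairs seen more than thresh times
       .groupby(level='group').idxmax().tolist()
       )

Example output:

[('B', 'e')]

With percentages, normalize the counts within each group first so that they can be compared with a fraction:

thresh = 30
out = (df.groupby('group')['value'].value_counts(normalize=True)
       .loc[lambda x: x > thresh/100]   # keep values above thresh percent of their group
       .groupby(level='group').idxmax().tolist()
       )

Output:

[('A', 'a'), ('B', 'e')]

Groups in which no value clears the threshold are absent from the result (as group A is in the first example) rather than reported as 'NA'.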
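If every group should appear in the result, with 'NA' where no value is frequent enough, the same building blocks can be rearranged. The following is only a sketch under that assumption; the names `shares`, `top_label`, and `top_share` and the 50 percent threshold are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"group": list("AAAAAABBBBBB"), "value": list("aabbcdeeeeff")})
thresh = 50  # percent of the rows in each group

# Share of each value within its own group, indexed by (group, value).
shares = df.groupby("group")["value"].value_counts(normalize=True)

# Label and share of the most common value in each group.
top_label = shares.groupby(level="group").idxmax().map(lambda pair: pair[1])
top_share = shares.groupby(level="group").max()

# Keep the label only where its share exceeds the threshold.
out = top_label.where(top_share > thresh / 100, "NA")
print(out)
```

Here the top values in group A each cover only a third of the rows, so A maps to 'NA', while 'e' covers two thirds of group B and is kept.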
