For each group in Pandas dataframe, return the most common value if it shows up more than `x%` of the time

Question

Given a pandas DataFrame, I would like to return the most common value of a column (string datatype) for each group, provided that value shows up in more than n% of the group's rows; otherwise return 'NA'.

Answer 1

Score: 3

If you need to test the most common value against an absolute count:

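import numpy as np

N = 5  # count that the most common value must exceed

def f(x):
    # value_counts() sorts in descending order, so the first entry is the most common value
    y = x.value_counts()
    return y.index[0] if y.iat[0] > N else np.nan

df = df.groupby('g')['col'].agg(f).reset_index(name='new')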

Or by percentages:

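n = 50  # percentage threshold

def f(x):
    # normalize=True returns each value's share of the group; scale it to percent
    y = x.value_counts(normalize=True) * 100
    return y.index[0] if y.iat[0] > n else np.nan

df = df.groupby('g')['col'].agg(f).reset_index(name='new')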
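The percentage variant can be sketched end to end as follows. This is a minimal sketch, not a definitive implementation: the sample data and the helper name `most_common_or_na` are assumptions made for illustration, and the function returns the string 'NA' (as the question asks) instead of `np.nan`. It also guards against groups whose values are all missing, where `value_counts()` would be empty.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from the question).
df = pd.DataFrame({
    "g": list("AAAAAABBBBBB"),
    "col": list("aabbcdeeeeff"),
})

def most_common_or_na(s: pd.Series, pct: float = 50) -> str:
    """Return the most common value of s if it covers more than pct% of rows, else 'NA'."""
    # Shares in percent, sorted descending; missing values are excluded by default.
    shares = s.value_counts(normalize=True) * 100
    if shares.empty or shares.iat[0] <= pct:
        return "NA"
    return shares.index[0]

out = df.groupby("g")["col"].agg(most_common_or_na).reset_index(name="new")
print(out)
```

Here group A has no value above 50% (its top share is 2 of 6 rows), so it gets 'NA', while group B gets 'e' (4 of 6 rows). Because `value_counts()` drops missing values by default, the share is computed over the non-null rows of each group.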

Answer 2

Score: 1

One option using value_counts, then groupby:

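import pandas as pd

df = pd.DataFrame({'group': list('AAAAAABBBBBB'), 'value': list('aabbcdeeeeff')})

thresh = 3  # count that a value must exceed
out = (df[['group', 'value']].value_counts()       # count of each (group, value) pair
       .loc[lambda x: x > thresh]                  # keep pairs above the threshold
       .groupby(level='group').idxmax().tolist()   # most common remaining pair per group
       )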

Example output:

[('B', 'e')]

With percentages:

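thresh = 30  # percent
# normalize=True on the grouped column gives each value's share within its own group
out = (df.groupby('group')['value'].value_counts(normalize=True)
       .loc[lambda x: x > thresh/100]
       .groupby(level='group').idxmax().tolist()
       )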

Output:

[('A', 'a'), ('B', 'e')]
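Note that this approach drops any group with no value above the threshold. If every group should appear in the result, with 'NA' for the ones that do not qualify, something like the following might work. It is a sketch under assumptions: the 50% threshold and the variable names `shares`, `top`, and `result` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"group": list("AAAAAABBBBBB"), "value": list("aabbcdeeeeff")})

thresh = 50  # percent (illustrative choice)

# Share of each value within its own group.
shares = df.groupby("group")["value"].value_counts(normalize=True)

# For each group that has a value above the threshold, idxmax returns
# the (group, value) index tuple of the largest share.
top = shares[shares > thresh / 100].groupby(level="group").idxmax()

# Keep only the value part of each tuple, then restore the groups that
# were filtered out, filling them with 'NA'.
result = top.map(lambda pair: pair[1]).reindex(df["group"].unique(), fill_value="NA")
print(result)
```

With this data, group A ends up as 'NA' (no value exceeds 50% of its rows) and group B as 'e'.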

huangapple
  • Posted on March 8, 2023, 16:15:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75670657.html