For each group in a pandas DataFrame, return the most common value if it shows up more than `x%` of the time
Question
Given a pandas DataFrame, I would like to return, for each group of a groupby, the most common value of a column (string dtype) if that value appears in more than n% of the group's rows, and 'NA' otherwise.
Answer 1
Score: 3
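To test the most common value against an absolute count:
import numpy as np
N = 5
def f(x):
    # value_counts sorts in descending order, so position 0 is the most common value
    y = x.value_counts()
    return y.index[0] if y.iat[0] > N else np.nan
df = df.groupby('g')['col'].agg(f).reset_index(name='new')
Or against a percentage of the rows in each group:
n = 50
def f(x):
    # normalize=True returns shares; scale to percent before comparing with n
    y = x.value_counts(normalize=True) * 100
    return y.index[0] if y.iat[0] > n else np.nan
df = df.groupby('g')['col'].agg(f).reset_index(name='new')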
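The question asks for the string 'NA' rather than a missing value, and a group whose values are all missing would leave `value_counts` empty, so indexing its first entry would fail. A minimal sketch of the percentage variant that covers both cases follows. The helper name `most_common_or_na`, its `min_pct` parameter, and the sample data are hypothetical; the column names `g` and `col` are taken from the answer above.

```python
import pandas as pd


def most_common_or_na(values: pd.Series, min_pct: float) -> str:
    """Return the most common value if its share exceeds min_pct percent, else 'NA'."""
    # value_counts drops missing values by default and sorts in descending order.
    shares = values.value_counts(normalize=True) * 100
    if shares.empty or shares.iat[0] <= min_pct:
        return "NA"
    return shares.index[0]


# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    "g": ["A"] * 6 + ["B"] * 6,
    "col": list("aabbcd") + list("eeeeff"),
})

result = (
    sample.groupby("g")["col"]
    .agg(lambda s: most_common_or_na(s, min_pct=50))
    .reset_index(name="new")
)
print(result)
```

Because `value_counts` ignores missing values by default, the share here is computed over the non-missing rows of each group.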
Answer 2
Score: 1
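One option is to use value_counts, then groupby:
import pandas as pd
df = pd.DataFrame({'group': list('AAAAAABBBBBB'), 'value': list('aabbcdeeeeff')})
thresh = 3
# count each (group, value) pair, keep counts above thresh, then take the top value per group
out = (df[['group', 'value']].value_counts()
         .loc[lambda x: x > thresh]
         .groupby(level='group').idxmax().tolist()
      )
Example output:
[('B', 'e')]
Group A is absent because none of its values occurs more than 3 times.
With percentages:
thresh = 30
# normalize within each group so the values being compared are per-group shares, not raw counts
out = (df.groupby('group')['value'].value_counts(normalize=True)
         .loc[lambda x: x > thresh/100]
         .groupby(level='group').idxmax().tolist()
      )
Output:
[('A', 'a'), ('B', 'e')]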
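This approach leaves out any group where no value clears the threshold, whereas the question asks for 'NA' in that case. One possible way to get one entry per group is to reindex the filtered result over all group labels, as sketched below. The names `thresh_pct`, `shares`, and `winners` are hypothetical, and the 50% threshold is chosen only so that one group fails the test.

```python
import pandas as pd

df = pd.DataFrame({"group": list("AAAAAABBBBBB"), "value": list("aabbcdeeeeff")})
thresh_pct = 50  # hypothetical threshold, in percent

# Share of each value within its own group (MultiIndex levels: group, value).
shares = df.groupby("group")["value"].value_counts(normalize=True)

# Keep the shares above the threshold, then pick the top value per group.
winners = (
    shares[shares > thresh_pct / 100]
    .groupby(level="group")
    .idxmax()                   # index labels are (group, value) tuples
    .map(lambda pair: pair[1])  # keep only the value part
)

# Bring back groups that had no qualifying value and mark them as 'NA'.
out = winners.reindex(df["group"].unique()).fillna("NA")
print(out.to_dict())
```

With this sample data, no value in group A exceeds 50% of the rows, so A maps to 'NA', while B maps to 'e'.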