Group by and aggregate unique values in pandas dataframe
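Question

I have a dataframe with the following values:

```plaintext
col1 col2 col3
10002 en tea
10002 es te
10002 ru te
10003 en coffee
10003 de kaffee
10003 nl kaaffee
```

I would like to group by col1 and aggregate the values of col3 where col2 is anything other than 'en'. The expected output is:

```plaintext
col1 Term Name Synonyms
10002 tea te
10003 coffee kaffee | kaaffee
```

I am running the following code to achieve this:

```python
# group the data and concatenate the terms
grouped = df.groupby(['col1']) \
    .agg({'col3': lambda x: ' | '.join(x[df['col2'] != 'en'])}) \
    .reset_index()
# merge the original data with the grouped data
df_result = pd.merge(df[df['col2'] == 'en'], grouped, on=['col1'], how='left')
df_result = df_result[['col1', 'col3_x', 'col3_y']]
df_result.columns = ['col1', 'Term Name', 'Synonyms']
```

How can I get only the unique values of col3 per col1 (e.g. a single 'te')? Any help is highly appreciated.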
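
To make the snippets in this thread reproducible, the sample frame can be rebuilt as below. This is a minimal sketch: the question does not state the dtypes, so an integer `col1` and string `col2`/`col3` are assumptions.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: col1 is an integer column, col2 and col3 are strings.
df = pd.DataFrame({
    "col1": [10002, 10002, 10002, 10003, 10003, 10003],
    "col2": ["en", "es", "ru", "en", "de", "nl"],
    "col3": ["tea", "te", "te", "coffee", "kaffee", "kaaffee"],
})
print(df)
```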
# Answer 1
**Score**: 0

One approach is to select the non-`en` rows with a boolean mask, remove duplicate rows with [`DataFrame.drop_duplicates`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html), and then join the remaining values per group:
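```python
import pandas as pd

# True for every row that is not the English term
m = df['col2'] != 'en'
# drop repeated (col1, col3) pairs, then join the synonyms per col1
df1 = df[m].drop_duplicates(['col1', 'col3']).groupby('col1')['col3'].agg(' | '.join)
# merge the English terms with the joined synonyms
df = (df.loc[~m, ['col1', 'col3']].merge(df1, on='col1')
        .rename(columns={'col3_x': 'Term Name', 'col3_y': 'Synonyms'}))
print(df)
```

```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```

Or use `numpy.where` with `DataFrame.pivot_table`:

```python
import numpy as np

# start again from the original df, not the result assigned above
# dict.fromkeys drops duplicates while keeping first-seen order
f = lambda x: ' | '.join(dict.fromkeys(x))
df = (df.assign(new=np.where(df['col2'].eq('en'), 'Term Name', 'Synonyms'))
        .pivot_table(index='col1', columns='new', values='col3', aggfunc=f)
        [['Term Name', 'Synonyms']]
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
```

```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```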
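
For anyone who wants to run the idea end to end, here is a self-contained sketch on the question's sample data. It is a variation on the approaches above, not the answer's exact code: the helper name `join_unique`, the variable names, and the column dtypes are assumptions made for illustration, and it uses a left merge as in the question's own attempt.

```python
import pandas as pd


def join_unique(values, sep=" | "):
    """Join values with sep, dropping repeats but keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return sep.join(dict.fromkeys(values))


# Sample data from the question (dtypes are an assumption).
df = pd.DataFrame({
    "col1": [10002, 10002, 10002, 10003, 10003, 10003],
    "col2": ["en", "es", "ru", "en", "de", "nl"],
    "col3": ["tea", "te", "te", "coffee", "kaffee", "kaaffee"],
})

is_en = df["col2"].eq("en")

# English rows provide the term name.
terms = df.loc[is_en, ["col1", "col3"]].rename(columns={"col3": "Term Name"})

# Non-English rows are de-duplicated and joined per col1.
synonyms = (
    df.loc[~is_en]
      .groupby("col1")["col3"]
      .agg(join_unique)
      .rename("Synonyms")
      .reset_index()
)

# A left merge keeps terms that have no synonyms (Synonyms would be NaN).
result = terms.merge(synonyms, on="col1", how="left")
print(result)
```

With the default inner merge used in the first approach above, a `col1` that has only an `en` row would be dropped from the result instead of being kept with a missing value, so the choice of `how` depends on which behavior is wanted.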