Group by and aggregate unique values in pandas dataframe

Question

I have a dataframe with the following values:

```plaintext
col1   col2   col3
10002  en     tea
10002  es     te
10002  ru     te
10003  en     coffee
10003  de     kaffee
10003  nl     kaaffee
```

I would like to group by col1 and aggregate the col3 values of the rows where col2 is not 'en'. The expected output is:

```plaintext
col1   Term Name  Synonyms
10002  tea        te
10003  coffee     kaffee | kaaffee
```

I am running the following code to achieve this:

```python
# group the data and concatenate the terms
grouped = df.groupby(['col1']) \
             .agg({'col3': lambda x: ' | '.join(x[df['col2'] != 'en'])}) \
             .reset_index()
# merge the original data with the grouped data
df_result = pd.merge(df[df['col2'] == 'en'], grouped, on=['col1'], how='left')
df_result = df_result[['col1', 'col3_x', 'col3_y']]
df_result.columns = ['col1', 'Term Name', 'Synonyms']
```

How can I get only the unique col3 values per col1 (e.g. 'te' only once for 10002)? Any help is highly appreciated.
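# Answer 1

**Score**: 0

One idea is to filter out the `en` rows with a boolean mask, remove duplicates with [`DataFrame.drop_duplicates`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html), and join the remaining values per group:

```python
# True for every row that is not the English term
m = df['col2'] != 'en'

# unique non-English terms per col1, joined with ' | '
df1 = df[m].drop_duplicates(['col1', 'col3']).groupby('col1')['col3'].agg(' | '.join)

# attach the joined synonyms to the English rows; the result goes into a
# new name so the original df stays available
out = (df.loc[~m, ['col1', 'col3']].merge(df1, on='col1')
         .rename(columns={'col3_x': 'Term Name', 'col3_y': 'Synonyms'}))
print(out)
```

```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```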
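The merge above uses the default inner join, so a term that has only an `en` row would drop out of the result. If that case matters, a left merge keeps it. The following is only a sketch of that variation: the frame is rebuilt from the question's sample, and the extra `10004` / `water` row is a made-up example that is not in the original data.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical term (10004)
# that has no non-'en' rows.
df = pd.DataFrame({
    "col1": [10002, 10002, 10002, 10003, 10003, 10003, 10004],
    "col2": ["en", "es", "ru", "en", "de", "nl", "en"],
    "col3": ["tea", "te", "te", "coffee", "kaffee", "kaaffee", "water"],
})

# True for every row that is not the English term
m = df["col2"] != "en"

# Unique synonyms per col1, in order of first appearance
synonyms = (
    df[m]
    .drop_duplicates(["col1", "col3"])
    .groupby("col1")["col3"]
    .agg(" | ".join)
    .rename("Synonyms")
    .reset_index()
)

# Left merge keeps English terms without synonyms (Synonyms is NaN there)
result = (
    df.loc[~m, ["col1", "col3"]]
    .rename(columns={"col3": "Term Name"})
    .merge(synonyms, on="col1", how="left")
)
print(result)
```

Renaming the columns before the merge also avoids the `col3_x` / `col3_y` suffixes, which is a matter of taste rather than a requirement.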
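Or use `numpy.where` with `DataFrame.pivot_table`:

```python
import numpy as np

# join the values of each group, dropping repeats but keeping first-seen order
f = lambda x: ' | '.join(dict.fromkeys(x))

# label each row as 'Term Name' (en) or 'Synonyms' (any other language),
# then pivot the labels into columns
out = (df.assign(new=np.where(df['col2'].eq('en'), 'Term Name', 'Synonyms'))
        .pivot_table(index='col1', columns='new', values='col3', aggfunc=f)
            [['Term Name', 'Synonyms']]
        .reset_index()
        .rename_axis(None, axis=1))
print(out)
```

```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```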
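The de-duplication in the second approach comes from `dict.fromkeys`, which keeps only the first occurrence of each key and preserves insertion order (Python 3.7+). A small illustration of that step on its own, with a helper name `join_unique` and sample values chosen only for this example:

```python
import pandas as pd

def join_unique(values, sep=" | "):
    """Join values with sep, dropping repeats but keeping first-seen order."""
    return sep.join(dict.fromkeys(values))

# Made-up values with repeats, standing in for one group's col3 entries
s = pd.Series(["te", "thee", "te", "te"])

print(join_unique(s))           # via dict.fromkeys
print(" | ".join(s.unique()))   # Series.unique also keeps order of appearance
```

Both lines give the same string here; a plain `set` would also remove repeats but does not guarantee the original order.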