June 29, 2023, 18:44:17

Group by and aggregate unique values in pandas dataframe

Question

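I have a dataframe with the following values:

```plaintext
col1   col2   col3
10002  en     tea
10002  es     te
10002  ru     te
10003  en     coffee
10003  de     kaffee
10003  nl     kaaffee
```

I would like to group by col1 and aggregate the values of col3 where col2 is anything other than 'en'. The expected output is:

```plaintext
col1   Term Name  Synonyms
10002  tea        te
10003  coffee     kaffee | kaaffee
```

I am running the following code to achieve this:

```python
# group the data and concatenate the terms
grouped = df.groupby(['col1']) \
             .agg({'col3': lambda x: ' | '.join(x[df['col2'] != 'en'])}) \
             .reset_index()
# merge the original data with the grouped data
df_result = pd.merge(df[df['col2'] == 'en'], grouped, on=['col1'], how='left')
df_result = df_result[['col1', 'col3_x', 'col3_y']]
df_result.columns = ['col1', 'Term Name', 'Synonyms']
```

How can I get only the unique values of col3 per col1 (e.g. a single 'te')? Any help is highly appreciated.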


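# Answer 1
**Score**: 0
One idea is to select the non-`en` rows with a boolean mask, drop duplicate `col1`/`col3` pairs with [`DataFrame.drop_duplicates`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html), and then join the remaining values per group: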
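```python
# True for every row that is not the English term
m = df['col2'] != 'en'
# unique synonyms per col1, joined into one string
df1 = df[m].drop_duplicates(['col1', 'col3']).groupby('col1')['col3'].agg(' | '.join)
# attach the synonyms to the English rows; `out` keeps the original df intact
out = (df.loc[~m, ['col1', 'col3']].merge(df1, on='col1')
         .rename(columns={'col3_x': 'Term Name', 'col3_y': 'Synonyms'}))
print(out)
```
```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```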
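To try the mask-and-`drop_duplicates` approach without an existing dataframe, the sketch below rebuilds the sample data from the question and wraps the same steps in a helper. The helper name `terms_with_synonyms` and the `how="left"` merge are assumptions added here, not part of the answer; with `how="left"`, a term that has no non-`en` rows is kept and its `Synonyms` value is NaN.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": [10002, 10002, 10002, 10003, 10003, 10003],
    "col2": ["en", "es", "ru", "en", "de", "nl"],
    "col3": ["tea", "te", "te", "coffee", "kaffee", "kaaffee"],
})


def terms_with_synonyms(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per English term with its unique synonyms."""
    not_en = frame["col2"] != "en"
    synonyms = (
        frame[not_en]
        .drop_duplicates(["col1", "col3"])   # keep the first of each repeated synonym
        .groupby("col1")["col3"]
        .agg(" | ".join)
        .rename("Synonyms")
        .reset_index()
    )
    terms = frame.loc[~not_en, ["col1", "col3"]].rename(columns={"col3": "Term Name"})
    # Assumption: a left merge, so terms without synonyms are not dropped.
    return terms.merge(synonyms, on="col1", how="left")


result = terms_with_synonyms(df)
print(result)
```

Renaming the columns before the merge avoids relying on the automatic `_x`/`_y` suffixes.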

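Or use `numpy.where` with `DataFrame.pivot_table`:

```python
import numpy as np

# dict.fromkeys drops duplicates while keeping first-seen order
f = lambda x: ' | '.join(dict.fromkeys(x))
# label each row as 'Term Name' (en) or 'Synonyms' (anything else), then pivot
out = (df.assign(new=np.where(df['col2'].eq('en'), 'Term Name', 'Synonyms'))
         .pivot_table(index='col1', columns='new', values='col3', aggfunc=f)
             [['Term Name', 'Synonyms']]
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
```
```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```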
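The deduplication in this variant comes from `dict.fromkeys`, which keeps each value once in first-seen order (dictionaries preserve insertion order since Python 3.7). A minimal sketch of that step on its own, using values from the question; the function name `join_unique` is hypothetical:

```python
def join_unique(values, sep=" | "):
    """Join values with `sep`, keeping only the first occurrence of each."""
    # dict.fromkeys builds a dict whose keys are the values; duplicate keys
    # collapse, and iteration follows insertion order.
    return sep.join(dict.fromkeys(values))


synonyms_10002 = join_unique(["te", "te"])
synonyms_10003 = join_unique(["kaffee", "kaaffee"])
print(synonyms_10002)
print(synonyms_10003)
```

A `set` would also remove the duplicates, but it does not guarantee the order of the joined terms.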

