Group by and aggregate unique values in pandas dataframe


Question

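I have a dataframe with the following values:

```plaintext
col1   col2   col3
10002  en     tea
10002  es     te
10002  ru     te
10003  en     coffee
10003  de     kaffee
10003  nl     kaaffee
```

I would like to group by `col1` and aggregate the values of `col3` where `col2` is anything other than 'en'. The expected output is:

```plaintext
col1   Term Name  Synonyms
10002  tea        te
10003  coffee     kaffee | kaaffee
```

I am running the following code to achieve this:

```python
# group the data and concatenate the terms
grouped = df.groupby(['col1']) \
             .agg({'col3': lambda x: ' | '.join(x[df['col2'] != 'en'])}) \
             .reset_index()

# merge the original data with the grouped data
df_result = pd.merge(df[df['col2'] == 'en'], grouped, on=['col1'], how='left')
df_result = df_result[['col1', 'col3_x', 'col3_y']]
df_result.columns = ['col1', 'Term Name', 'Synonyms']
```

How can I get only the unique values of `col3` per `col1` (e.g. a single 'te')? Any help is highly appreciated.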
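For anyone reproducing the problem, the sample frame can be built roughly as follows. This is only a sketch: the integer dtype for `col1` and the variable names `non_en` and `joined` are assumptions, not part of the original post. It also shows why a plain join repeats 'te' for group 10002.

```python
import pandas as pd

# Sample data copied from the table in the question (col1 as int is an assumption)
df = pd.DataFrame({
    "col1": [10002, 10002, 10002, 10003, 10003, 10003],
    "col2": ["en", "es", "ru", "en", "de", "nl"],
    "col3": ["tea", "te", "te", "coffee", "kaffee", "kaaffee"],
})

# Keep only the non-English rows, then join col3 per group without deduplicating
non_en = df[df["col2"] != "en"]
joined = non_en.groupby("col1")["col3"].agg(" | ".join)

# Group 10002 contains 'te' twice (es and ru), so the join repeats it
print(joined)
```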




# Answer 1
**Score**: 0

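One idea is to filter out the `en` rows with a boolean mask, remove duplicates with [`DataFrame.drop_duplicates`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html), and then join the remaining values per group:

```python
m = df['col2'] != 'en'

# drop repeated (col1, col3) pairs among the non-'en' rows, then join per group
df1 = df[m].drop_duplicates(['col1', 'col3']).groupby('col1')['col3'].agg(' | '.join)
# merge the 'en' rows with the joined synonyms; the shared 'col3' name gets _x/_y suffixes
df = (df.loc[~m, ['col1', 'col3']].merge(df1, on='col1')
         .rename(columns={'col3_x': 'Term Name', 'col3_y': 'Synonyms'}))
print(df)
```

```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```

Or, starting again from the original `df`, use `numpy.where` with `DataFrame.pivot_table`:

```python
import numpy as np

# dict.fromkeys keeps the first occurrence of each value and preserves order
f = lambda x: ' | '.join(dict.fromkeys(x))
# label each row as 'Term Name' (en) or 'Synonyms' (anything else), then pivot
df = (df.assign(new=np.where(df['col2'].eq('en'), 'Term Name', 'Synonyms'))
        .pivot_table(index='col1', columns='new', values='col3', aggfunc=f)
            [['Term Name', 'Synonyms']]
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
```

```plaintext
    col1 Term Name          Synonyms
0  10002       tea                te
1  10003    coffee  kaffee | kaaffee
```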
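The two ideas above can also be combined into one self-contained script. The following is a minimal sketch, not the answerer's code: the helper `join_unique`, the names `is_en`, `synonyms`, `terms` and `result`, and the choice of `how="left"` (so that a term with no synonyms would still appear, with a missing value) are all assumptions made for illustration.

```python
import pandas as pd

# Sample data from the question (col1 as int is an assumption)
df = pd.DataFrame({
    "col1": [10002, 10002, 10002, 10003, 10003, 10003],
    "col2": ["en", "es", "ru", "en", "de", "nl"],
    "col3": ["tea", "te", "te", "coffee", "kaffee", "kaaffee"],
})


def join_unique(values):
    """Join values with ' | ', dropping repeats but keeping first-seen order."""
    return " | ".join(dict.fromkeys(values))


is_en = df["col2"].eq("en")

# One row per col1 holding the deduplicated non-English terms
synonyms = (
    df[~is_en]
    .groupby("col1")["col3"]
    .agg(join_unique)
    .rename("Synonyms")
    .reset_index()
)

# The English rows supply the term name
terms = df.loc[is_en, ["col1", "col3"]].rename(columns={"col3": "Term Name"})

# Left merge keeps every English term, even one without synonyms
result = terms.merge(synonyms, on="col1", how="left")
print(result)
```

Using `dict.fromkeys` inside the aggregation removes the need for a separate `drop_duplicates` step, at the cost of a Python-level function call per group.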

huangapple
  • Published on June 29, 2023, 18:44:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76580282.html