Group by and aggregate unique values in pandas dataframe

Question

I have a dataframe with the following values:

```plaintext
col1   col2  col3
10002  en    tea
10002  es    te
10002  ru    te
10003  en    coffee
10003  de    kaffee
10003  nl    kaaffee
```

I would like to group by col1 and aggregate the values of col3 for the rows where col2 is anything other than 'en'. The expected output is:

    col1   Term Name  Synonyms
    10002  tea        te
    10003  coffee     kaffee | kaaffee

I am running the following code to achieve this:

    # group the data and concatenate the terms
    grouped = df.groupby(['col1']) \
        .agg({'col3': lambda x: ' | '.join(x[df['col2'] != 'en'])}) \
        .reset_index()
    # merge the original data with the grouped data
    df_result = pd.merge(df[df['col2'] == 'en'], grouped, on=['col1'], how='left')
    df_result = df_result[['col1', 'col3_x', 'col3_y']]
    df_result.columns = ['col1', 'Term Name', 'Synonyms']

How can I get only the unique values of col3 per col1 (e.g. just 'te' for 10002)? Any help is highly appreciated.
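To make the problem concrete, here is a minimal reproduction. It is only a sketch: the `pd.DataFrame` construction is an assumption about how the sample data is loaded, and the attempt is simplified by filtering out the 'en' rows before grouping rather than inside the lambda. A plain join keeps every repeated synonym:

```python
import pandas as pd

# Sample data copied from the question; building it by hand is an assumption.
df = pd.DataFrame({
    'col1': [10002, 10002, 10002, 10003, 10003, 10003],
    'col2': ['en', 'es', 'ru', 'en', 'de', 'nl'],
    'col3': ['tea', 'te', 'te', 'coffee', 'kaffee', 'kaaffee'],
})

# Keep only the non-English rows, then join col3 per col1 without de-duplicating.
non_en = df[df['col2'] != 'en']
joined = non_en.groupby('col1')['col3'].agg(' | '.join)

print(joined.to_dict())
# {10002: 'te | te', 10003: 'kaffee | kaaffee'}
```

The repeated 'te' for 10002 is the duplicate the question wants to get rid of.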

# Answer 1

**Score**: 0

One idea is to filter out the `en` rows with a boolean mask, remove duplicates with [`DataFrame.drop_duplicates`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html), and then join the remaining values per group:

```python
# True for every row that is not the English term
m = df['col2'] != 'en'
# unique synonyms per col1, joined into a single string
df1 = df[m].drop_duplicates(['col1', 'col3']).groupby('col1')['col3'].agg(' | '.join)
# attach the synonyms to the English rows and rename the suffixed columns
df = (df.loc[~m, ['col1', 'col3']].merge(df1, on='col1')
        .rename(columns={'col3_x': 'Term Name', 'col3_y': 'Synonyms'}))
print(df)
#     col1 Term Name          Synonyms
# 0  10002       tea                te
# 1  10003    coffee  kaffee | kaaffee
```
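The mask and `drop_duplicates` steps above can also be packaged as a small self-contained function. This is only a sketch: the name `term_synonyms` and the hand-built `sample` frame are hypothetical, and it uses `how='left'` (the answer's merge is an inner join), so a term without any synonym would be kept with `NaN` instead of being dropped.

```python
import pandas as pd


def term_synonyms(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per col1 with the 'en' term and its unique synonyms."""
    is_synonym = df['col2'] != 'en'
    # Drop repeated (col1, col3) pairs first, then join what is left per group.
    synonyms = (df[is_synonym]
                .drop_duplicates(['col1', 'col3'])
                .groupby('col1')['col3']
                .agg(' | '.join)
                .rename('Synonyms')
                .reset_index())
    terms = (df.loc[~is_synonym, ['col1', 'col3']]
             .rename(columns={'col3': 'Term Name'}))
    # Assumption: a left join, so terms with no synonyms survive (Synonyms is NaN).
    return terms.merge(synonyms, on='col1', how='left')


# Sample data copied from the question.
sample = pd.DataFrame({
    'col1': [10002, 10002, 10002, 10003, 10003, 10003],
    'col2': ['en', 'es', 'ru', 'en', 'de', 'nl'],
    'col3': ['tea', 'te', 'te', 'coffee', 'kaffee', 'kaaffee'],
})

result = term_synonyms(sample)
print(result.to_dict('records'))
```

Renaming before the merge avoids the `col3_x` / `col3_y` suffixes, and the input frame is left unmodified.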

Or use `numpy.where` with `DataFrame.pivot_table`:

    import numpy as np

    # dict.fromkeys drops duplicates while keeping first-seen order
    f = lambda x: ' | '.join(dict.fromkeys(x))
    df = (df.assign(new=np.where(df['col2'].eq('en'), 'Term Name', 'Synonyms'))
            .pivot_table(index='col1', columns='new', values='col3', aggfunc=f)
            [['Term Name', 'Synonyms']]
            .reset_index()
            .rename_axis(None, axis=1))
    print(df)
    #     col1 Term Name          Synonyms
    # 0  10002       tea                te
    # 1  10003    coffee  kaffee | kaaffee
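The `dict.fromkeys` trick in the pivot variant is worth spelling out: dictionary keys are unique and keep insertion order, so joining them yields each value once, in the order it first appeared. Below is a minimal self-contained sketch of the same idea. The names `join_unique`, `sample`, `labelled` and `wide` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd


def join_unique(values) -> str:
    """Join values with ' | ', keeping only the first occurrence of each one."""
    # dict keys are unique and preserve insertion order (guaranteed since Python 3.7).
    return ' | '.join(dict.fromkeys(values))


# Sample data copied from the question.
sample = pd.DataFrame({
    'col1': [10002, 10002, 10002, 10003, 10003, 10003],
    'col2': ['en', 'es', 'ru', 'en', 'de', 'nl'],
    'col3': ['tea', 'te', 'te', 'coffee', 'kaffee', 'kaaffee'],
})

# Label each row by the output column it should end up in.
labelled = sample.assign(
    new=np.where(sample['col2'].eq('en'), 'Term Name', 'Synonyms'))

# Pivot the labels into columns, de-duplicating while joining.
wide = (labelled
        .pivot_table(index='col1', columns='new', values='col3', aggfunc=join_unique)
        [['Term Name', 'Synonyms']]
        .reset_index()
        .rename_axis(None, axis=1))

print(join_unique(['te', 'kaffee', 'te']))  # te | kaffee
print(wide.to_dict('records'))
```

A `set` would also remove duplicates, but it does not guarantee any order, so the joined string could vary from run to run.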

huangapple
  • Published on June 29, 2023 at 18:44:17
  • Link to this article: https://go.coder-hub.com/76580282.html