python-polars Join Column Values into a concatenated string
Question
I am trying to write an aggregation routine in which the values of a column are concatenated into a single string for each group produced by a groupby.
I want to do the aggregation by calling a custom function, and I am also trying to avoid lambda (my understanding is that lambda functions only run serially, so performance would be worse). Here is my code:
def agg_ll_field(col_name) -> pl.Expr:
    return ';'.join(pl.col(col_name).drop_nulls().unique().sort())

dfa = df.lazy()\
    .groupby(by=['SharedSourceSystem', 'FOPortfolioName']).agg(
        [
            agg_ll_field('BookingUnits').alias('BOOKG_UNIT')
        ]).collect()
I keep getting this error:
agg_ll_field: Unexpected: can only join an iterable <class 'TypeError'>
Would anyone be able to help resolve this?
Thank you!
I also tried using apply instead. That seems to work, but I would like to avoid apply, since its performance is supposed to be worse.
Answer 1
Score: 1
Python's ';'.join(...) needs an iterable of strings, but pl.col(...) builds a polars expression, which is not iterable. Use str.concat instead, so the concatenation is part of the expression itself. Here is a full example:
import polars as pl

# Create a sample DataFrame
data = {
    'SharedSourceSystem': ['A', 'A', 'B', 'B', 'B'],
    'FOPortfolioName': ['X', 'X', 'Y', 'Y', 'Y'],
    'BookingUnits': [1, 2, 2, 2, 3]
}
df = pl.DataFrame(data)

# Define the custom aggregation function; it returns a polars expression,
# so no lambda or apply is needed
def agg_ll_field(col_name) -> pl.Expr:
    return pl.col(col_name).drop_nulls().unique().sort().str.concat(';')

# Apply the lazy groupby and aggregation
dfa = df.lazy()\
    .groupby(by=['SharedSourceSystem', 'FOPortfolioName']).agg(
        [
            agg_ll_field('BookingUnits').alias('BOOKG_UNIT')
        ]).collect()
# Output
┌────────────────────┬─────────────────┬────────────┐
│ SharedSourceSystem ┆ FOPortfolioName ┆ BOOKG_UNIT │
│ ---                ┆ ---             ┆ ---        │
│ str                ┆ str             ┆ str        │
╞════════════════════╪═════════════════╪════════════╡
│ A                  ┆ X               ┆ 1;2        │
│ B                  ┆ Y               ┆ 2;3        │
└────────────────────┴─────────────────┴────────────┘