In a Polars groupby aggregation, how do you concatenate string values in each group?

Question

When grouping a Polars dataframe in Python, how do you concatenate string values from a single column across rows within each group?

For example, given the following DataFrame:

import polars as pl

df = pl.DataFrame(
    {
        "col1": ["a", "b", "a", "b", "c"],
        "col2": ["val1", "val2", "val1", "val3", "val3"]
    }
)

Original df:

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ val1 │
│ b    ┆ val2 │
│ a    ┆ val1 │
│ b    ┆ val3 │
│ c    ┆ val3 │
└──────┴──────┘

I want to run a groupby operation, like:


df.groupby('col1').agg(
    col2_g = pl.col('col2').some_function_like_join(',')
)

The expected output is:

┌──────┬───────────┐
│ col1 ┆ col2_g    │
│ ---  ┆ ---       │
│ str  ┆ str       │
╞══════╪═══════════╡
│ a    ┆ val1,val1 │
│ b    ┆ val2,val3 │
│ c    ┆ val3      │
└──────┴───────────┘

What is the name of the some_function_like_join function?

I have tried the following methods, and none work:

df.groupby('col1').agg(pl.col('col2').arr.concat(','))
df.groupby('col1').agg(pl.col('col2').join(','))
df.groupby('col1').agg(pl.col('col2').arr.join(','))

Answer 1

Score: 3

If you want to concatenate them, I assume you want the result as a string with your specified delimiter:

out = df.groupby("col1").agg(
    pl.col("col2").str.concat(",")
)

Result:

shape: (3, 2)
┌──────┬───────────┐
│ col1 ┆ col2      │
│ ---  ┆ ---       │
│ str  ┆ str       │
╞══════╪═══════════╡
│ a    ┆ val1,val1 │
│ b    ┆ val2,val3 │
│ c    ┆ val3      │
└──────┴───────────┘

If you want them within a List, you simply do:

out = df.groupby("col1").agg(
    pl.col("col2")
)

Result:

shape: (3, 2)
┌──────┬──────────────────┐
│ col1 ┆ col2             │
│ ---  ┆ ---              │
│ str  ┆ list[str]        │
╞══════╪══════════════════╡
│ a    ┆ ["val1", "val1"] │
│ c    ┆ ["val3"]         │
│ b    ┆ ["val2", "val3"] │
└──────┴──────────────────┘

Answer 2

Score: 0

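I think the most straightforward way is to add a with_columns after the agg. The aggregated column will have a List dtype, so its strings can then be joined with arr.join (arr.concat would concatenate lists rather than join the strings):

df.groupby('col1').agg(pl.col('col2')).with_columns(pl.col('col2').arr.join(','))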

  • Posted on May 11, 2023 at 08:23:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76223362.html