Polars groupby + value_counts

huangapple go评论143阅读模式
英文:

Polars groupby + value_counts

问题

我需要一些关于 polars 的帮助:

我有一个包含分类值列的数据框:

┌───────────────────┬──────────────┬────────┐
│ session_id        ┆ elapsed_time ┆ fqid   │
│ ---               ┆ ---          ┆ ---    │
│ i64               ┆ i32          ┆ cat    │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0            ┆ intro  │
│ 20090312431273200 ┆ 1323         ┆ gramps │
│ 20090312431273200 ┆ 831          ┆ gramps │
│ 20090312431273200 ┆ 1147         ┆ gramps │
│ …                 ┆ …            ┆ …      │
│ 20090312431273200 ┆ 5197         ┆ teddy  │
│ 20090312431273200 ┆ 6180         ┆ teddy  │
│ 20090312431273200 ┆ 7014         ┆ teddy  │
│ 20090312431273200 ┆ 7946         ┆ teddy  │
└───────────────────┴──────────────┴────────┘

我想要转换 fqid 列,使其看起来像这样:

┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id        ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ ---               ┆ ---         ┆ ---        ┆ ---        │
│ i64               ┆ i32         ┆ i32        ┆ i32        │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 1           ┆ 1          ┆ 4          │
└───────────────────┴─────────────┴────────────┴────────────┘

也就是说,我想要:

  • session_id 上进行分组,
  • fqid 进行 value_counts() 操作,
  • 重命名列,使其为 'fqid_' + category
  • 将它们转换为列(转置),
  • 将它们添加到结果中。

从技术上讲,我可以在不使用 groupby 的情况下实现这一点,例如:

column_values = train['fqid'].value_counts().with_columns(pl.concat_str(pl.lit('fqid' + '_').alias('fqid'), pl.col('fqid').cast(pl.Utf8))).transpose()
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)

但是当我尝试从这里制作一个聚合函数时,用 train['fqid'] 替换为 pl.col('fqid') 并进行 groupby('session_id').aggregate(func('fqid')) 操作时,它只给我带来了错误,例如 AttributeError: 'Expr' object has no attribute 'with_columns'

请您提供一种正确执行此操作的方法。

英文:

I need some help with polars:

I have a dataframe with a categorical values column

┌───────────────────┬──────────────┬────────┐
│ session_id        ┆ elapsed_time ┆ fqid   │
│ ---               ┆ ---          ┆ ---    │
│ i64               ┆ i32          ┆ cat    │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0            ┆ intro  │
│ 20090312431273200 ┆ 1323         ┆ gramps │
│ 20090312431273200 ┆ 831          ┆ gramps │
│ 20090312431273200 ┆ 1147         ┆ gramps │
│ …                 ┆ …            ┆ …      │
│ 20090312431273200 ┆ 5197         ┆ teddy  │
│ 20090312431273200 ┆ 6180         ┆ teddy  │
│ 20090312431273200 ┆ 7014         ┆ teddy  │
│ 20090312431273200 ┆ 7946         ┆ teddy  │
└───────────────────┴──────────────┴────────┘

And I want to transform the fqid-column to look like this:

┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id        ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ ---               ┆ ---         ┆ ---        ┆ ---        │
│ i64               ┆ i32         ┆ i32        ┆ i32        │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 1           ┆ 1          ┆ 4          │
└───────────────────┴─────────────┴────────────┴────────────┘

That is, I would like to:

  • Groupby over session_id,
  • Make a value_counts() over fqid,
  • Rename columns so that it would be 'fqid_' + category,
  • Turn them into columns (transpose),
  • Add them to the result.

Technically, I could achieve this without groupby by using something like

column_values = train['fqid'].value_counts().with_columns(pl.concat_str(pl.lit('fqid' + '_').alias('fqid'), pl.col('fqid').cast(pl.Utf8))).transpose()
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)

But when I am trying to make an aggregating function from this replacing train['fqid'] with pl.col('fqid') and making a groupby('session_id').aggregate(func('fqid')) it gives me nothing but errors like AttributeError: 'Expr' object has no attribute 'with_columns'.

Could you kindly suggest a proper way of making this operation?

答案1

得分: 3

从以下开始

train=pl.from_repr(
     """┌───────────────────┬──────────────┬────────┐
         session_id         elapsed_time  fqid   
         ---                ---           ---    
         i64                i32           cat    
        ╞═══════════════════╪══════════════╪════════╡
         20090312431273200  0             intro  
         20090312431273200  1323          gramps 
         20090312431273200  831           gramps 
         20090312431273200  1147          gramps 
         20090312431273200  5197          teddy  
         20090312431273200  6180          teddy  
         20090312431273200  7014          teddy  
         20090312431273200  7946          teddy  
        └───────────────────┴──────────────┴────────┘""")

我们可以做

(
    train
        .groupby(['session_id','fqid']) 
        .count()  #same as .agg(pl.count())
        .pivot('count','session_id','fqid','first')
        .select('session_id', pl.exclude('session_id').prefix('fqid_'))
)

形状: (1, 4)

┌───────────────────┬────────────┬─────────────┬────────────┐
 session_id         fqid_teddy  fqid_gramps  fqid_intro 
 ---                ---         ---          ---        
 i64                u32         u32          u32        
╞═══════════════════╪════════════╪═════════════╪════════════╡
 20090312431273200  4           3            1          
└───────────────────┴────────────┴─────────────┴────────────┘

因为您所需的输出是fqid列本身的计数而不是平均值或聚合,所以我们将其包括在groupby中。接下来,我们进行pivot操作以使结果宽展。pivot的输出不保留原始列名,因此我们需要手动添加回来。我们在select中首先获取session_id,然后添加到其中除session_id以外的每一列,前缀为'fqid_',以获得最终所需的结果。

顺便说一句,我没有使用value_counts,因为它返回一个结构体列表,所以我们不能执行例如train.select(pl.col('fqid').value_counts().over('session_id'))

英文:

Starting from

train=pl.from_repr(
     """┌───────────────────┬──────────────┬────────┐
        │ session_id        ┆ elapsed_time ┆ fqid   │
        │ ---               ┆ ---          ┆ ---    │
        │ i64               ┆ i32          ┆ cat    │
        ╞═══════════════════╪══════════════╪════════╡
        │ 20090312431273200 ┆ 0            ┆ intro  │
        │ 20090312431273200 ┆ 1323         ┆ gramps │
        │ 20090312431273200 ┆ 831          ┆ gramps │
        │ 20090312431273200 ┆ 1147         ┆ gramps │
        │ 20090312431273200 ┆ 5197         ┆ teddy  │
        │ 20090312431273200 ┆ 6180         ┆ teddy  │
        │ 20090312431273200 ┆ 7014         ┆ teddy  │
        │ 20090312431273200 ┆ 7946         ┆ teddy  │
        └───────────────────┴──────────────┴────────┘""")

we can do

(
    train
        .groupby(['session_id','fqid']) 
        .count()  #same as .agg(pl.count())
        .pivot('count','session_id','fqid','first')
        .select('session_id', pl.exclude('session_id').prefix('fqid_'))
)

shape: (1, 4)
┌───────────────────┬────────────┬─────────────┬────────────┐
│ session_id        ┆ fqid_teddy ┆ fqid_gramps ┆ fqid_intro │
│ ---               ┆ ---        ┆ ---         ┆ ---        │
│ i64               ┆ u32        ┆ u32         ┆ u32        │
╞═══════════════════╪════════════╪═════════════╪════════════╡
│ 20090312431273200 ┆ 4          ┆ 3           ┆ 1          │
└───────────────────┴────────────┴─────────────┴────────────┘

Since the output of what you want is a count rather than an average or aggregation of the fqid column itself, we include it in the groupby. Next, we do a pivot to make the results wide. The output of pivot doesn't keep the original column name so we have to add that back manually. We do that in a select by first taking the session_id and then adding to that every column except session_id with the prefix 'fqid_' to get the final desired result.

Incidentally, I'm not using value_counts because it returns a list of structs so we can't do, for example, train.select(pl.col('fqid').value_counts().over('session_id'))

huangapple
  • 本文由 发表于 2023年6月1日 16:50:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/76380172.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定