Polars groupby + value_counts
Question
I need some help with polars:
I have a dataframe with a column of categorical values:
┌───────────────────┬──────────────┬────────┐
│ session_id ┆ elapsed_time ┆ fqid │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ cat │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0 ┆ intro │
│ 20090312431273200 ┆ 1323 ┆ gramps │
│ 20090312431273200 ┆ 831 ┆ gramps │
│ 20090312431273200 ┆ 1147 ┆ gramps │
│ … ┆ … ┆ … │
│ 20090312431273200 ┆ 5197 ┆ teddy │
│ 20090312431273200 ┆ 6180 ┆ teddy │
│ 20090312431273200 ┆ 7014 ┆ teddy │
│ 20090312431273200 ┆ 7946 ┆ teddy │
└───────────────────┴──────────────┴────────┘
And I want to transform the fqid column to look like this:
┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ i32 ┆ i32 │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 1 ┆ 1 ┆ 4 │
└───────────────────┴─────────────┴────────────┴────────────┘
That is, I would like to:
- Groupby over session_id,
- Make a value_counts() over fqid,
- Rename columns so that it would be 'fqid_' + category,
- Turn them into columns (transpose),
- Add them to the result.
Technically, I could achieve this without groupby by using something like
column_values = (
    train['fqid']
    .value_counts()
    .with_columns(pl.concat_str(pl.lit('fqid' + '_').alias('fqid'), pl.col('fqid').cast(pl.Utf8)))
    .transpose()
)
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)
But when I try to turn this into an aggregating function, replacing train['fqid'] with pl.col('fqid') and doing groupby('session_id').aggregate(func('fqid')), it gives me nothing but errors like AttributeError: 'Expr' object has no attribute 'with_columns'.
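For example, a minimal sketch of that replacement (the first expression above with train['fqid'] swapped for pl.col('fqid')) fails immediately:

import polars as pl

# with_columns is a DataFrame/LazyFrame method, not an Expr method,
# so building the expression this way raises the AttributeError quoted above
pl.col('fqid').value_counts().with_columns(
    pl.concat_str(pl.lit('fqid' + '_').alias('fqid'), pl.col('fqid').cast(pl.Utf8))
)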
Could you kindly suggest a proper way of performing this operation?
Answer 1
Score: 3
Starting from
train=pl.from_repr(
"""┌───────────────────┬──────────────┬────────┐
│ session_id ┆ elapsed_time ┆ fqid │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ cat │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0 ┆ intro │
│ 20090312431273200 ┆ 1323 ┆ gramps │
│ 20090312431273200 ┆ 831 ┆ gramps │
│ 20090312431273200 ┆ 1147 ┆ gramps │
│ 20090312431273200 ┆ 5197 ┆ teddy │
│ 20090312431273200 ┆ 6180 ┆ teddy │
│ 20090312431273200 ┆ 7014 ┆ teddy │
│ 20090312431273200 ┆ 7946 ┆ teddy │
└───────────────────┴──────────────┴────────┘""")
we can do
(
train
.groupby(['session_id','fqid'])
.count() #same as .agg(pl.count())
.pivot('count','session_id','fqid','first')
.select('session_id', pl.exclude('session_id').prefix('fqid_'))
)
shape: (1, 4)
┌───────────────────┬────────────┬─────────────┬────────────┐
│ session_id ┆ fqid_teddy ┆ fqid_gramps ┆ fqid_intro │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 ┆ u32 │
╞═══════════════════╪════════════╪═════════════╪════════════╡
│ 20090312431273200 ┆ 4 ┆ 3 ┆ 1 │
└───────────────────┴────────────┴─────────────┴────────────┘
Since the output you want is a count, rather than an average or other aggregation of the fqid column itself, we include fqid in the groupby. Next, we do a pivot to make the results wide. The output of pivot doesn't keep the original column name, so we have to add that back manually. We do that in a select by first taking session_id and then adding to it every column except session_id with the prefix 'fqid_' to get the final desired result.
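For reference, here is a sketch of the same pivot call with the positional arguments written out as keywords. The keyword names are an assumption based on the polars API at the time of writing and have been renamed in later releases (for example, columns has since become on):

counts = train.groupby(['session_id', 'fqid']).count()
counts.pivot(
    values='count',              # the column whose values fill the new cells
    index='session_id',          # one output row per session_id
    columns='fqid',              # one output column per fqid category
    aggregate_function='first',  # each (session_id, fqid) pair is already unique
)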
Incidentally, I'm not using value_counts because it returns a list of structs, so we can't do, for example, train.select(pl.col('fqid').value_counts().over('session_id')).
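That said, value_counts can still be made to work inside a groupby by exploding and unnesting the list of structs before the same pivot. The sketch below assumes the count field of the struct is named counts (newer polars versions rename it to count); it should give the same wide result, just with more steps:

(
    train
    .groupby('session_id')
    .agg(pl.col('fqid').value_counts())  # one list of {fqid, counts} structs per session
    .explode('fqid')                      # one struct per row
    .unnest('fqid')                       # -> columns: session_id, fqid, counts
    .pivot('counts', 'session_id', 'fqid', 'first')
    .select('session_id', pl.exclude('session_id').prefix('fqid_'))
)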