Polars groupby + value_counts
Question
I need some help with polars:
I have a dataframe with a column of categorical values:
┌───────────────────┬──────────────┬────────┐
│ session_id ┆ elapsed_time ┆ fqid │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ cat │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0 ┆ intro │
│ 20090312431273200 ┆ 1323 ┆ gramps │
│ 20090312431273200 ┆ 831 ┆ gramps │
│ 20090312431273200 ┆ 1147 ┆ gramps │
│ … ┆ … ┆ … │
│ 20090312431273200 ┆ 5197 ┆ teddy │
│ 20090312431273200 ┆ 6180 ┆ teddy │
│ 20090312431273200 ┆ 7014 ┆ teddy │
│ 20090312431273200 ┆ 7946 ┆ teddy │
└───────────────────┴──────────────┴────────┘
And I want to transform the fqid column so that it looks like this:
┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ i32 ┆ i32 │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 1 ┆ 1 ┆ 4 │
└───────────────────┴─────────────┴────────────┴────────────┘
That is, I would like to:
- Group by session_id,
- Make a value_counts() over fqid,
- Rename the columns so that each becomes 'fqid_' + category,
- Turn them into columns (transpose),
- Add them to the result.
Technically, I could achieve this without groupby by using something like
column_values = train['fqid'].value_counts().with_columns(pl.concat_str(pl.lit('fqid' + '_').alias('fqid'), pl.col('fqid').cast(pl.Utf8))).transpose()
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)
But when I try to turn this into an aggregating function, replacing train['fqid'] with pl.col('fqid') and calling groupby('session_id').aggregate(func('fqid')), it gives me nothing but errors such as AttributeError: 'Expr' object has no attribute 'with_columns'.
Could you kindly suggest a proper way of performing this operation?
Answer 1
Score: 3
Starting from
train=pl.from_repr(
"""┌───────────────────┬──────────────┬────────┐
│ session_id ┆ elapsed_time ┆ fqid │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i32 ┆ cat │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0 ┆ intro │
│ 20090312431273200 ┆ 1323 ┆ gramps │
│ 20090312431273200 ┆ 831 ┆ gramps │
│ 20090312431273200 ┆ 1147 ┆ gramps │
│ 20090312431273200 ┆ 5197 ┆ teddy │
│ 20090312431273200 ┆ 6180 ┆ teddy │
│ 20090312431273200 ┆ 7014 ┆ teddy │
│ 20090312431273200 ┆ 7946 ┆ teddy │
└───────────────────┴──────────────┴────────┘""")
we can do
(
    train
    .groupby(['session_id', 'fqid'])
    .count()  # same as .agg(pl.count())
    .pivot('count', 'session_id', 'fqid', 'first')
    .select('session_id', pl.exclude('session_id').prefix('fqid_'))
)
shape: (1, 4)
┌───────────────────┬────────────┬─────────────┬────────────┐
│ session_id ┆ fqid_teddy ┆ fqid_gramps ┆ fqid_intro │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 ┆ u32 │
╞═══════════════════╪════════════╪═════════════╪════════════╡
│ 20090312431273200 ┆ 4 ┆ 3 ┆ 1 │
└───────────────────┴────────────┴─────────────┴────────────┘
Since the output you want is a count, rather than an average or some other aggregation of the fqid column itself, we include fqid in the groupby. Next, we do a pivot to make the results wide. The output of pivot doesn't keep the original column name, so we have to add that back manually. We do that in a select by first taking session_id and then adding every column except session_id with the prefix 'fqid_' to get the final desired result.
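To make the pivot step concrete, this is roughly the intermediate long-format frame it receives (a sketch against the same train frame as above; the row order within a groupby isn't guaranteed, so your rows may come out in a different order):
(
    train
    .groupby(['session_id', 'fqid'])
    .count()
)
shape: (3, 3)
┌───────────────────┬────────┬───────┐
│ session_id        ┆ fqid   ┆ count │
│ ---               ┆ ---    ┆ ---   │
│ i64               ┆ cat    ┆ u32   │
╞═══════════════════╪════════╪═══════╡
│ 20090312431273200 ┆ teddy  ┆ 4     │
│ 20090312431273200 ┆ gramps ┆ 3     │
│ 20090312431273200 ┆ intro  ┆ 1     │
└───────────────────┴────────┴───────┘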
Incidentally, I'm not using value_counts because it returns a list of structs, so we can't do, for example, train.select(pl.col('fqid').value_counts().over('session_id')).
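If you did want to go through value_counts anyway, a rough sketch is to explode the list of structs and unnest it before pivoting. This reuses the same older groupby/pivot API as above and assumes the count field inside the struct is named 'counts', which has changed between polars versions, so check yours:
(
    train
    .groupby('session_id')
    .agg(pl.col('fqid').value_counts())  # one list of structs per group
    .explode('fqid')                      # one struct per row
    .unnest('fqid')                       # -> session_id, fqid, counts ('count' on newer polars)
    .pivot('counts', 'session_id', 'fqid', 'first')
    .select('session_id', pl.exclude('session_id').prefix('fqid_'))
)
This should give the same wide result as the groupby + count + pivot approach above.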