2023年6月1日 16:50:23go评论153阅读模式

英文:

Polars groupby + value_counts

问题

我需要一些关于 polars 的帮助：

我有一个包含分类值列的数据框：

┌───────────────────┬──────────────┬────────┐
│ session_id        ┆ elapsed_time ┆ fqid   │
│ ---               ┆ ---          ┆ ---    │
│ i64               ┆ i32          ┆ cat    │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0            ┆ intro  │
│ 20090312431273200 ┆ 1323         ┆ gramps │
│ 20090312431273200 ┆ 831          ┆ gramps │
│ 20090312431273200 ┆ 1147         ┆ gramps │
│ …                 ┆ …            ┆ …      │
│ 20090312431273200 ┆ 5197         ┆ teddy  │
│ 20090312431273200 ┆ 6180         ┆ teddy  │
│ 20090312431273200 ┆ 7014         ┆ teddy  │
│ 20090312431273200 ┆ 7946         ┆ teddy  │
└───────────────────┴──────────────┴────────┘

我想要转换 fqid 列，使其看起来像这样：

┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id        ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ ---               ┆ ---         ┆ ---        ┆ ---        │
│ i64               ┆ i32         ┆ i32        ┆ i32        │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 1           ┆ 1          ┆ 4          │
└───────────────────┴─────────────┴────────────┴────────────┘

也就是说，我想要：

在 session_id 上进行分组，
对 fqid 进行 value_counts() 操作，
重命名列，使其为 'fqid_' + category，
将它们转换为列（转置），
将它们添加到结果中。

从技术上讲，我可以在不使用 groupby 的情况下实现这一点，例如：

column_values = train['fqid'].value_counts().with_columns(pl.concat_str(pl.lit('fqid' + '_').alias('fqid'), pl.col('fqid').cast(pl.Utf8))).transpose()
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)

但是当我尝试从这里制作一个聚合函数时，用 train['fqid'] 替换为 pl.col('fqid') 并进行 groupby('session_id').aggregate(func('fqid')) 操作时，它只给我带来了错误，例如 AttributeError: 'Expr' object has no attribute 'with_columns'。

请您提供一种正确执行此操作的方法。

英文:

I need some help with polars:

I have a dataframe with a categorical values column

┌───────────────────┬──────────────┬────────┐
│ session_id        ┆ elapsed_time ┆ fqid   │
│ ---               ┆ ---          ┆ ---    │
│ i64               ┆ i32          ┆ cat    │
╞═══════════════════╪══════════════╪════════╡
│ 20090312431273200 ┆ 0            ┆ intro  │
│ 20090312431273200 ┆ 1323         ┆ gramps │
│ 20090312431273200 ┆ 831          ┆ gramps │
│ 20090312431273200 ┆ 1147         ┆ gramps │
│ …                 ┆ …            ┆ …      │
│ 20090312431273200 ┆ 5197         ┆ teddy  │
│ 20090312431273200 ┆ 6180         ┆ teddy  │
│ 20090312431273200 ┆ 7014         ┆ teddy  │
│ 20090312431273200 ┆ 7946         ┆ teddy  │
└───────────────────┴──────────────┴────────┘

And I want to transform the fqid-column to look like this:

┌───────────────────┬─────────────┬────────────┬────────────┐
│ session_id        ┆ fqid_gramps ┆ fqid_intro ┆ fqid_teddy │
│ ---               ┆ ---         ┆ ---        ┆ ---        │
│ i64               ┆ i32         ┆ i32        ┆ i32        │
╞═══════════════════╪═════════════╪════════════╪════════════╡
│ 20090312431273200 ┆ 1           ┆ 1          ┆ 4          │
└───────────────────┴─────────────┴────────────┴────────────┘

That is, I would like to:

Groupby over session_id,
Make a value_counts() over fqid,
Rename columns so that it would be 'fqid_' + category,
Turn them into columns (transpose),
Add them to the result.

Technically, I could achieve this without groupby by using something like

column_values = train[&#39;fqid&#39;].value_counts().with_columns(pl.concat_str(pl.lit(&#39;fqid&#39; + &#39;_&#39;).alias(&#39;fqid&#39;), pl.col(&#39;fqid&#39;).cast(pl.Utf8))).transpose()
column_values = column_values.rename(column_values.head(1).to_dicts().pop()).slice(1)

But when I am trying to make an aggregating function from this replacing train['fqid'] with pl.col('fqid') and making a groupby('session_id').aggregate(func('fqid')) it gives me nothing but errors like AttributeError: 'Expr' object has no attribute 'with_columns'.

Could you kindly suggest a proper way of making this operation?

答案1

得分: 3

从以下开始

train=pl.from_repr(
     &quot;&quot;&quot;┌───────────────────┬──────────────┬────────┐
        │ session_id        ┆ elapsed_time ┆ fqid   │
        │ ---               ┆ ---          ┆ ---    │
        │ i64               ┆ i32          ┆ cat    │
        ╞═══════════════════╪══════════════╪════════╡
        │ 20090312431273200 ┆ 0            ┆ intro  │
        │ 20090312431273200 ┆ 1323         ┆ gramps │
        │ 20090312431273200 ┆ 831          ┆ gramps │
        │ 20090312431273200 ┆ 1147         ┆ gramps │
        │ 20090312431273200 ┆ 5197         ┆ teddy  │
        │ 20090312431273200 ┆ 6180         ┆ teddy  │
        │ 20090312431273200 ┆ 7014         ┆ teddy  │
        │ 20090312431273200 ┆ 7946         ┆ teddy  │
        └───────────────────┴──────────────┴────────┘&quot;&quot;&quot;)

我们可以做

(
    train
        .groupby(['session_id','fqid']) 
        .count()  #same as .agg(pl.count())
        .pivot('count','session_id','fqid','first')
        .select('session_id', pl.exclude('session_id').prefix('fqid_'))
)

形状: (1, 4)

┌───────────────────┬────────────┬─────────────┬────────────┐
│ session_id        ┆ fqid_teddy ┆ fqid_gramps ┆ fqid_intro │
│ ---               ┆ ---        ┆ ---         ┆ ---        │
│ i64               ┆ u32        ┆ u32         ┆ u32        │
╞═══════════════════╪════════════╪═════════════╪════════════╡
│ 20090312431273200 ┆ 4          ┆ 3           ┆ 1          │
└───────────────────┴────────────┴─────────────┴────────────┘

因为您所需的输出是fqid列本身的计数而不是平均值或聚合，所以我们将其包括在groupby中。接下来，我们进行pivot操作以使结果宽展。pivot的输出不保留原始列名，因此我们需要手动添加回来。我们在select中首先获取session_id，然后添加到其中除session_id以外的每一列，前缀为'fqid_'，以获得最终所需的结果。

顺便说一句，我没有使用value_counts，因为它返回一个结构体列表，所以我们不能执行例如train.select(pl.col('fqid').value_counts().over('session_id'))。

英文:

Starting from

train=pl.from_repr(
     &quot;&quot;&quot;┌───────────────────┬──────────────┬────────┐
        │ session_id        ┆ elapsed_time ┆ fqid   │
        │ ---               ┆ ---          ┆ ---    │
        │ i64               ┆ i32          ┆ cat    │
        ╞═══════════════════╪══════════════╪════════╡
        │ 20090312431273200 ┆ 0            ┆ intro  │
        │ 20090312431273200 ┆ 1323         ┆ gramps │
        │ 20090312431273200 ┆ 831          ┆ gramps │
        │ 20090312431273200 ┆ 1147         ┆ gramps │
        │ 20090312431273200 ┆ 5197         ┆ teddy  │
        │ 20090312431273200 ┆ 6180         ┆ teddy  │
        │ 20090312431273200 ┆ 7014         ┆ teddy  │
        │ 20090312431273200 ┆ 7946         ┆ teddy  │
        └───────────────────┴──────────────┴────────┘&quot;&quot;&quot;)

we can do

(
    train
        .groupby([&#39;session_id&#39;,&#39;fqid&#39;]) 
        .count()  #same as .agg(pl.count())
        .pivot(&#39;count&#39;,&#39;session_id&#39;,&#39;fqid&#39;,&#39;first&#39;)
        .select(&#39;session_id&#39;, pl.exclude(&#39;session_id&#39;).prefix(&#39;fqid_&#39;))
)

shape: (1, 4)
┌───────────────────┬────────────┬─────────────┬────────────┐
│ session_id        ┆ fqid_teddy ┆ fqid_gramps ┆ fqid_intro │
│ ---               ┆ ---        ┆ ---         ┆ ---        │
│ i64               ┆ u32        ┆ u32         ┆ u32        │
╞═══════════════════╪════════════╪═════════════╪════════════╡
│ 20090312431273200 ┆ 4          ┆ 3           ┆ 1          │
└───────────────────┴────────────┴─────────────┴────────────┘

Since the output of what you want is a count rather than an average or aggregation of the fqid column itself, we include it in the groupby. Next, we do a pivot to make the results wide. The output of pivot doesn't keep the original column name so we have to add that back manually. We do that in a select by first taking the session_id and then adding to that every column except session_id with the prefix 'fqid_' to get the final desired result.

Incidentally, I'm not using value_counts because it returns a list of structs so we can't do, for example, train.select(pl.col('fqid').value_counts().over('session_id'))

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Polars groupby + value_counts

问题

答案1

为什么arcpy中的len()函数与实际情况不匹配

使用Matplotlib进行多曲线拟合

Export NumPy array as NetCDF4 file

时间复杂度在使用Python字符串时

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论