2023年6月27日 20:23:53go评论89阅读模式

英文:

How can I drop duplicate rows from a Polars DataFrame according to a custom function?

问题

我试图在特定列中删除所有重复项并聚合非重复项的数据帧。.unique 函数让我选择{‘first’, ‘last’, ‘any’, ‘none’}之一。然而，我想对所有数值应用mean函数，对所有分类值应用mode函数。

我可以通过在我感兴趣的列上使用groupby来实现这一点，就像下面的示例中所示：

df = pl.DataFrame(
    {
        "id": [0, 0, 0, 1, 1],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
        "size": [2, 4, 6, 1, 3]
    }
)

df_list = []
for gkey, group in df.groupby("id"):
    g = group.select(pl.col("id"),
       pl.all().exclude("id", "size").mode().first(),
       pl.col("size").mean()
    ).unique()
    df_list.append(g)

df_dedup = pl.concat(df_list)

这会给我期望的输出：

> print(df_dedup)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

问题是，这个实现（不出所料）相当慢，所以我想知道是否有更好的方法来实现这一点，或者是否有办法优化我现有的代码。

英文:

I have a dataframe where I am trying to remove all duplicates in a specific column while aggregating the non-duplicated values.

The .unique function lets me choose only one of {‘first’, ‘last’, ‘any’, ‘none’}. However what I want is to apply the mean function to all numeric values, and the mode function to all categorical values.

I can do this by using groupby on the column I'm interested in, like in the example below:

df = pl.DataFrame(
    {
        &quot;id&quot;: [0, 0, 0, 1, 1],
        &quot;color&quot;: [&quot;red&quot;, &quot;green&quot;, &quot;green&quot;, &quot;red&quot;, &quot;red&quot;],
        &quot;shape&quot;: [&quot;square&quot;, &quot;triangle&quot;, &quot;square&quot;, &quot;triangle&quot;, &quot;square&quot;],
        &quot;size&quot;: [2, 4, 6, 1, 3]
    }
)

df_list = []
for gkey, group in df.groupby(&quot;id&quot;):
    g = group.select(pl.col(&quot;id&quot;),
       pl.all().exclude(&quot;id&quot;, &quot;size&quot;).mode().first(),
       pl.col(&quot;size&quot;).mean()
    ).unique()
    df_list.append(g)

df_dedup = pl.concat(df_list)

Which gives me the output I'm expecting:

&gt; print(df_dedup)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

The problem is that this implementation is (unsurprisingly) quite slow, so I was wondering if there was a better way of doing this, or if it's possible to optimize the code I have in some way.

答案1

得分: 2

以下是翻译好的部分：

How about

In [22]: df.groupby("id").agg(
    ...:     pl.col(["color", "shape"]).mode().sort(descending=True).first(),
    ...:     pl.col("size").mean(),
    ...: )
    ...:
Out[22]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

或者，使用列选择器：

In [30]: import polars.selectors as cs

In [31]: df.groupby("id").agg(
    ...:     cs.string().mode().sort(descending=True).first(),
    ...:     cs.numeric().mean(),
    ...: )
Out[31]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 0   ┆ green ┆ square   ┆ 4.0  │
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
└─────┴───────┴──────────┴──────┘

英文:

How about

In [22]: df.groupby(&quot;id&quot;).agg(
    ...:     pl.col([&quot;color&quot;, &quot;shape&quot;]).mode().sort(descending=True).first(),
    ...:     pl.col(&quot;size&quot;).mean(),
    ...: )
    ...:
Out[22]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

Or, using the column selectors:

In [30]: import polars.selectors as cs

In [31]: df.groupby(&quot;id&quot;).agg(
    ...:     cs.string().mode().sort(descending=True).first(),
    ...:     cs.numeric().mean(),
    ...: )
Out[31]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 0   ┆ green ┆ square   ┆ 4.0  │
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
└─────┴───────┴──────────┴──────┘

答案2

得分: 2

你可以按类型选择列，例如 pl.col(pl.Utf8) 用于所有 str 类型的列。

还有新的 polars.selectors 辅助模块。

import polars.selectors as cs

df.groupby('id').agg(
   cs.string().mode().first(),
   cs.numeric().mean()
)

shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

英文:

You can select columns by type, e.g. pl.col(pl.Utf8) for all str columns.

There's also the new polars.selectors helper module.

import polars.selectors as cs

df.groupby(&#39;id&#39;).agg(
   cs.string().mode().first(),
   cs.numeric().mean()
)

shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何根据自定义函数从 Polars DataFrame 中删除重复行？

问题

答案1

答案2

使用OpenCV在Python中，如何获取检测到的轮廓的极端点？

安装pyodbc时出现错误，无法为pyodbc构建轮子。

如何通过Python发送一个“数组”电子邮件？

提取`scipy.optimize.newton()`结果的收敛布尔值。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论