如何根据自定义函数从 Polars DataFrame 中删除重复行?

huangapple go评论89阅读模式
英文:

How can I drop duplicate rows from a Polars DataFrame according to a custom function?

问题

我试图在特定列中删除所有重复项并聚合非重复项的数据帧。.unique 函数让我选择{‘first’, ‘last’, ‘any’, ‘none’}之一。然而,我想对所有数值应用mean函数,对所有分类值应用mode函数。

我可以通过在我感兴趣的列上使用groupby来实现这一点,就像下面的示例中所示:

df = pl.DataFrame(
    {
        "id": [0, 0, 0, 1, 1],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
        "size": [2, 4, 6, 1, 3]
    }
)

df_list = []
for gkey, group in df.groupby("id"):
    g = group.select(pl.col("id"),
       pl.all().exclude("id", "size").mode().first(),
       pl.col("size").mean()
    ).unique()
    df_list.append(g)

df_dedup = pl.concat(df_list)

这会给我期望的输出:

> print(df_dedup)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
 id   color  shape     size 
 ---  ---    ---       ---  
 i64  str    str       f64  
╞═════╪═══════╪══════════╪══════╡
 1    red    triangle  2.0  
 0    green  square    4.0  
└─────┴───────┴──────────┴──────┘

问题是,这个实现(不出所料)相当慢,所以我想知道是否有更好的方法来实现这一点,或者是否有办法优化我现有的代码。

英文:

I have a dataframe where I am trying to remove all duplicates in a specific column while aggregating the non-duplicated values.

The .unique function lets me choose only one of {‘first’, ‘last’, ‘any’, ‘none’}. However what I want is to apply the mean function to all numeric values, and the mode function to all categorical values.

I can do this by using groupby on the column I'm interested in, like in the example below:

df = pl.DataFrame(
    {
        "id": [0, 0, 0, 1, 1],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
        "size": [2, 4, 6, 1, 3]
    }
)

df_list = []
for gkey, group in df.groupby("id"):
    g = group.select(pl.col("id"),
       pl.all().exclude("id", "size").mode().first(),
       pl.col("size").mean()
    ).unique()
    df_list.append(g)

df_dedup = pl.concat(df_list)

Which gives me the output I'm expecting:

> print(df_dedup)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

The problem is that this implementation is (unsurprisingly) quite slow, so I was wondering if there was a better way of doing this, or if it's possible to optimize the code I have in some way.

答案1

得分: 2

以下是翻译好的部分:

  1. How about
In [22]: df.groupby("id").agg(
    ...:     pl.col(["color", "shape"]).mode().sort(descending=True).first(),
    ...:     pl.col("size").mean(),
    ...: )
    ...:
Out[22]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
 id   color  shape     size 
 ---  ---    ---       ---  
 i64  str    str       f64  
╞═════╪═══════╪══════════╪══════╡
 1    red    triangle  2.0  
 0    green  square    4.0  
└─────┴───────┴──────────┴──────┘

或者,使用列选择器:

In [30]: import polars.selectors as cs

In [31]: df.groupby("id").agg(
    ...:     cs.string().mode().sort(descending=True).first(),
    ...:     cs.numeric().mean(),
    ...: )
Out[31]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
 id   color  shape     size 
 ---  ---    ---       ---  
 i64  str    str       f64  
╞═════╪═══════╪══════════╪══════╡
 0    green  square    4.0  
 1    red    triangle  2.0  
└─────┴───────┴──────────┴──────┘
英文:

How about

In [22]: df.groupby("id").agg(
    ...:     pl.col(["color", "shape"]).mode().sort(descending=True).first(),
    ...:     pl.col("size").mean(),
    ...: )
    ...:
Out[22]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
 id   color  shape     size 
 ---  ---    ---       ---  
 i64  str    str       f64  
╞═════╪═══════╪══════════╪══════╡
 1    red    triangle  2.0  
 0    green  square    4.0  
└─────┴───────┴──────────┴──────┘

Or, using the column selectors:

In [30]: import polars.selectors as cs

In [31]: df.groupby("id").agg(
    ...:     cs.string().mode().sort(descending=True).first(),
    ...:     cs.numeric().mean(),
    ...: )
Out[31]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
 id   color  shape     size 
 ---  ---    ---       ---  
 i64  str    str       f64  
╞═════╪═══════╪══════════╪══════╡
 0    green  square    4.0  
 1    red    triangle  2.0  
└─────┴───────┴──────────┴──────┘

答案2

得分: 2

你可以按类型选择列,例如 pl.col(pl.Utf8) 用于所有 str 类型的列。

还有新的 polars.selectors 辅助模块。

import polars.selectors as cs

df.groupby('id').agg(
   cs.string().mode().first(),
   cs.numeric().mean()
)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘
英文:

You can select columns by type, e.g. pl.col(pl.Utf8) for all str columns.

There's also the new polars.selectors helper module.

import polars.selectors as cs

df.groupby('id').agg(
   cs.string().mode().first(),
   cs.numeric().mean()
)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

huangapple
  • 本文由 发表于 2023年6月27日 20:23:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76564824.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定