英文:
How can I drop duplicate rows from a Polars DataFrame according to a custom function?
问题
我试图在特定列中删除所有重复项并聚合非重复项的数据帧。.unique
函数让我选择{‘first’, ‘last’, ‘any’, ‘none’}
之一。然而,我想对所有数值应用mean
函数,对所有分类值应用mode
函数。
我可以通过在我感兴趣的列上使用groupby
来实现这一点,就像下面的示例中所示:
df = pl.DataFrame(
{
"id": [0, 0, 0, 1, 1],
"color": ["red", "green", "green", "red", "red"],
"shape": ["square", "triangle", "square", "triangle", "square"],
"size": [2, 4, 6, 1, 3]
}
)
df_list = []
for gkey, group in df.groupby("id"):
g = group.select(pl.col("id"),
pl.all().exclude("id", "size").mode().first(),
pl.col("size").mean()
).unique()
df_list.append(g)
df_dedup = pl.concat(df_list)
这会给我期望的输出:
> print(df_dedup)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 1 ┆ red ┆ triangle ┆ 2.0 │
│ 0 ┆ green ┆ square ┆ 4.0 │
└─────┴───────┴──────────┴──────┘
问题是,这个实现(不出所料)相当慢,所以我想知道是否有更好的方法来实现这一点,或者是否有办法优化我现有的代码。
英文:
I have a dataframe where I am trying to remove all duplicates in a specific column while aggregating the non-duplicated values.
The .unique
function lets me choose only one of {‘first’, ‘last’, ‘any’, ‘none’}
. However what I want is to apply the mean
function to all numeric values, and the mode
function to all categorical values.
I can do this by using groupby
on the column I'm interested in, like in the example below:
df = pl.DataFrame(
{
"id": [0, 0, 0, 1, 1],
"color": ["red", "green", "green", "red", "red"],
"shape": ["square", "triangle", "square", "triangle", "square"],
"size": [2, 4, 6, 1, 3]
}
)
df_list = []
for gkey, group in df.groupby("id"):
g = group.select(pl.col("id"),
pl.all().exclude("id", "size").mode().first(),
pl.col("size").mean()
).unique()
df_list.append(g)
df_dedup = pl.concat(df_list)
Which gives me the output I'm expecting:
> print(df_dedup)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 1 ┆ red ┆ triangle ┆ 2.0 │
│ 0 ┆ green ┆ square ┆ 4.0 │
└─────┴───────┴──────────┴──────┘
The problem is that this implementation is (unsurprisingly) quite slow, so I was wondering if there was a better way of doing this, or if it's possible to optimize the code I have in some way.
答案1
得分: 2
以下是翻译好的部分:
- How about
In [22]: df.groupby("id").agg(
...: pl.col(["color", "shape"]).mode().sort(descending=True).first(),
...: pl.col("size").mean(),
...: )
...:
Out[22]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 1 ┆ red ┆ triangle ┆ 2.0 │
│ 0 ┆ green ┆ square ┆ 4.0 │
└─────┴───────┴──────────┴──────┘
或者,使用列选择器:
In [30]: import polars.selectors as cs
In [31]: df.groupby("id").agg(
...: cs.string().mode().sort(descending=True).first(),
...: cs.numeric().mean(),
...: )
Out[31]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 0 ┆ green ┆ square ┆ 4.0 │
│ 1 ┆ red ┆ triangle ┆ 2.0 │
└─────┴───────┴──────────┴──────┘
英文:
How about
In [22]: df.groupby("id").agg(
...: pl.col(["color", "shape"]).mode().sort(descending=True).first(),
...: pl.col("size").mean(),
...: )
...:
Out[22]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 1 ┆ red ┆ triangle ┆ 2.0 │
│ 0 ┆ green ┆ square ┆ 4.0 │
└─────┴───────┴──────────┴──────┘
Or, using the column selectors:
In [30]: import polars.selectors as cs
In [31]: df.groupby("id").agg(
...: cs.string().mode().sort(descending=True).first(),
...: cs.numeric().mean(),
...: )
Out[31]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 0 ┆ green ┆ square ┆ 4.0 │
│ 1 ┆ red ┆ triangle ┆ 2.0 │
└─────┴───────┴──────────┴──────┘
答案2
得分: 2
你可以按类型选择列,例如 pl.col(pl.Utf8)
用于所有 str
类型的列。
import polars.selectors as cs
df.groupby('id').agg(
cs.string().mode().first(),
cs.numeric().mean()
)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 1 ┆ red ┆ triangle ┆ 2.0 │
│ 0 ┆ green ┆ square ┆ 4.0 │
└─────┴───────┴──────────┴──────┘
英文:
You can select columns by type, e.g. pl.col(pl.Utf8)
for all str
columns.
There's also the new polars.selectors
helper module.
import polars.selectors as cs
df.groupby('id').agg(
cs.string().mode().first(),
cs.numeric().mean()
)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id ┆ color ┆ shape ┆ size │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪══════════╪══════╡
│ 1 ┆ red ┆ triangle ┆ 2.0 │
│ 0 ┆ green ┆ square ┆ 4.0 │
└─────┴───────┴──────────┴──────┘
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论