Polars: fill nulls with the only valid value within each group

Question

In Polars, you can achieve the desired output using window functions. Here's how you can do it for each column:

import polars as pl

# Your DataFrame
data = {
    'group': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'col1': [1, None, None, None, 3, None, None, None, 5],
    'col2': ['a', None, None, None, 'b', None, None, None, 'c'],
    'col3': [False, None, None, None, True, None, None, None, False]
}
df = pl.DataFrame(data)

# Forward-fill, then backward-fill, every non-group column within each group
df = df.with_columns(
    pl.exclude('group').forward_fill().backward_fill().over('group')
)

print(df)

This code uses a window expression partitioned by the 'group' column to forward-fill and then backward-fill each column, so every row in a group receives that group's single non-null value. The result matches the desired output.

Each group has exactly one valid (non-null) value, in an arbitrary row. How do you fill every row of each group with that value?

import polars as pl

data = {
    'group': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'col1': [1, None, None, None, 3, None, None, None, 5],
    'col2': ['a', None, None, None, 'b', None, None, None, 'c'],
    'col3': [False, None, None, None, True, None, None, None, False]
}
df = pl.DataFrame(data)
print(df)

shape: (9, 4)
┌───────┬──────┬──────┬───────┐
│ group ┆ col1 ┆ col2 ┆ col3  │
│ ---   ┆ ---  ┆ ---  ┆ ---   │
│ str   ┆ i64  ┆ str  ┆ bool  │
╞═══════╪══════╪══════╪═══════╡
│ 1     ┆ 1    ┆ a    ┆ false │
│ 1     ┆ null ┆ null ┆ null  │
│ 1     ┆ null ┆ null ┆ null  │
│ 2     ┆ null ┆ null ┆ null  │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 2     ┆ null ┆ null ┆ null  │
│ 3     ┆ null ┆ null ┆ null  │
│ 3     ┆ null ┆ null ┆ null  │
│ 3     ┆ 5    ┆ c    ┆ false │
└───────┴──────┴──────┴───────┘

Desired output:

shape: (9, 4)
┌───────┬──────┬──────┬───────┐
│ group ┆ col1 ┆ col2 ┆ col3  │
│ ---   ┆ ---  ┆ ---  ┆ ---   │
│ str   ┆ i64  ┆ str  ┆ bool  │
╞═══════╪══════╪══════╪═══════╡
│ 1     ┆ 1    ┆ a    ┆ false │
│ 1     ┆ 1    ┆ a    ┆ false │
│ 1     ┆ 1    ┆ a    ┆ false │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 2     ┆ 3    ┆ b    ┆ true  │
│ 3     ┆ 5    ┆ c    ┆ false │
│ 3     ┆ 5    ┆ c    ┆ false │
│ 3     ┆ 5    ┆ c    ┆ false │
└───────┴──────┴──────┴───────┘

In pandas, I can do the following for each column:

import pandas as pd

df = pd.DataFrame(data)
# Fill forward, then backward, within each group
df['col1'] = df.groupby('group')['col1'].transform(lambda x: x.ffill().bfill())

How do you do this in Polars, ideally with a window function (.over())?

Answer 1

Score: 3

The most direct way to do exactly what you asked, and the one that looks most like your pandas approach, is:

df.with_columns(pl.exclude('group').forward_fill().backward_fill().over('group'))

Using pl.all() instead of pl.exclude('group') also works, but pl.exclude('group') saves some theoretical time by not looking through the group column for fills.

If there's a specific list of columns you want to do this to (as opposed to all but group), you can replace the pl.exclude with a generator or list comprehension:

cols = ['col1', 'col2', 'col3']
df.with_columns(pl.col(x).forward_fill().backward_fill().over('group') for x in cols)

You can even use a regex in pl.col, as long as you use the ^ and $ anchors (a raw string keeps Python from treating \d as an escape sequence):

df.with_columns(pl.col(r"^col\d$").forward_fill().backward_fill().over('group'))

Another approach besides forward/backward fills:

df.with_columns(pl.col(r"^col\d$").drop_nulls().first().over('group'))

If the first() looks a little odd, it is there because drop_nulls() returns a different number of rows than the original df, which on its own would cause an error. If the expression is an aggregation (like sum, min, max, etc.), Polars doesn't complain about the different number of rows and instead propagates that single result to all the rows in the group. Here first() is the aggregation; it just takes the first value it sees. Since drop_nulls() leaves only one value per group, we just need a way to tell Polars to propagate it.

The different column selection tricks work with this approach too, but I'll spare the reader the extra copy/paste.

Final note:

If your next step is to take the unique rows, you should just start with a group-by instead (the method is group_by in Polars 0.19 and later, groupby in older releases):

df \
    .group_by('group', maintain_order=True) \
    .agg(pl.col(r"^col\d$").drop_nulls().first())

Answer 2

Score: 1

One option is to first extract the non-null rows in each group, and then join those values back into the table.

import polars as pl

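data = pl.DataFrame({
    'group': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'col': [1, None, None, None, 3, None, None, None, 5]
})

# Keep the single non-null row of each group and rename the value column
data_non_null = (
    data
    .filter(pl.col('col').is_not_null())
    .select('group', pl.col('col').alias('col_non_null'))
)

# Join the per-group value back onto every row of that group
data = (
    data
    .join(data_non_null, on='group')
    .select('group', pl.col('col_non_null').alias('col'))
)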

This is an application of a very general pattern that should also work in pandas, SQL, the various R data frame libraries, etc.

Note that you usually need some kind of grouping operation to apply this pattern. Here, however, we know (or assume) that there is exactly one non-null value in each group, so we only have to filter for non-null values.

Posted by huangapple on 2023-03-23 08:10:48
When reposting, please keep the link to this article: https://go.coder-hub.com/75818269.html