Filter and aggregate based on other dataframe

Question



Say I have

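import polars as pl

# All-float lists keep the constructor happy on Polars versions that reject mixed int/float input.
df1 = pl.DataFrame({'start': [1., 2., 4.], 'end': [2., 4., 6.]})
df2 = pl.DataFrame({'idx': [1., 1.7, 2.3, 2.5, 3., 4.], 'values': [3, 1, 4, 2, 3, 5]})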

They look like this:

In [8]: df1
Out[8]:
shape: (3, 2)
┌───────┬─────┐
│ start ┆ end │
│ ---   ┆ --- │
│ f64   ┆ f64 │
╞═══════╪═════╡
│ 1.0   ┆ 2.0 │
│ 2.0   ┆ 4.0 │
│ 4.0   ┆ 6.0 │
└───────┴─────┘

In [9]: df2
Out[9]:
shape: (6, 2)
┌─────┬────────┐
│ idx ┆ values │
│ --- ┆ ---    │
│ f64 ┆ i64    │
╞═════╪════════╡
│ 1.0 ┆ 3      │
│ 1.7 ┆ 1      │
│ 2.3 ┆ 4      │
│ 2.5 ┆ 2      │
│ 3.0 ┆ 3      │
│ 4.0 ┆ 5      │
└─────┴────────┘

I would like to end up with something like this:

In [6]: expected = pl.DataFrame({
   ...:     'start': [1., 2., 4.],
   ...:     'end': [2., 4., 6.],
   ...:     'sum_values': [4, 9, 5]
   ...: })

In [7]: expected
Out[7]:
shape: (3, 3)
┌───────┬─────┬────────────┐
│ start ┆ end ┆ sum_values │
│ ---   ┆ --- ┆ ---        │
│ f64   ┆ f64 ┆ i64        │
╞═══════╪═════╪════════════╡
│ 1.0   ┆ 2.0 ┆ 4          │
│ 2.0   ┆ 4.0 ┆ 9          │
│ 4.0   ┆ 6.0 ┆ 5          │
└───────┴─────┴────────────┘

Here's an inefficient way of doing it I came up with, using apply:

(
    df1.with_columns(
        df1.apply(
            lambda row: df2.filter(
                pl.col("idx").is_between(row[0], row[1], closed="left")
            )["values"].sum()
        )["apply"].alias("sum_values")
    )
)

It gives the correct output, but because it uses apply and a Python lambda function, it's not as performant as it could be.

Is there a way to write this using polars native expressions API?

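Answer 1

Score: 2

I'm not sure if there is another way apart from a cross join:

# Pair every interval with every row of df2, keep the rows whose idx falls
# in [start, end), then sum per interval.
# Older Polars releases spell group_by as groupby.
(df1.join(df2, how='cross')
.filter(pl.col('idx').is_between('start', 'end', closed='left'))
.group_by('start', 'end')
.sum()
)

shape: (3, 4)
┌───────┬─────┬─────┬────────┐
│ start ┆ end ┆ idx ┆ values │
│ ---   ┆ --- ┆ --- ┆ ---    │
│ f64   ┆ f64 ┆ f64 ┆ i64    │
╞═══════╪═════╪═════╪════════╡
│ 4.0   ┆ 6.0 ┆ 4.0 ┆ 5      │
│ 1.0   ┆ 2.0 ┆ 2.7 ┆ 4      │
│ 2.0   ┆ 4.0 ┆ 7.8 ┆ 9      │
└───────┴─────┴─────┴────────┘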





huangapple
  • Posted on June 18, 2023 at 18:48:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76500139.html