How to use apply better in Polars?
Question
import polars as pl
import numpy as np

# Create the polars DataFrame
df = pl.DataFrame({
    "a": [1, 4, 3, 2, 8, 4, 5, 6],
    "b": [2, 3, 1, 3, 9, 7, 6, 8],
    "c": [1, 1, 1, 1, 2, 2, 2, 2],
})

# Define the convert function
def convert(a, b):
    # Your logic for converting a and b
    pass

# Define a custom UDF for applying the convert function
# (note: Polars does not provide a pl.udf decorator; this block is pseudocode
# for a user-defined function)
@pl.udf
def apply_convert(a: pl.Series, b: pl.Series, c: int) -> pl.Series:
    a_np = np.array(a.to_list())
    b_np = np.array(b.to_list())
    if (a_np < b_np).all():
        return a
    else:
        converted_values = convert(a_np, b_np)
        return pl.Series(converted_values)

# Apply the groupby and custom UDF
result = df.groupby("c").agg(
    apply_convert(pl.col("a"), pl.col("b"), pl.col("c")).alias("a")
)

# Print the result
print(result)
I have a polars dataframe illustrated as follows.
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 4, 3, 2, 8, 4, 5, 6],
        "b": [2, 3, 1, 3, 9, 7, 6, 8],
        "c": [1, 1, 1, 1, 2, 2, 2, 2],
    }
)
The task I have is:
- groupby column "c"
- for each group, check whether all numbers from column "a" are less than the corresponding values from column "b".
- If so, just return column "a" unchanged in the groupby context.
- Otherwise, apply a third-party function called "convert", which takes two numpy arrays and returns a single numpy array of the same size. In my case, I can first convert columns "a" and "b" to numpy arrays and supply them as inputs to "convert", and finally return the array returned from "convert" (probably transformed into a polars Series before returning) in the groupby context.
So, for the example above, the output I want is as follows (exploded after groupby for better illustration).
shape: (8, 2)
┌─────┬─────┐
│ c   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 1   ┆ 3   │
│ 1   ┆ 1   │
│ 1   ┆ 2   │
│ 2   ┆ 8   │
│ 2   ┆ 4   │
│ 2   ┆ 5   │
│ 2   ┆ 6   │
└─────┴─────┘
With the assumption that:
>>> import numpy as np
>>> convert(np.array([1, 4, 3, 2]), np.array([2, 3, 1, 3]))
np.array([1, 3, 1, 2])
# [1, 4, 3, 2] comes from column a of df where column c is 1, and [2, 3, 1, 3] comes from column b of df where column c is 1.
# I have to apply my custom Python function 'convert' for the c == 1 group, because not all values in a are smaller than those in b, per the task description above.
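For concreteness, the per-group logic described above, written as a plain numpy function, is roughly the following sketch (convert is stubbed with np.minimum purely so the snippet runs and reproduces the example output; the real convert is a third-party black box):

import numpy as np

def convert(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Stand-in only: np.minimum happens to reproduce the example above,
    # but the real third-party convert can be arbitrary.
    return np.minimum(a, b)

def process_group(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Keep a untouched if it is already elementwise below b,
    # otherwise fall back to the third-party convert.
    if (a < b).all():
        return a
    return convert(a, b)

print(process_group(np.array([1, 4, 3, 2]), np.array([2, 3, 1, 3])))  # [1 3 1 2]
print(process_group(np.array([8, 4, 5, 6]), np.array([9, 7, 6, 8])))  # [8 4 5 6]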
My question is: how am I supposed to implement this logic in a performant, Polars-idiomatic way without sacrificing too much of the speed gained from running Rust code and parallelization?
The reason I ask is that, from my understanding, using apply with a custom Python function will slow down the program, but in certain scenarios I will not need to resort to the third-party function at all. So, is there any way I can get the best of both worlds: get the full benefits of Polars where no third-party function is required, and only apply the third-party function when necessary?
Answer 1
Score: 3
It sounds like you want to find matching groups:
(
    df
    .with_row_count()
    .filter(
        (pl.col("a") >= pl.col("b"))
        .any()
        .over("c"))
)
shape: (4, 4)
┌────────┬─────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   ┆ c   │
│ ---    ┆ --- ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   ┆ 1   │
│ 1      ┆ 4   ┆ 3   ┆ 1   │
│ 2      ┆ 3   ┆ 1   ┆ 1   │
│ 3      ┆ 2   ┆ 3   ┆ 1   │
└────────┴─────┴─────┴─────┘
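As a quick side illustration (not part of the pipeline itself), the same window expression can be materialised as a boolean column to see why the filter keeps the whole c == 1 group:

(
    df
    .with_columns(
        (pl.col("a") >= pl.col("b"))
        .any()
        .over("c")
        .alias("needs_convert"))
)
# needs_convert is true for every row of the c == 1 group and false for the
# c == 2 group, so the filter above keeps or drops entire groups.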
And apply your custom function over each group.
(
    df
    .with_row_count()
    .filter(
        (pl.col("a") >= pl.col("b"))
        .any()
        .over("c"))
    .select(
        pl.col("row_nr"),
        pl.apply(
            ["a", "b"],  # np.minimum is just for example purposes
            lambda s: np.minimum(s[0], s[1]))
        .over("c"))
)
shape: (4, 2)
┌────────┬─────┐
│ row_nr ┆ a   │
│ ---    ┆ --- │
│ u32    ┆ i64 │
╞════════╪═════╡
│ 0      ┆ 1   │
│ 1      ┆ 3   │
│ 2      ┆ 1   │
│ 3      ┆ 2   │
└────────┴─────┘
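With the real convert in place of the np.minimum example, the same expression would look roughly like this (assuming convert accepts and returns numpy arrays, as described in the question):

(
    df
    .with_row_count()
    .filter(
        (pl.col("a") >= pl.col("b"))
        .any()
        .over("c"))
    .select(
        pl.col("row_nr"),
        pl.apply(
            ["a", "b"],
            # s is the list of Series for the group; convert is the
            # third-party function from the question (numpy in, numpy out)
            lambda s: pl.Series(convert(s[0].to_numpy(), s[1].to_numpy())))
        .over("c"))
)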
(Note: there may be some useful information in https://stackoverflow.com/questions/75303038/how-to-write-poisson-cdf-as-python-polars-expression/75311287 with regards to scipy/numpy ufuncs and potentially avoiding .apply())
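As a small illustration of skipping .apply() when the fallback happens to be expressible natively: the np.minimum used as a stand-in above could be written with pure expressions. The real convert may not allow this, in which case the pl.apply approach above still applies.

df.with_columns(
    pl.when(pl.col("a") < pl.col("b"))
    .then(pl.col("a"))
    .otherwise(pl.col("b"))
    .alias("min_ab"))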
You can then .join() the result back into the original data.
(
    df
    .with_row_count()
    .join(
        df
        .with_row_count()
        .filter(
            (pl.col("a") >= pl.col("b"))
            .any()
            .over("c"))
        .select(
            pl.col("row_nr"),
            pl.apply(
                ["a", "b"],
                lambda s: np.minimum(s[0], s[1]))
            .over("c")),
        on="row_nr",
        how="left")
)
shape: (8, 5)
┌────────┬─────┬─────┬─────┬─────────┐
│ row_nr ┆ a   ┆ b   ┆ c   ┆ a_right │
│ ---    ┆ --- ┆ --- ┆ --- ┆ ---     │
│ u32    ┆ i64 ┆ i64 ┆ i64 ┆ i64     │
╞════════╪═════╪═════╪═════╪═════════╡
│ 0      ┆ 1   ┆ 2   ┆ 1   ┆ 1       │
│ 1      ┆ 4   ┆ 3   ┆ 1   ┆ 3       │
│ 2      ┆ 3   ┆ 1   ┆ 1   ┆ 1       │
│ 3      ┆ 2   ┆ 3   ┆ 1   ┆ 2       │
│ 4      ┆ 8   ┆ 9   ┆ 2   ┆ null    │
│ 5      ┆ 4   ┆ 7   ┆ 2   ┆ null    │
│ 6      ┆ 5   ┆ 6   ┆ 2   ┆ null    │
│ 7      ┆ 6   ┆ 8   ┆ 2   ┆ null    │
└────────┴─────┴─────┴─────┴─────────┘
You can then fill in the nulls.
.with_columns(
    pl.col("a_right").fill_null(pl.col("a")))