Polars is much slower than DuckDB in conditional join + groupby/agg context

Question

The following example involves a conditional self-join followed by a groupby/aggregate operation. It turns out that in this case, DuckDB gives much better performance than Polars (~10x on a 32-core machine).

My questions are:

  1. What could be the potential reason(s) for the slowness of Polars relative to DuckDB?
  2. Am I missing some other, faster way of doing the same thing in Polars?

    import time
    import duckdb
    import numpy as np
    import polars as pl

    ## example dataframe
    rng = np.random.default_rng(1)
    nrows = 5_000_000
    df = pl.DataFrame(
        dict(
            id=rng.integers(1, 1_000, nrows),
            id2=rng.integers(1, 10, nrows),
            id3=rng.integers(1, 500, nrows),
            value=rng.normal(0, 1, nrows),
        )
    )

    ## polars
    start = time.perf_counter()
    res = (
        df.lazy()
        .join(df.lazy(), on=["id", "id2"], how="left")
        .filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < 30)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
        .collect(streaming=True)
    )
    time.perf_counter() - start
    # 120.93155245436355

    ## duckdb
    start = time.perf_counter()
    res2 = (
        duckdb.sql(
            """
            SELECT df.*, df2.id3 as id3_right, df2.value as value_right
            FROM df JOIN df as df2
            ON (df.id = df2.id
                AND df.id2 = df2.id2
                AND df.id3 > df2.id3
                AND df.id3 - df2.id3 < 30)
            """
        )
        .aggregate(
            "id2, id3, id3_right, corr(value, value_right) as value",
            "id2, id3, id3_right",
        )
        .pl()
    )
    time.perf_counter() - start
    # 18.472263277042657
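
One detail worth noting when comparing the two snippets: the DuckDB query performs an inner join, while the Polars version asks for how="left". Unmatched rows get a null id3_right, and null comparisons are discarded by the filter anyway, so the two forms produce the same result, and how="inner" states that intent directly. A sketch of the equivalent Polars query, reusing df from above:

    # Equivalent query with an inner join (same result here, because the
    # filter drops the null `id3_right` rows that a left join would add).
    res_inner = (
        df.lazy()
        .join(df.lazy(), on=["id", "id2"], how="inner")  # was how="left"
        .filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < 30)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
        .collect(streaming=True)
    )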


Answer 1

Score: 3


EDIT: 2023-7-18

The latest Polars release has brought the difference down from 15x to 2x.

    polars v0.18.2       1125
    polars v0.18.3        140
    duckdb 0.8.2-dev1      75

Original answer

Streaming engine

The streaming API isn't as optimized yet. Polars is a younger project than DuckDB, and we don't have as many paid developers on the project.

So give us time. The next release, 0.18.3, will land a PR that can make a streaming groupby over 3.5x faster: https://github.com/pola-rs/polars/pull/9346

That just shows how much performance we still leave on the table in the streaming engine. We still have to apply the same optimization to streaming joins.

In short: our streaming engine is in its alpha stage. It is a work in progress.
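
To separate the streaming engine's overhead from the query itself, here is a minimal sketch that runs the question's query on both the default in-memory engine and the streaming engine (assuming the df and imports from the question are in scope):

    # Same lazy query as in the question, collected with both engines.
    lf = (
        df.lazy()
        .join(df.lazy(), on=["id", "id2"], how="left")
        .filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < 30)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
    )

    t0 = time.perf_counter()
    lf.collect()                # default in-memory engine
    print("in-memory:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    lf.collect(streaming=True)  # alpha-stage streaming engine
    print("streaming:", time.perf_counter() - t0)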

Different algorithm

Other than that, the DuckDB query might also be using non-equi joins under the hood, which we don't have yet in Polars, so this query might not be as optimal for Polars.
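
Until non-equi joins land in Polars, one possible workaround (not from the answer; a hedged sketch of a banding trick) is to add a coarse bucket on id3 to the equi-join keys, so that far fewer candidate pairs are materialized, and then apply the exact range filter. With a band width of 30 matching the < 30 predicate, a matching right-hand row can only fall in the same band as id3 or the band just below it:

    # Banding sketch: `df` is the 5M-row frame from the question.
    band = 30  # band width matches the `id3 - id3_right < 30` predicate

    lhs = df.lazy().with_columns((pl.col("id3") // band).alias("band"))
    rhs = df.lazy().with_columns((pl.col("id3") // band).alias("band"))

    # For 0 < id3 - id3_right < 30, id3_right lies in the same band as id3
    # or in the band just below it, so two equi-joins cover all candidates.
    candidates = pl.concat(
        [
            lhs.join(rhs, on=["id", "id2", "band"], how="inner"),
            lhs.join(
                rhs.with_columns((pl.col("band") + 1).alias("band")),
                on=["id", "id2", "band"],
                how="inner",
            ),
        ]
    )

    res_banded = (
        candidates.filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < band)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
        .collect(streaming=True)
    )

Whether this actually beats the plain join depends on the data distribution; it only illustrates trading one wide equi-join for two narrower ones.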

Answer 2

Score: 2


While DuckDB does have several non-equi-join operators, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join here:

    D EXPLAIN SELECT id2, id3, id3_right, corr(value, value_right) as value
    > FROM (
    >     SELECT df.*, df2.id3 as id3_right, df2.value as value_right
    >     FROM df JOIN df as df2
    >     ON (df.id = df2.id
    >         AND df.id2 = df2.id2
    >         AND df.id3 > df2.id3
    >         AND df.id3 - df2.id3 < 30)
    > ) tbl
    > GROUP BY ALL
    > ;
    ┌─────────────────────────────┐
    │┌───────────────────────────┐│
    ││       Physical Plan       ││
    │└───────────────────────────┘│
    └─────────────────────────────┘
    ┌───────────────────────────┐
    │       HASH_GROUP_BY       │
    │             #0            │
    │             #1            │
    │             #2            │
    │       corr(#3, #4)        │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         PROJECTION        │
    │            id2            │
    │            id3            │
    │         id3_right         │
    │           value           │
    │        value_right        │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         PROJECTION        │
    │             #1            │
    │             #2            │
    │             #3            │
    │             #4            │
    │             #5            │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │           FILTER          │
    │     ((id3 - id3) < 30)    │
    │      EC: 24727992087      │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         HASH_JOIN         │
    │           INNER           │
    │          id = id          │
    │         id2 = id2         ├──────────────┐
    │         id3 > id3         │              │
    │      EC: 24727992087      │              │
    │     Cost: 24727992087     │              │
    └─────────────┬─────────────┘              │
    ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
    │          SEQ_SCAN         ││          SEQ_SCAN         │
    │             df            ││             df            │
    │             id            ││             id            │
    │            id2            ││            id2            │
    │            id3            ││            id3            │
    │           value           ││           value           │
    │        EC: 5000000        ││        EC: 5000000        │
    └───────────────────────────┘└───────────────────────────┘

We plan to address this in a future release.

Note also that the IEJoin algorithm requires two inequalities, and the query only has one. Single inequalities can be handled by the PieceWiseMergeJoin operator, but PWMJ does not currently handle simple equalities (the logic would just have to be extended to handle NULLs correctly).
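
As a quick experiment from Python (not part of the answer): the arithmetic predicate can be rewritten as a simple inequality, since df.id3 - df2.id3 < 30 is equivalent to df2.id3 > df.id3 - 30, which leaves two plain inequalities in the ON clause. Calling .explain() on the relation prints the physical plan, so you can check whether the planner still prefers the hash join given the selectivity assumption described above. A sketch, reusing the df from the question:

    # Rewritten ON clause: two simple inequalities instead of one inequality
    # plus an arithmetic expression. The planner may still pick the hash
    # join; `.explain()` shows which operator it actually chose.
    print(
        duckdb.sql(
            """
            SELECT df.*, df2.id3 AS id3_right, df2.value AS value_right
            FROM df JOIN df AS df2
            ON (df.id = df2.id
                AND df.id2 = df2.id2
                AND df.id3 > df2.id3
                AND df2.id3 > df.id3 - 30)
            """
        ).explain()
    )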
