Polars在条件连接+分组/聚合上比DuckDB慢得多。

huangapple go评论58阅读模式
英文:

Polars is much slower than DuckDB in conditional join + groupby/agg context

问题

For the following example, where it involves a self conditional join and a subsequent groupby/aggregate operation. It turned out that in such case, 'DuckDB' gives much better performance than 'Polars' (~10x on a 32-core machine).

My questions are:

  1. What could be the potential reason(s) for the slowness (relative to 'DuckDB') of 'Polars'?
  2. Am I missing some other faster ways of doing the same thing in 'Polars'?
import time

import duckdb
import numpy as np
import polars as pl

## example dataframe
rng = np.random.default_rng(1)

nrows = 5_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 1_000, nrows),
        id2=rng.integers(1, 10, nrows),
        id3=rng.integers(1, 500, nrows),
        value=rng.normal(0, 1, nrows),
    )
)

## polars
start = time.perf_counter()
res = (
    df.lazy()
    .join(df.lazy(), on=["id", "id2"], how="left")
    .filter(
        (pl.col("id3") > pl.col("id3_right"))
        & (pl.col("id3") - pl.col("id3_right") < 30)
    )
    .groupby(["id2", "id3", "id3_right"])
    .agg(pl.corr("value", "value_right"))
    .collect(streaming=True)
)
time.perf_counter() - start
# 120.93155245436355

## duckdb
start = time.perf_counter()
res2 = (
    duckdb.sql(
        """
        SELECT df.*, df2.id3 as id3_right, df2.value as value_right
        FROM df JOIN df as df2
        ON (df.id = df2.id
        AND df.id2 = df2.id2
        AND df.id3 > df2.id3
        AND df.id3 - df2.id3 < 30)
        """
    )
    .aggregate(
        "id2, id3, id3_right, corr(value, value_right) as value",
        "id2, id3, id3_right",
    )
    .pl()
)
time.perf_counter() - start
# 18.472263277042657

(Note: I've removed the HTML escape codes from the code snippet.)

英文:

For the following example, where it involves a self conditional join and a subsequent groupby/aggregate operation. It turned out that in such case, DuckDB gives much better performance than Polars (~10x on a 32-core machine).

My questions are:

  1. What could be the potential reason(s) for the slowness (relative to DuckDB) of Polars?
  2. Am I missing some other faster ways of doing the same thing in Polars?
import time

import duckdb
import numpy as np
import polars as pl

## example dataframe
rng = np.random.default_rng(1)

nrows = 5_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 1_000, nrows),
        id2=rng.integers(1, 10, nrows),
        id3=rng.integers(1, 500, nrows),
        value=rng.normal(0, 1, nrows),
    )
)

## polars
start = time.perf_counter()
res = (
    df.lazy()
    .join(df.lazy(), on=[&quot;id&quot;, &quot;id2&quot;], how=&quot;left&quot;)
    .filter(
        (pl.col(&quot;id3&quot;) &gt; pl.col(&quot;id3_right&quot;))
        &amp; (pl.col(&quot;id3&quot;) - pl.col(&quot;id3_right&quot;) &lt; 30)
    )
    .groupby([&quot;id2&quot;, &quot;id3&quot;, &quot;id3_right&quot;])
    .agg(pl.corr(&quot;value&quot;, &quot;value_right&quot;))
    .collect(streaming=True)
)
time.perf_counter() - start
# 120.93155245436355

## duckdb
start = time.perf_counter()
res2 = (
    duckdb.sql(
        &quot;&quot;&quot;
        SELECT df.*, df2.id3 as id3_right, df2.value as value_right
        FROM df JOIN df as df2
        ON (df.id = df2.id
        AND df.id2 = df2.id2
        AND df.id3 &gt; df2.id3
        AND df.id3 - df2.id3 &lt; 30)
        &quot;&quot;&quot;
    )
    .aggregate(
        &quot;id2, id3, id3_right, corr(value, value_right) as value&quot;,
        &quot;id2, id3, id3_right&quot;,
    )
    .pl()
)
time.perf_counter() - start
# 18.472263277042657

答案1

得分: 3

EDIT: 2023-7-18

最新的 Polars 发布将差距从 15 倍降至 2 倍。

polars v0.18.2 1125
polars v0.18.3 140
duckdb 0.8.2-dev1 75

Original answer

流处理引擎

流处理 API 还没有进行过优化。Polars 项目比 DuckDB 年轻,我们没有像 DuckDB 那样多的付费开发人员。

所以请给我们时间。下一个版本 0.18.3 将会推出一个可以使流式分组操作快 3.5 倍的 PR,详情请查看 https://github.com/pola-rs/polars/pull/9346

这只是显示了我们在流处理引擎方面还有多少工作要做。我们还需要对流式连接执行相同的优化。

简而言之,我们的流处理引擎处于 alpha 阶段,还在不断改进中。

不同的算法

除此之外,DuckDB 查询可能还在底层使用非等值连接,而这在 Polars 中尚未实现,因此对于 Polars 来说,这个查询可能不太优化。

英文:

EDIT: 2023-7-18

The latest polars release has brought the difference down from 15x to 2x.

polars v0.18.2 	1125
polars v0.18.3 	140
duckdb 0.8.2-dev1 	75

Original answer

Streaming engine

The streaming API isn't as optimized yet. Polars is a younger project than DuckDB and we haven't got as many paid developers on the project.

So give us time. Next release 0.18.3 will land a PR that can make a streaming groupby over 3.5x faster https://github.com/pola-rs/polars/pull/9346.

That just shows how much we still have on the table on the streaming engine. That same optimization we still have to do for streaming joins.

In short. Our streaming engine is in alpha stage. It is work in progress.

Different algorithm

Other that that, the duckdb query might also be using non-equi joins under the hood which we don't have yet in polars, so this query might not be as optimal for polars.

答案2

得分: 2

DuckDB目前虽然支持多个非等值连接,但查询规划器当前假设所有等值谓词都比不等值选择性更高,因此在这里生成哈希连接:

D EXPLAIN SELECT id2, id3, id3_right, corr(value, value_right) as value
&gt; FROM (
&gt;     SELECT df.*, df2.id3 as id3_right, df2.value as value_right
&gt;     FROM df JOIN df as df2
&gt;         ON (df.id = df2.id
&gt;         AND df.id2 = df2.id2
&gt;         AND df.id3 &gt; df2.id3
&gt;         AND df.id3 - df2.id3 &lt; 30)
&gt;     ) tbl
&gt; GROUP BY ALL
&gt; ;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐                             
       HASH_GROUP_BY                                    
                                             
             #0                                         
             #1                                         
             #2                                         
        corr(#3, #4)                                    
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
         PROJECTION                                     
                                             
            id2                                         
            id3                                         
         id3_right                                      
           value                                        
        value_right                                     
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
         PROJECTION                                     
                                             
             #1                                         
             #2                                         
             #3                                         
             #4                                         
             #5                                         
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
           FILTER                                       
                                             
     ((id3 - id3) &lt; 30)                                 
                                             
      EC: 24727992087                                   
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
         HASH_JOIN                                      
                                             
           INNER                                        
          id = id                                       
         id2 = id2         ├──────────────┐              
         id3 &gt; id3                                     
                                            
      EC: 24727992087                                  
     Cost: 24727992087                                 
└─────────────┬─────────────┘                                                         
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
         SEQ_SCAN          ││         SEQ_SCAN          
                ││                
             df            ││             df            
                ││                
             id            ││             id            
            id2            ││            id2            
            id3            ││            id3            
           value           ││           value           
                ││                
        EC: 5000000        ││        EC: 5000000        
└───────────────────────────┘└───────────────────────────┘    

我们计划在将来的版本中解决这个问题。

还请注意,IEJoin算法需要两个不等式,而查询只有一个。单一不等式可以由PieceWiseMergeJoin运算符处理,但PWMJ目前不处理简单的等式(逻辑只需扩展以正确处理NULL)。

英文:

While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities an just generates a hash join here:

D EXPLAIN SELECT id2, id3, id3_right, corr(value, value_right) as value
&gt; FROM (
&gt;     SELECT df.*, df2.id3 as id3_right, df2.value as value_right
&gt;     FROM df JOIN df as df2
&gt;         ON (df.id = df2.id
&gt;         AND df.id2 = df2.id2
&gt;         AND df.id3 &gt; df2.id3
&gt;         AND df.id3 - df2.id3 &lt; 30)
&gt;     ) tbl
&gt; GROUP BY ALL
&gt; ;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐                             
       HASH_GROUP_BY                                    
                                             
             #0                                         
             #1                                         
             #2                                         
        corr(#3, #4)                                    
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
         PROJECTION                                     
                                             
            id2                                         
            id3                                         
         id3_right                                      
           value                                        
        value_right                                     
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
         PROJECTION                                     
                                             
             #1                                         
             #2                                         
             #3                                         
             #4                                         
             #5                                         
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
           FILTER                                       
                                             
     ((id3 - id3) &lt; 30)                                 
                                             
      EC: 24727992087                                   
└─────────────┬─────────────┘                                                          
┌─────────────┴─────────────┐                             
         HASH_JOIN                                      
                                             
           INNER                                        
          id = id                                       
         id2 = id2         ├──────────────┐              
         id3 &gt; id3                                     
                                            
      EC: 24727992087                                  
     Cost: 24727992087                                 
└─────────────┬─────────────┘                                                         
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
         SEQ_SCAN          ││         SEQ_SCAN          
                ││                
             df            ││             df            
                ││                
             id            ││             id            
            id2            ││            id2            
            id3            ││            id3            
           value           ││           value           
                ││                
        EC: 5000000        ││        EC: 5000000        
└───────────────────────────┘└───────────────────────────┘    

We plan to address this in a future release.

Note also that the IEJoin algorithm requires two inequalities and the query only has one. Single inequalities could be handled by the PieceWiseMergeJoin operator, but PWMJ does not currently handle simple equalities (the logic would just have to be extended to handle NULLs correctly).

huangapple
  • 本文由 发表于 2023年6月15日 11:33:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/76478913.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定