Polars is much slower than DuckDB in conditional join + groupby/agg context

Question

The following example involves a conditional self-join followed by a groupby/aggregate operation. It turns out that in this case, DuckDB gives much better performance than Polars (~10x on a 32-core machine).

My questions are:

  1. What could be the potential reason(s) for the slowness of Polars relative to DuckDB?
  2. Am I missing some other, faster way of doing the same thing in Polars?

    import time
    import duckdb
    import numpy as np
    import polars as pl

    ## example dataframe
    rng = np.random.default_rng(1)
    nrows = 5_000_000
    df = pl.DataFrame(
        dict(
            id=rng.integers(1, 1_000, nrows),
            id2=rng.integers(1, 10, nrows),
            id3=rng.integers(1, 500, nrows),
            value=rng.normal(0, 1, nrows),
        )
    )

    ## polars
    start = time.perf_counter()
    res = (
        df.lazy()
        .join(df.lazy(), on=["id", "id2"], how="left")
        .filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < 30)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
        .collect(streaming=True)
    )
    time.perf_counter() - start
    # 120.93155245436355

    ## duckdb
    start = time.perf_counter()
    res2 = (
        duckdb.sql(
            """
            SELECT df.*, df2.id3 as id3_right, df2.value as value_right
            FROM df JOIN df as df2
            ON (df.id = df2.id
                AND df.id2 = df2.id2
                AND df.id3 > df2.id3
                AND df.id3 - df2.id3 < 30)
            """
        )
        .aggregate(
            "id2, id3, id3_right, corr(value, value_right) as value",
            "id2, id3, id3_right",
        )
        .pl()
    )
    time.perf_counter() - start
    # 18.472263277042657
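
One detail worth noting when comparing the two snippets: the DuckDB query performs an inner join, while the Polars version asks for how="left". Unmatched rows get a null id3_right, and null comparisons are discarded by the filter anyway, so the two forms produce the same result, and how="inner" states that intent directly. A sketch of the equivalent Polars query, reusing df from above:

    # Equivalent query with an inner join (same result here, because the
    # filter drops the null `id3_right` rows that a left join would add).
    res_inner = (
        df.lazy()
        .join(df.lazy(), on=["id", "id2"], how="inner")  # was how="left"
        .filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < 30)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
        .collect(streaming=True)
    )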


Answer 1

Score: 3


EDIT: 2023-7-18

The latest Polars release has brought the difference down from 15x to 2x.

    polars v0.18.2       1125
    polars v0.18.3        140
    duckdb 0.8.2-dev1      75

Original answer

Streaming engine

The streaming API isn't as optimized yet. Polars is a younger project than DuckDB, and we don't have as many paid developers on the project.

So give us time. The next release, 0.18.3, will land a PR that can make a streaming groupby over 3.5x faster: https://github.com/pola-rs/polars/pull/9346

That just shows how much performance we still leave on the table in the streaming engine. We still have to apply the same optimization to streaming joins.

In short: our streaming engine is in its alpha stage. It is a work in progress.
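
To separate the streaming engine's overhead from the query itself, here is a minimal sketch that runs the question's query on both the default in-memory engine and the streaming engine (assuming the df and imports from the question are in scope):

    # Same lazy query as in the question, collected with both engines.
    lf = (
        df.lazy()
        .join(df.lazy(), on=["id", "id2"], how="left")
        .filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < 30)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
    )

    t0 = time.perf_counter()
    lf.collect()                # default in-memory engine
    print("in-memory:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    lf.collect(streaming=True)  # alpha-stage streaming engine
    print("streaming:", time.perf_counter() - t0)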

Different algorithm

Other than that, the DuckDB query might also be using non-equi joins under the hood, which we don't have yet in Polars, so this query might not be as optimal for Polars.
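
Until non-equi joins land in Polars, one possible workaround (not from the answer; a hedged sketch of a banding trick) is to add a coarse bucket on id3 to the equi-join keys, so that far fewer candidate pairs are materialized, and then apply the exact range filter. With a band width of 30 matching the < 30 predicate, a matching right-hand row can only fall in the same band as id3 or the band just below it:

    # Banding sketch: `df` is the 5M-row frame from the question.
    band = 30  # band width matches the `id3 - id3_right < 30` predicate

    lhs = df.lazy().with_columns((pl.col("id3") // band).alias("band"))
    rhs = df.lazy().with_columns((pl.col("id3") // band).alias("band"))

    # For 0 < id3 - id3_right < 30, id3_right lies in the same band as id3
    # or in the band just below it, so two equi-joins cover all candidates.
    candidates = pl.concat(
        [
            lhs.join(rhs, on=["id", "id2", "band"], how="inner"),
            lhs.join(
                rhs.with_columns((pl.col("band") + 1).alias("band")),
                on=["id", "id2", "band"],
                how="inner",
            ),
        ]
    )

    res_banded = (
        candidates.filter(
            (pl.col("id3") > pl.col("id3_right"))
            & (pl.col("id3") - pl.col("id3_right") < band)
        )
        .groupby(["id2", "id3", "id3_right"])
        .agg(pl.corr("value", "value_right"))
        .collect(streaming=True)
    )

Whether this actually beats the plain join depends on the data distribution; it only illustrates trading one wide equi-join for two narrower ones.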

Answer 2

Score: 2


While DuckDB does have several non-equi-join operators, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join here:

    D EXPLAIN SELECT id2, id3, id3_right, corr(value, value_right) as value
    > FROM (
    >     SELECT df.*, df2.id3 as id3_right, df2.value as value_right
    >     FROM df JOIN df as df2
    >     ON (df.id = df2.id
    >         AND df.id2 = df2.id2
    >         AND df.id3 > df2.id3
    >         AND df.id3 - df2.id3 < 30)
    > ) tbl
    > GROUP BY ALL
    > ;
    ┌─────────────────────────────┐
    │┌───────────────────────────┐│
    ││       Physical Plan       ││
    │└───────────────────────────┘│
    └─────────────────────────────┘
    ┌───────────────────────────┐
    │       HASH_GROUP_BY       │
    │             #0            │
    │             #1            │
    │             #2            │
    │       corr(#3, #4)        │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         PROJECTION        │
    │            id2            │
    │            id3            │
    │         id3_right         │
    │           value           │
    │        value_right        │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         PROJECTION        │
    │             #1            │
    │             #2            │
    │             #3            │
    │             #4            │
    │             #5            │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │           FILTER          │
    │     ((id3 - id3) < 30)    │
    │      EC: 24727992087      │
    └─────────────┬─────────────┘
    ┌─────────────┴─────────────┐
    │         HASH_JOIN         │
    │           INNER           │
    │          id = id          │
    │         id2 = id2         ├──────────────┐
    │         id3 > id3         │              │
    │      EC: 24727992087      │              │
    │     Cost: 24727992087     │              │
    └─────────────┬─────────────┘              │
    ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
    │          SEQ_SCAN         ││          SEQ_SCAN         │
    │             df            ││             df            │
    │             id            ││             id            │
    │            id2            ││            id2            │
    │            id3            ││            id3            │
    │           value           ││           value           │
    │        EC: 5000000        ││        EC: 5000000        │
    └───────────────────────────┘└───────────────────────────┘

We plan to address this in a future release.

Note also that the IEJoin algorithm requires two inequalities, and the query only has one. Single inequalities can be handled by the PieceWiseMergeJoin operator, but PWMJ does not currently handle simple equalities (the logic would just have to be extended to handle NULLs correctly).
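
As a quick experiment from Python (not part of the answer): the arithmetic predicate can be rewritten as a simple inequality, since df.id3 - df2.id3 < 30 is equivalent to df2.id3 > df.id3 - 30, which leaves two plain inequalities in the ON clause. Calling .explain() on the relation prints the physical plan, so you can check whether the planner still prefers the hash join given the selectivity assumption described above. A sketch, reusing the df from the question:

    # Rewritten ON clause: two simple inequalities instead of one inequality
    # plus an arithmetic expression. The planner may still pick the hash
    # join; `.explain()` shows which operator it actually chose.
    print(
        duckdb.sql(
            """
            SELECT df.*, df2.id3 AS id3_right, df2.value AS value_right
            FROM df JOIN df AS df2
            ON (df.id = df2.id
                AND df.id2 = df2.id2
                AND df.id3 > df2.id3
                AND df2.id3 > df.id3 - 30)
            """
        ).explain()
    )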
