February 6, 2023 16:26:16

What is the interpretation of rsd in pyspark's approx_count_distinct and what are the consequences of changing it?

Question

In pyspark's approx_count_distinct function there is a precision argument, rsd. How does it work? What are the tradeoffs if it is increased or decreased? I guess that to answer this, one should understand how approx_count_distinct is implemented. Can you help me understand rsd in the context of the logic of approx_count_distinct?

Answer 1

Score: 1

rsd stands for "relative standard deviation", and its default value is 0.05. It sets the error you are willing to accept on the distinct count: a smaller rsd gives a more accurate estimate at the cost of more memory and computation, while a larger rsd is cheaper but less accurate. As @Derek O described in their comment, the approx_count_distinct function makes a tradeoff between accuracy (which you control using the rsd parameter) and speed of the calculation.

To understand the underlying algorithm a bit better, we can take a quick look at the implementation of approx_count_distinct in Spark's Scala source. It uses the HyperLogLogPlusPlus algorithm (an improvement over the HyperLogLog algorithm):

  /**
   * Aggregate function: returns the approximate number of distinct items in a group.
   *
   * @param rsd maximum relative standard deviation allowed (default = 0.05)
   *
   * @group agg_funcs
   * @since 2.1.0
   */
  def approx_count_distinct(e: Column, rsd: Double): Column = withAggregateFunction {
    HyperLogLogPlusPlus(e.expr, rsd, 0, 0)
  }
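To make the effect of rsd more concrete, here is a rough sketch of how the value passed to HyperLogLogPlusPlus appears to translate into a sketch size. The formula and the constants (1.106, 6-bit registers, 10 registers per 64-bit word, a minimum of 4 addressing bits) are my reading of Spark's HyperLogLogPlusPlusHelper class and should be treated as assumptions to check against the Spark version you run. The function name hll_sizing is made up for this illustration.

```python
import math

# Assumed constants, based on my reading of Spark's HyperLogLogPlusPlusHelper.
REGISTER_BITS = 6                          # bits stored per register
REGISTERS_PER_WORD = 64 // REGISTER_BITS   # 10 registers packed into each 64-bit word


def hll_sizing(rsd):
    """Estimate the HLL++ sketch size implied by a given rsd (hypothetical helper)."""
    # Number of addressing bits p: the smallest p with 1.106 / sqrt(2^p) <= rsd.
    p = math.ceil(2.0 * math.log(1.106 / rsd) / math.log(2.0))
    if p < 4:
        raise ValueError("rsd too large: at least 4 addressing bits are required")
    m = 1 << p                                  # number of registers
    num_words = m // REGISTERS_PER_WORD + 1     # 64-bit words in the aggregation buffer
    return {
        "p": p,
        "registers": m,
        "buffer_bytes": num_words * 8,
        "achieved_rsd": 1.106 / math.sqrt(m),
    }


for rsd in (0.1, 0.05, 0.01):
    print(rsd, hll_sizing(rsd))
```

Under these assumptions the default rsd of 0.05 maps to 512 registers, while rsd = 0.01 needs 16384. Since the error shrinks only with the square root of the register count, halving rsd costs roughly four times the memory per group.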

Apache Spark's implementation of the HyperLogLogPlusPlus algorithm is based on the following papers (as of Spark v3.3.1, the current version at the time of writing):

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (the link was broken at the time of writing, but it is listed for completeness' sake)
HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm
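
To get a feel for why more registers mean a smaller error, here is a minimal sketch of the plain HyperLogLog estimator from the first paper. This is not Spark's code and it leaves out the HLL++ refinements (sparse representation, empirical bias correction). The function name hll_estimate, the use of SHA-1 as the hash, and the test sizes are choices made only for this illustration.

```python
import hashlib
import math


def hll_estimate(items, p):
    """Estimate the number of distinct items using 2^p registers (plain HyperLogLog)."""
    if p < 7:
        raise ValueError("this sketch uses the alpha formula that is valid for p >= 7")
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item (SHA-1 truncated to 8 bytes, for illustration only).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                    # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


n = 200_000
for p in (9, 14):
    est = hll_estimate(range(n), p)
    print(f"p={p:2d} registers={1 << p:6d} estimate={est:10.0f} "
          f"relative error={abs(est - n) / n:.4f}")
```

Each register only remembers the longest run of leading zero bits it has seen, so the memory used depends on the number of registers and not on the number of rows. The standard error of this estimator is about 1.04 / sqrt(m), which is the quantity rsd lets you choose.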
